Improves data store handling by
1) not running data reading/writing steps on remote workers. Running them remotely can slow down targets because the relevant data must be transferred to the workers executing those targets, and for steps that read the data store that means shipping all of the parquet files, I think.
2) relying more on arrow's database-like connections rather than collect()ing data into dataframes early. When writing to a partitioned data store, only the most recent partition needs to be pulled into memory and then overwritten.