ranaroussi / pystore

Fast data store for Pandas time-series data
Apache License 2.0

item.append doesn't pass npartitions to item.write #36

Open JugglingNumbers opened 4 years ago

JugglingNumbers commented 4 years ago

When calling item.append(item, new_data, npartitions=35), the write function is passed npartitions=None. It should be npartitions=npartitions:

https://github.com/ranaroussi/pystore/blob/40de1d51236fd6b6b88909c83dc6d7297de4b471/pystore/collection.py#L194-L196
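
For reference, a rough reproduction of the call path described above. The store path, collection, item name and data are made up for illustration; only write/append come from pystore's public API:

import pandas as pd
import pystore

# Hypothetical setup; names and data are illustrative only.
pystore.set_path("/tmp/pystore_demo")  # assumed scratch location
collection = pystore.store("demo").collection("demo_coll")

df = pd.DataFrame(
    {"price": range(100_000)},
    index=pd.date_range("2020-01-01", periods=100_000, freq="min"),
)

collection.write("DEMO", df.iloc[:50_000], overwrite=True)

# npartitions is accepted here, but per the report it is not forwarded,
# so the underlying write call ends up with npartitions=None.
collection.append("DEMO", df.iloc[50_000:], npartitions=35)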

ancher1912 commented 4 years ago

Hmm... changing this just results in a similar exception at line 182. I've changed it to:

new_npartitions = npartitions
if new_npartitions is None:
    # fall back to a size-based partition count when the caller didn't pass one
    memusage = data.memory_usage(deep=True).sum()
    new_npartitions = int(1 + memusage // DEFAULT_PARTITION_SIZE)

# combine old dataframe with new
current = self.item(item)
new = dd.from_pandas(data, npartitions=new_npartitions)

That seems to work at first glance.
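
For what it's worth, the fallback above is just a size-based default: divide the frame's in-memory footprint by DEFAULT_PARTITION_SIZE (a per-partition byte budget from pystore's config) and add one. A standalone sketch, with an assumed budget of roughly 99 MB:

import pandas as pd

DEFAULT_PARTITION_SIZE = 99_000_000  # assumed per-partition byte budget

data = pd.DataFrame({"x": range(1_000_000)})
memusage = data.memory_usage(deep=True).sum()
print(int(1 + memusage // DEFAULT_PARTITION_SIZE))  # 1 for a frame this small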

JugglingNumbers commented 4 years ago

@ancher1912 Yup, you've encountered the other append error: https://github.com/ranaroussi/pystore/issues/31

The other option is just to change the last line of your snippet to new = dd.from_pandas(data, npartitions=1).

Since the combined Dask dataframe is partitioned according to the npartitions variable anyway, it doesn't matter if we only use one partition when converting the new dataframe to Dask.
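
A small standalone Dask sketch of that point (not pystore code; dd.concat and repartition are used here only to illustrate): whatever partition count the intermediate frame has, the final repartition decides the layout.

import pandas as pd
import dask.dataframe as dd

old = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=4)
new = dd.from_pandas(
    pd.DataFrame({"x": range(1000, 2000)}, index=range(1000, 2000)),
    npartitions=1,
)

# However `new` was partitioned, the final repartition decides the layout.
combined = dd.concat([old, new]).repartition(npartitions=8)
print(combined.npartitions)  # 8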

ancher1912 commented 4 years ago

Yeah, you're right. I've sent a pull request to @ranaroussi with your proposed change. At least I can continue doing what I was doing before I updated Dask and FastParquet.