Does append load the entire dataset into memory just to append new data?

Based on the code below, every append loads all of the existing data into memory to check for duplicates and then rewrites the whole parquet dataset. When I do this from multiple threads for items that already hold 100k records, the task consumes 100% of memory for every single-record append.

Why not use fastparquet's write method with its append option (True / False / 'overwrite') to append the data? https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write

    try:
        if epochdate or ("datetime" in str(data.index.dtype) and
                         any(data.index.nanosecond) > 0):
            data = utils.datetime_to_int64(data)
        old_index = dd.read_parquet(self._item_path(item, as_string=True),
                                    columns=[], engine=self.engine
                                    ).index.compute()
        data = data[~data.index.isin(old_index)]
    except Exception:
        return

    if data.empty:
        return

    if data.index.name == "":
        data.index.name = "index"

    # combine old dataframe with new
    current = self.item(item)
    new = dd.from_pandas(data, npartitions=1)
    combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

ranaroussi / pystore
