rom1504 / embedding-reader

Efficiently read embedding in streaming from any filesystem
MIT License
92 stars, 19 forks

Numpy parquet faster #21

Closed rom1504 closed 2 years ago

rom1504 commented 2 years ago

Faster, but didn't solve the memleak.

rom1504 commented 2 years ago

This did solve the memleak, but the current mutex + dict implementation is complex and not very reliable. Instead, handle this in the loader function of the iteration: prepare the file -> table mapping in advance, and clean up tables when we no longer need them, using reference counting.
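A minimal sketch of the reference-counting idea, not the code from this PR. The `Piece` tuple, `iterate_pieces`, and the injected `load_table` callable are all hypothetical names chosen for illustration. The sketch also assumes pieces are consumed in order from a single thread, which is what makes the mutex unnecessary. Only the per-file counts are computed in advance here; the tables themselves are loaded on first use.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical piece description: (file path, first row, one past last row).
Piece = Tuple[str, int, int]


def iterate_pieces(
    pieces: Iterable[Piece],
    load_table: Callable[[str], object],
) -> Iterator[Tuple[Piece, object]]:
    """Yield (piece, table) pairs, dropping each table after its last piece."""
    pieces = list(pieces)
    # Computed in advance: how many pieces still need each file.
    remaining = Counter(path for path, _, _ in pieces)
    tables: Dict[str, object] = {}
    for piece in pieces:
        path = piece[0]
        if path not in tables:
            # First piece that touches this file: load it once.
            tables[path] = load_table(path)
        yield piece, tables[path]
        remaining[path] -= 1
        if remaining[path] == 0:
            # No later piece reads this file, so release the table.
            del tables[path]
```

Because the counts come from the piece list itself, a table is held only between the first and last piece that reference its file.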

rom1504 commented 2 years ago

That, or remove the parallelism completely from the parquet reading and instead do a simpler loader that uses the precomputed pieces to know which slices to read. Worth benchmarking a few solutions for the parquet reading alone.
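A rough sketch of what such a sequential loader could look like, under stated assumptions: `compute_pieces`, `read_sequentially`, and the `read_slice` callable are hypothetical names, and the per-file row counts are assumed to be known up front (for parquet they could come from file metadata).

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

# Hypothetical piece description: (file path, first row, one past last row).
Piece = Tuple[str, int, int]


def compute_pieces(
    file_counts: Sequence[Tuple[str, int]], batch_size: int
) -> List[Piece]:
    """Split every file into consecutive row ranges of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    pieces: List[Piece] = []
    for path, count in file_counts:
        for start in range(0, count, batch_size):
            pieces.append((path, start, min(start + batch_size, count)))
    return pieces


def read_sequentially(
    pieces: Iterable[Piece],
    read_slice: Callable[[str, int, int], object],
) -> Iterator[object]:
    """Read one slice at a time, in order, with no threads or locks."""
    for path, start, end in pieces:
        yield read_slice(path, start, end)
```

With pyarrow, `read_slice` could wrap something like `table.slice(start, end - start)` on an already loaded table; that wiring is left out here so the sketch stays independent of how files are opened.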

Veldrovive commented 2 years ago

It might also be valuable to take into account parallelization further down the line, such as torch dataset workers. Maybe sequential parquet reading would be fine in that case.
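One way this could play out, as a sketch only: if each dataset worker takes a disjoint share of the pieces, every worker can read its share sequentially and the parallelism comes from the workers themselves. `pieces_for_worker` is a hypothetical helper; in a torch `IterableDataset`, the `worker_id` and `num_workers` values would typically come from `torch.utils.data.get_worker_info()`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pieces_for_worker(
    pieces: Sequence[T], worker_id: int, num_workers: int
) -> List[T]:
    """Return the round-robin share of pieces owned by one worker."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must satisfy 0 <= worker_id < num_workers")
    # Worker k takes pieces k, k + num_workers, k + 2 * num_workers, ...
    return list(pieces[worker_id::num_workers])
```

Round-robin assignment keeps the shares within one piece of each other in size, but it also means several workers may read from the same file, which is something a benchmark would need to cover.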

rom1504 commented 2 years ago

Done in another PR.