rom1504 / embedding-reader

Efficiently read embedding in streaming from any filesystem
MIT License

[Fix/new feature] Add max_ram_usage_in_bytes parameter for better control of ram usage #30

Closed victor-paltz closed 2 years ago

victor-paltz commented 2 years ago

When reading a 24GB batch of embeddings from 250MB embedding files, the program's memory usage could peak at 125GB. With this commit, we can now control the RAM usage, and we never use more RAM than necessary when the embedding files are big.
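For context, a minimal usage sketch of the new parameter; it assumes max_ram_usage_in_bytes is accepted on the reader call, and the folder path and consumer are illustrative:

```python
# Minimal sketch of the new parameter; the folder path is illustrative and
# the exact signature may differ from the merged code.
from embedding_reader import EmbeddingReader

reader = EmbeddingReader("embeddings_folder", file_format="npy")

for batch, metadata in reader(
    batch_size=10**6,
    start=0,
    end=reader.count,
    max_ram_usage_in_bytes=16 * 2**30,  # cap peak RAM at a 16GB budget
):
    pass  # consume the batch here
```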

rom1504 commented 2 years ago

I think doing https://github.com/rom1504/embedding-reader/pull/25 for the parquet reader will improve memory usage significantly

victor-paltz commented 2 years ago

> I think doing #25 for the parquet reader will improve memory usage significantly

Ah nice! This PR avoids loading the same file twice if two pieces are in the same file 😎

If we were unlucky though, we would still have bad memory usage if all the pieces were coming from different files (even if that is less likely). So I think the best thing would still be to increase the piece size to be bigger than the maximum number of embeddings in the biggest file. WDYT?

What is the advantage of having piece size < max file size? With bigger pieces we wouldn't even need to bother reading the same file twice. Is it because with memory mapping on disk the cost is zero, and loading a huge file would have broken things?
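For illustration, a toy sketch of the piece/file relationship being discussed; this is not the library's actual planning code, and all names are made up:

```python
# Toy illustration of piece planning: each file's embedding range is cut into
# pieces of at most piece_size rows, so a piece_size smaller than a file means
# several pieces target the same file.
def build_pieces(file_counts, piece_size):
    """file_counts: list of (filename, embedding_count) pairs."""
    pieces = []
    for filename, count in file_counts:
        for start in range(0, count, piece_size):
            pieces.append((filename, start, min(start + piece_size, count)))
    return pieces

# Two 1000-embedding files with piece_size=400 -> 3 pieces per file, which is
# where the "loading the same file several times" concern comes from.
print(build_pieces([("emb0.npy", 1000), ("emb1.npy", 1000)], 400))
```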

rom1504 commented 2 years ago

For the numpy format and remote files, range requests in HTTP (and similar mechanisms in S3 and HDFS) allow fetching many parts of a file in parallel, which makes reading much faster. For parquet there is indeed much less advantage, as the smallest part that can actually be requested is a row group.
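A rough sketch of that parallel range-read idea, using fsspec (which embedding-reader builds on); the URL, offsets, and helper name are all illustrative:

```python
# Sketch of parallel range reads on a remote numpy file via fsspec; for
# http(s) URLs each seek+read is served as an HTTP range request when the
# server supports ranges.
from concurrent.futures import ThreadPoolExecutor

import fsspec

def read_range(url, offset, length):
    with fsspec.open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)

url = "https://example.org/embeddings/img_emb_0.npy"  # illustrative
ranges = [(128 + i * 10_000_000, 10_000_000) for i in range(8)]  # 8 pieces
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda r: read_range(url, *r), ranges))
```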

https://github.com/rom1504/embedding-reader/pull/25 makes it possible for parquet to still be reasonably fast, but maybe a sequential reader for parquet would be just as fast; I'm not sure, it would be interesting to benchmark

anyway I will check your PR

rom1504 commented 2 years ago

can you check if this is as fast as before? (for example by running https://github.com/rom1504/embedding-reader/blob/main/examples/example_numpy_parquet.py)

victor-paltz commented 2 years ago

> can you check if this is as fast as before? (for example by running https://github.com/rom1504/embedding-reader/blob/main/examples/example_numpy_parquet.py)

I ran the code with example_parquet.py (so only the parquet reader); here are my results:

Configuration:

- Previous version: git checkout 8630c739765558692ce73773cc189036c7c7819f
  - Nb pieces: 101000
- New version: git checkout f2bc69a98b689c8d1777c59ffb0172ffa046f8ad
  - parquet reader with max_piece_size = max(50MB, batch_size, 1)

Conclusion

Forcing a piece size of batch_size doesn't give any uplift and is less efficient when using a numpy reader. As a consequence, it is better to keep the same code for everyone and use max_piece_size = 50MB rather than max(50MB, batch_size).
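As a sketch, the rule this conclusion points to (the helper name is illustrative; 50MB is the constant from this thread):

```python
# Sketch of the piece-size rule from the conclusion: keep pieces at ~50MB
# regardless of batch_size, expressed as a number of embeddings.
def compute_piece_size_in_embeddings(bytes_per_embedding, piece_bytes=50 * 2**20):
    return max(piece_bytes // bytes_per_embedding, 1)

# e.g. 768-dim float32 embeddings: 3072 bytes each -> ~17k embeddings per piece
print(compute_piece_size_in_embeddings(768 * 4))
```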

hitchhicker commented 2 years ago

LGTM, could we merge it? @rom1504

rom1504 commented 2 years ago

Sounds pretty good for the parquet reader. However, I would like to check numpy too, as it has different performance characteristics (much faster).

I'll try and check that tonight. If you have time, you can also run it; the numpy data in the example is public (and fsspec handles the http protocol)

rom1504 commented 2 years ago

sorry for the delay

so I'm trying this now. For numpy+parquet it works just the same as before (2M embeddings/s on 16 cores)

for numpy alone, 4M embeddings/s, same as before

it seems to be working

rom1504 commented 2 years ago

released as 1.5.0

rom1504 commented 2 years ago

I think it would still be worth it to do https://github.com/rom1504/embedding-reader/issues/26