rom1504 / embedding-reader

Efficiently read embedding in streaming from any filesystem
MIT License

[Fix/new feature] Add max_ram_usage_in_bytes parameter for better control of ram usage #30

Closed victor-paltz closed 2 years ago

victor-paltz commented 2 years ago

When reading a 24GB batch of embeddings from 250MB embedding files, the program's memory usage could peak at 125GB. With this commit, we can now control the RAM usage, and we never use more RAM than necessary when the embedding files are big.
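For context, a minimal usage sketch of the new parameter; it assumes max_ram_usage_in_bytes is accepted on the reader call, and the folder path and consumer are illustrative:

```python
# Minimal sketch of the new parameter; the folder path is illustrative and
# the exact signature may differ from the merged code.
from embedding_reader import EmbeddingReader

reader = EmbeddingReader("embeddings_folder", file_format="npy")

for batch, metadata in reader(
    batch_size=10**6,
    start=0,
    end=reader.count,
    max_ram_usage_in_bytes=16 * 2**30,  # cap peak RAM at a 16GB budget
):
    pass  # consume the batch here
```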

rom1504 commented 2 years ago

I think doing https://github.com/rom1504/embedding-reader/pull/25 for the parquet reader will improve memory usage significantly

victor-paltz commented 2 years ago

> I think doing #25 for the parquet reader will improve memory usage significantly

Ah nice! This PR avoids loading the same file twice if two pieces are in the same file 😎

If we were unlucky though, we would still have bad memory usage if all the pieces were coming from different files (even if that is less likely). So I think the best thing would still be to increase the piece size to be bigger than the maximum number of embeddings in the biggest file. WDYT?

What is the advantage of having piece size < max file size? With bigger pieces we wouldn't even need to bother reading the same file twice. Is it because with memory mapping on disk the cost is zero, and loading a huge file would have broken things?
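For illustration, a toy sketch of the piece/file relationship being discussed; this is not the library's actual planning code, and all names are made up:

```python
# Toy illustration of piece planning: each file's embedding range is cut into
# pieces of at most piece_size rows, so a piece_size smaller than a file means
# several pieces target the same file.
def build_pieces(file_counts, piece_size):
    """file_counts: list of (filename, embedding_count) pairs."""
    pieces = []
    for filename, count in file_counts:
        for start in range(0, count, piece_size):
            pieces.append((filename, start, min(start + piece_size, count)))
    return pieces

# Two 1000-embedding files with piece_size=400 -> 3 pieces per file, which is
# where the "loading the same file several times" concern comes from.
print(build_pieces([("emb0.npy", 1000), ("emb1.npy", 1000)], 400))
```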

rom1504 commented 2 years ago

For the numpy format and remote files, range requests in HTTP (and similar mechanisms in S3 and HDFS) allow fetching many parts of a file in parallel, which makes reading much faster. For parquet there is indeed much less advantage, as the smallest part that can actually be requested is a row group.
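A rough sketch of that parallel range-read idea, using fsspec (which embedding-reader builds on); the URL, offsets, and helper name are all illustrative:

```python
# Sketch of parallel range reads on a remote numpy file via fsspec; for
# http(s) URLs each seek+read is served as an HTTP range request when the
# server supports ranges.
from concurrent.futures import ThreadPoolExecutor

import fsspec

def read_range(url, offset, length):
    with fsspec.open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)

url = "https://example.org/embeddings/img_emb_0.npy"  # illustrative
ranges = [(128 + i * 10_000_000, 10_000_000) for i in range(8)]  # 8 pieces
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda r: read_range(url, *r), ranges))
```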

https://github.com/rom1504/embedding-reader/pull/25 makes it possible for parquet to still be reasonably fast, but maybe a sequential reader for parquet would be just as fast; I'm not sure, it would be interesting to benchmark

anyway I will check your PR

rom1504 commented 2 years ago

can you check if this is as fast as before? (for example by running https://github.com/rom1504/embedding-reader/blob/main/examples/example_numpy_parquet.py)

victor-paltz commented 2 years ago

> can you check if this is as fast as before? (for example by running https://github.com/rom1504/embedding-reader/blob/main/examples/example_numpy_parquet.py)

I ran the code with example_parquet.py (so only the parquet reader); here are my results:

Configuration:

- Previous version: git checkout 8630c739765558692ce73773cc189036c7c7819f
  - Nb pieces: 101000
- New version: git checkout f2bc69a98b689c8d1777c59ffb0172ffa046f8ad
  - parquet reader with max_piece_size = max(50MB, batch_size, 1)

Conclusion

Forcing a piece size of batch_size doesn't give any uplift and is less efficient when using a numpy reader. As a consequence, it is better to keep the same code for everyone and use max_piece_size = 50MB rather than max(50MB, batch_size).
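As a sketch, the rule this conclusion points to (the helper name is illustrative; 50MB is the constant from this thread):

```python
# Sketch of the piece-size rule from the conclusion: keep pieces at ~50MB
# regardless of batch_size, expressed as a number of embeddings.
def compute_piece_size_in_embeddings(bytes_per_embedding, piece_bytes=50 * 2**20):
    return max(piece_bytes // bytes_per_embedding, 1)

# e.g. 768-dim float32 embeddings: 3072 bytes each -> ~17k embeddings per piece
print(compute_piece_size_in_embeddings(768 * 4))
```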

hitchhicker commented 2 years ago

LGTM, could we merge it? @rom1504

rom1504 commented 2 years ago

Sounds pretty good for the parquet reader. However, I would like to check numpy too, as it has different performance characteristics (much faster).

I'll try and check that tonight. If you have time, you can also run it; the numpy data in the example is public (and fsspec handles the http protocol)

rom1504 commented 2 years ago

sorry for the delay

so I'm trying this now. For numpy+parquet it works just the same as before (2M embeddings/s on 16 cores)

for numpy alone, 4M embeddings/s, same as before

it seems to be working

rom1504 commented 2 years ago

released as 1.5.0

rom1504 commented 2 years ago

I think it would still be worth it to do https://github.com/rom1504/embedding-reader/issues/26