hashlookup / poppy

Rust implementation of the DCSO Bloom filter
BSD 3-Clause "New" or "Revised" License

Creating filters, large files and memory usage #6

Closed adulau closed 1 month ago

adulau commented 2 months ago

While running the following command to create a Poppy filter from a file of 146GB:

./poppy -v -j 2 create -p 0.001 rockyou2024.pop rockyou2024.txt

Memory is exhausted and the process is killed by the kernel:

Out of memory: Killed process 2467222 (poppy) total-vm:17534144kB, anon-rss:14868572kB, file-rss:0kB, shmem-rss:0kB, UID:1006 pgtables:34088kB oom_score_adj:0

It seems the issue is larger than just the counting part. Creating a Poppy filter with the following parameters kills a 16GB server:

./poppy create -c 9948575739 -p 0.001 rockyou2024.pop

I bet it's not possible to create a Bloom filter on a system with less memory than the total filter size.
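For reference, if the filter is sized with the standard formula m = -n · ln(p) / (ln 2)² (an assumption; I haven't checked that poppy uses exactly this), the parameters above already imply a bit array of roughly 16.7 GiB, which lines up with the total-vm figure in the OOM log. A quick check:

```rust
// Sketch: estimate the in-memory size of a classic Bloom filter sized
// with m = -n * ln(p) / (ln 2)^2 (the usual optimal formula; whether
// poppy uses exactly this is an assumption).
fn main() {
    let n: f64 = 9_948_575_739.0; // entries requested with -c
    let p: f64 = 0.001;           // false-positive rate from -p
    let m_bits = -n * p.ln() / (2f64.ln().powi(2));
    let gib = m_bits / 8.0 / 1024f64.powi(3);
    println!("~{:.0} bits, ~{:.1} GiB", m_bits, gib);
    // prints roughly 143 billion bits, ~16.7 GiB -- more than a 16GB server can hold
}
```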

qjerome commented 1 month ago

Yes, I think the issue comes from poppy eating up all the RAM. The same problem would arise with any data structure that grows too large in memory.

It is very likely possible to find a solution to this issue but probably at the cost of a significant performance drop.
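For example, a slice-by-slice build would keep only a window of the bit array in RAM and re-read the input once per window, so memory drops to m / k at the cost of k passes over the input. A rough sketch of the idea (not poppy's actual code; the constants, file names and hash scheme are placeholders):

```rust
// Hypothetical slice-by-slice Bloom filter construction: trade extra
// passes over the input for holding only 1/SLICES of the bits in RAM.
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, BufReader, Write};

const M_BITS: u64 = 1 << 20; // total filter size in bits (toy value)
const K_HASHES: u64 = 7;     // number of hash functions
const SLICES: u64 = 4;       // passes over input; RAM use is M_BITS / SLICES

fn bit_index(line: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h); // seed the hasher to derive K_HASHES functions
    line.hash(&mut h);
    h.finish() % M_BITS
}

fn main() -> io::Result<()> {
    let mut out = File::create("filter.bin")?;
    let slice_bits = M_BITS / SLICES;
    for s in 0..SLICES {
        let (lo, hi) = (s * slice_bits, (s + 1) * slice_bits);
        let mut bits = vec![0u8; (slice_bits / 8) as usize]; // only this slice in RAM
        // one full pass over the input per slice
        for line in BufReader::new(File::open("rockyou2024.txt")?).lines() {
            let line = line?;
            for seed in 0..K_HASHES {
                let i = bit_index(&line, seed);
                if i >= lo && i < hi {
                    let off = i - lo;
                    bits[(off / 8) as usize] |= 1u8 << (off % 8);
                }
            }
        }
        out.write_all(&bits)?; // append the finished slice to disk
    }
    Ok(())
}
```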

adulau commented 1 month ago

I think the issue lies rather in the generic command `poppy -j 0 create -p 0.001 /path/to/output/filter.pop /path/to/dataset/*.txt`, as it seems to load the complete source file instead of streaming it line by line.
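If that is the case, a buffered line-by-line read would keep memory flat regardless of input size. A minimal sketch (not poppy's internals; `stream_entries` and the callback are just illustrative):

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

// Sketch: stream a dataset line by line instead of loading it whole.
// `insert` stands in for whatever per-entry work the filter does.
fn stream_entries(path: &str, mut insert: impl FnMut(&str)) -> io::Result<()> {
    let reader = BufReader::new(File::open(path)?);
    for line in reader.lines() {
        insert(&line?); // each line is dropped after insertion; RAM stays flat
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut count = 0u64;
    stream_entries("rockyou2024.txt", |_entry| count += 1)?;
    println!("{count} entries streamed");
    Ok(())
}
```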