[Search] XOR Filters - Githubissues

JensUweUlrich commented 2 years ago

Question

Hi Seqan-Team,

I just recently read about XOR Filters and that they were superior to Bloom Filters in regards to memory usage and query time. Do you plan (or maybe already started) to implement the data structure for upcoming releases?

Thanks Jens

smehringer commented 2 years ago

Hi @JensUweUlrich,

Thanks for the heads up. AFAIK There is currently no plan to implement it. Could you paste the link to your reference that shows the superior performance? I'd be very interested to read it and maybe discuss if we can easily incorporate it in seqan3.

JensUweUlrich commented 2 years ago

Hi @smehringer

The paper describing XOR filters can be found here. I'm open for discussions on implementing them as an interleaved counterpart to IBFs if this would be interesting for you as well.

Best Jens

eseiler commented 2 years ago

There is also a nice, short explanation on Stack Overflow: https://stackoverflow.com/a/67527508 The answer also contains a link to some slides I haven't yet looked at.

My thoughts:

We have a pretty minimalistic hashing, XOR-Filter seem to assume more sophisticated hashing. This might affect the runtime.
I feel like efficiently interleaving blocks of length L is not trivial, and would require a bit of tinkering and ideally some SIMD.
Construction is kinda heuristic - but that's not really a big deal.
XOR-Filters are not mutable and require all to-be-inserted values to be known at construction. Personally, this is a deal-breaker, because I need a mutable index structure; and either storing all values to be used for the construction or re-computing them in case of a recursion sounds inefficient.

As always, it really depends on one's needs 😁 If the data is static, it might be worth a try.

JensUweUlrich commented 2 years ago

Thanks for your answer @eseiler

I have some applications in mind where all elements are known at the time of construction, e.g. trying to answer if a sequenced read is part of a given set of reference sequences. In such scenarios the XOR filter could be useful. I think I will give it a try and implement it. Cheers

smehringer commented 2 years ago

Hi @JensUweUlrich,

I think for the use case of a fixed reference set XOR filters seem like a good alternative. If you haven't set your mind on a design for implementing the XOR filters, it would be great you orient yourself at our seqan3::interleaved_bloom_filter. Then we could integrate the XOR filter easily if benchmarks show the superiority.

For example it would be nice if the following code still works if we just plug in ulrich::xor_filter for seqan3::interleaved_bloom_filter :)

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/dream_index/interleaved_bloom_filter.hpp>

int main()
{
    seqan3::interleaved_bloom_filter ibf{seqan3::bin_count{12u}, seqan3::bin_size{8192u}};
    ibf.emplace(126, seqan3::bin_index{0u});
    ibf.emplace(712, seqan3::bin_index{3u});
    ibf.emplace(237, seqan3::bin_index{9u});

    auto agent = ibf.membership_agent();
    auto & result = agent.bulk_contains(712);
    seqan3::debug_stream << result << '\n'; // prints [0,0,0,1,0,0,0,0,0,0,0,0]
}

If you have any questions or facing problems with this design I'd be happy to help.

JensUweUlrich commented 2 years ago

Hi @smehringer

that was my plan as well. Would be a mess to implement and not share it with the whole seqan community. Thanks for offering your help. I will get back to you as soon as I have a working implementation.

Cheers

JensUweUlrich commented 2 years ago

Hi @smehringer & @eseiler

I finished a first naive implementation of the interleaved XOR filter (IXF) and pushed it to my forked seqan3 repo (https://github.com/JensUweUlrich/seqan3). You can have a glance at the code and test it if you like to.

Some remarks:

As mentioned above, you should have knowledge of the number of bins and maximum elements per bin before creating an IXF object, then you can add the sets of keys
The construction time for the IXF is significantly longer than for the IBF, which is not suprising.
You can create the IXF with 8bit or 16bit fingerprints, where using the 8bit version leads to a false positive rate (FPR) of 0.0039.
Compared to an IBF with the same FPR, the IXF is significantly smaller in size (e.g. storing 100 bins each having 1,000,000 keys, results in 117 vs. 267 Mbytes)
The query performance for the IXF is worse than that of the IBF. This is expected since you can evaluate 64 bins in parallel for the IBF, but only 8 for the IXF with the 8bit version

At the end of the day, the IXF can be useful for bigger datasets and applications where low FPR is required, but query time is not crucial. Hope this is useful for some of you.

Cheers Jens

PS: I also have an experimental implementation of an interleaved Binary Fuse Filter. But it fails construction for bin numbers higher than 300, which is due to the intensive use of hashing and resetting of seeds for different bins. Maybe I will find some time to improve it in the near future.

smehringer commented 2 years ago

Nice work!

I have some questions:

For a fixed fingerprint, does the FPR vary based on your data and number of bins or is it always 0.0039? (S.t. I could increase speed by allowing a higher FPR)
How does the runtime scale with increasing number of bins? We notived with the IBF doesn't scale well so we are developing the hierarchical IBF (HIBF).
The query performance for the IXF is worse than that of the IBF. This is expected since you can evaluate 64 bins in parallel for the IBF, but only 8 for the IXF with the 8bit version

Can you use a 64 fingerprint?

At the end of the day, the IXF can be useful for bigger datasets and applications where low FPR is required, but query time is not crucial. Hope this is useful for some of you.

The reduction in size is indeed interesting. It's about half of that of an IBF. Could you interpolate if this reduction would be expected to be independent of the number of bins and the data (it's kmer distribution)?

We currently reduce the size of the IBF with minimizers which works quite well. So I don't know if we are in need of XOR filters and if they can compete with tools like mantis if the performance is low.

JensUweUlrich commented 2 years ago

For a fixed fingerprint, does the FPR vary based on your data and number of bins or is it always 0.0039? (S.t. I could increase speed by allowing a higher FPR)

The FPR depends on the used fingerprint size. That said, the higher the fingerprint size, the lower the FPR. But it is independent on the number of bins

How does the runtime scale with increasing number of bins? We noticed with the IBF doesn't scale well so we are developing the hierarchical IBF (HIBF).

Same issues here. More bins lead to decreased performance. Would be nice to see the HIBF implementation, so I could think about a hierarchical IXF, too.

Can you use a 64 fingerprint?

Possible, would lead to a minimum FPR but you would lose parallel querying of bins.

Could you interpolate if this reduction would be expected to be independent of the number of bins and the data (it's kmer distribution)?

The reduction is independent of the number of bins or data.

We currently reduce the size of the IBF with minimizers which works quite well

Downsampling or subsampling of entries by using minimizer or syncmers is indeed a good approach but there's definitely a limit for that ( 20% of k-mers if I remember correctly). And as I said, for really large sets of references, the IXF could be an alternative. BTW: Have you already benchmarked the IBF to mantis?

eseiler commented 2 years ago

How does the runtime scale with increasing number of bins? We noticed with the IBF doesn't scale well so we are developing the hierarchical IBF (HIBF).

Same issues here. More bins lead to decreased performance. Would be nice to see the HIBF implementation, so I could think about a hierarchical IXF, too.

Regarding runtime, I had a glance at the code, and in a few (> 1) places you do something like

for (std::vector<...> element : vector_of_vector)

Maybe it was auto element, still same issue. This should be

for (std::vector<...> const & element : vector_of_vector)

Otherwise, you'll copy the vectors, but you only read (not write) them; so no copy needed.

But now I'm kinda hooked, maybe I'll get around to making the code seqan3 style (you have some code from the original implementation, and they are kinda c-style c++).

JensUweUlrich commented 2 years ago

To be honest, I did not invest too much time in code optimization. I'm sure you will find some places in the code where you can tweak performance. But theoretically, the IXF can never reach faster queries than the IBF with more than 8 bins. But if you find it useful or have some ideas for improvement, we can discuss further. I'm happy to help you make seqan even better ;-)

seqan / seqan3

[Search] XOR Filters #2925

Question