Open JensUweUlrich opened 2 years ago
Hi @JensUweUlrich,
Thanks for the heads up. AFAIK There is currently no plan to implement it. Could you paste the link to your reference that shows the superior performance? I'd be very interested to read it and maybe discuss if we can easily incorporate it in seqan3.
Hi @smehringer
The paper describing XOR filters can be found here. I'm open for discussions on implementing them as an interleaved counterpart to IBFs if this would be interesting for you as well.
Best Jens
There is also a nice, short explanation on Stack Overflow: https://stackoverflow.com/a/67527508 The answer also contains a link to some slides I haven't yet looked at.
My thoughts:
L
is not trivial, and would require a bit of tinkering and ideally some SIMD.As always, it really depends on one's needs 😁 If the data is static, it might be worth a try.
Thanks for your answer @eseiler
I have some applications in mind where all elements are known at the time of construction, e.g. trying to answer if a sequenced read is part of a given set of reference sequences. In such scenarios the XOR filter could be useful. I think I will give it a try and implement it. Cheers
Hi @JensUweUlrich,
I think for the use case of a fixed reference set XOR filters seem like a good alternative.
If you haven't set your mind on a design for implementing the XOR filters, it would be great you orient yourself at our seqan3::interleaved_bloom_filter
. Then we could integrate the XOR filter easily if benchmarks show the superiority.
For example it would be nice if the following code still works if we just plug in ulrich::xor_filter
for seqan3::interleaved_bloom_filter
:)
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/dream_index/interleaved_bloom_filter.hpp>
int main()
{
seqan3::interleaved_bloom_filter ibf{seqan3::bin_count{12u}, seqan3::bin_size{8192u}};
ibf.emplace(126, seqan3::bin_index{0u});
ibf.emplace(712, seqan3::bin_index{3u});
ibf.emplace(237, seqan3::bin_index{9u});
auto agent = ibf.membership_agent();
auto & result = agent.bulk_contains(712);
seqan3::debug_stream << result << '\n'; // prints [0,0,0,1,0,0,0,0,0,0,0,0]
}
If you have any questions or facing problems with this design I'd be happy to help.
Hi @smehringer
that was my plan as well. Would be a mess to implement and not share it with the whole seqan community. Thanks for offering your help. I will get back to you as soon as I have a working implementation.
Cheers
Hi @smehringer & @eseiler
I finished a first naive implementation of the interleaved XOR filter (IXF) and pushed it to my forked seqan3 repo (https://github.com/JensUweUlrich/seqan3). You can have a glance at the code and test it if you like to.
Some remarks:
At the end of the day, the IXF can be useful for bigger datasets and applications where low FPR is required, but query time is not crucial. Hope this is useful for some of you.
Cheers Jens
PS: I also have an experimental implementation of an interleaved Binary Fuse Filter. But it fails construction for bin numbers higher than 300, which is due to the intensive use of hashing and resetting of seeds for different bins. Maybe I will find some time to improve it in the near future.
Nice work!
I have some questions:
For a fixed fingerprint, does the FPR vary based on your data and number of bins or is it always 0.0039? (S.t. I could increase speed by allowing a higher FPR)
How does the runtime scale with increasing number of bins? We notived with the IBF doesn't scale well so we are developing the hierarchical IBF (HIBF).
The query performance for the IXF is worse than that of the IBF. This is expected since you can evaluate 64 bins in parallel for the IBF, but only 8 for the IXF with the 8bit version
Can you use a 64 fingerprint?
At the end of the day, the IXF can be useful for bigger datasets and applications where low FPR is required, but query time is not crucial. Hope this is useful for some of you.
The reduction in size is indeed interesting. It's about half of that of an IBF. Could you interpolate if this reduction would be expected to be independent of the number of bins and the data (it's kmer distribution)?
We currently reduce the size of the IBF with minimizers which works quite well. So I don't know if we are in need of XOR filters and if they can compete with tools like mantis if the performance is low.
For a fixed fingerprint, does the FPR vary based on your data and number of bins or is it always 0.0039? (S.t. I could increase speed by allowing a higher FPR)
The FPR depends on the used fingerprint size. That said, the higher the fingerprint size, the lower the FPR. But it is independent on the number of bins
How does the runtime scale with increasing number of bins? We noticed with the IBF doesn't scale well so we are developing the hierarchical IBF (HIBF).
Same issues here. More bins lead to decreased performance. Would be nice to see the HIBF implementation, so I could think about a hierarchical IXF, too.
Can you use a 64 fingerprint?
Possible, would lead to a minimum FPR but you would lose parallel querying of bins.
Could you interpolate if this reduction would be expected to be independent of the number of bins and the data (it's kmer distribution)?
The reduction is independent of the number of bins or data.
We currently reduce the size of the IBF with minimizers which works quite well
Downsampling or subsampling of entries by using minimizer or syncmers is indeed a good approach but there's definitely a limit for that ( 20% of k-mers if I remember correctly). And as I said, for really large sets of references, the IXF could be an alternative. BTW: Have you already benchmarked the IBF to mantis?
How does the runtime scale with increasing number of bins? We noticed with the IBF doesn't scale well so we are developing the hierarchical IBF (HIBF).
Same issues here. More bins lead to decreased performance. Would be nice to see the HIBF implementation, so I could think about a hierarchical IXF, too.
Regarding runtime, I had a glance at the code, and in a few (> 1) places you do something like
for (std::vector<...> element : vector_of_vector)
Maybe it was auto element
, still same issue.
This should be
for (std::vector<...> const & element : vector_of_vector)
Otherwise, you'll copy the vectors, but you only read (not write) them; so no copy needed.
But now I'm kinda hooked, maybe I'll get around to making the code seqan3 style (you have some code from the original implementation, and they are kinda c-style c++).
To be honest, I did not invest too much time in code optimization. I'm sure you will find some places in the code where you can tweak performance. But theoretically, the IXF can never reach faster queries than the IBF with more than 8 bins. But if you find it useful or have some ideas for improvement, we can discuss further. I'm happy to help you make seqan even better ;-)
Question
Hi Seqan-Team,
I just recently read about XOR Filters and that they were superior to Bloom Filters in regards to memory usage and query time. Do you plan (or maybe already started) to implement the data structure for upcoming releases?
Thanks Jens