seqan / hibf

HIBF and IBF
https://docs.seqan.de/hibf

IBF and HIBF #102

Closed lutfia95 closed 1 year ago

lutfia95 commented 1 year ago

Hi Seqan-Team,

I have tested the ability of the IBF to store many small protein sequences. I use the seqan3::aa27 alphabet; the inserted sequences are not very long (at most 13556 amino acids), but the number of bins is relatively large (683337, or 34731189 sequences for the RVDB). Since I am classifying peptide sequences, I need to insert the sequences into 683337 (or 34731189 for the RVDB) bins. The computed bin size is about 55.31 KiB, corresponding to the longest sequence, since every bin has the same size. With this many bins, the memory footprint is very high. Do you recommend using the HIBF in this case? I am not actually interested in where the k-mers occur (as in an FM-index); the only thing I want is to classify each peptide into a specific bin. I thought about using minimizers, but I don't think that would solve the problem.

Best, Ahmad

smehringer commented 1 year ago

Hi @lutfia95,

Thanks for reaching out!

Indeed, the HIBF might be better suited when dealing with many bins. But whether the HIBF can reduce the memory footprint depends on the size distribution of your input: if it is skewed (max and min sizes differ greatly), chances are good; in the worst case (all sequences equal in size), it could double the space requirement. It will speed up your queries either way.

You can try out the HIBF here: https://github.com/seqan/hibf

The HIBF library is not released yet, but we think it is already ready to use. Please get back to us with any questions, remarks, or bugs.

The HIBF needs to compute a layout. For the number of bins you have, we recommend setting the config option seqan::hibf::config::disable_rearrangement to true. Rearrangement can improve the layout, since similar input sequences may be merged into the same bins, leading to smaller total sizes, but it currently doesn't scale well to many user bins.
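For example (a fragment, not a complete program; field names follow the snippets further down this thread, and `my_input_fn` is a hypothetical stand-in for your own data callback):

```cpp
// Sketch only: disable rearrangement for very large numbers of user bins.
seqan::hibf::config config{.input_fn = my_input_fn,           // your data callback
                           .number_of_user_bins = 683'337u,
                           .disable_rearrangement = true};    // recommended for many user bins
```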

Best, Svenja

lutfia95 commented 1 year ago

Hi Svenja,

thank you very much for your reply. I will check the HIBF implementation and look into how I can use it. Is it also possible to use the same bin-insertion concept with the HIBF? With the IBF, we insert the sequences into a specific bin, e.g. `ibf.emplace(value, seqan3::bin_index{4u});`. I would like to avoid round-tripping the sequences through files: currently the sequences are read from file and then processed in large numbers, and I want to skip writing them to file and reading them back for HIBF insertion by inserting them into the HIBF directly, which would reduce both run time and memory usage.

Best, Ahmad

smehringer commented 1 year ago

Hi Ahmad,

@eseiler will hopefully have time to add some example snippets tomorrow to ease the transition.

The bad news: The construction is different from the IBF in Seqan3. And it's a little more complex.

The good news: You should have similar flexibility and you are definitely not required to provide files.

Best, Svenja

lutfia95 commented 1 year ago

Hi Svenja,

the bad news doesn't sound that bad, so I will go with the good news ;)

P.S: I will close the issue as answered!

Best, Ahmad

smehringer commented 1 year ago

You can try something along these lines:

#include <cstdint>  // for uint64_t
#include <iostream> // for operator<<, basic_ostream, char_traits, cout
#include <vector>   // for vector

#include <hibf/config.hpp>                                // for config, insert_iterator
#include <hibf/hierarchical_interleaved_bloom_filter.hpp> // for hierarchical_interleaved_bloom_filter

#include <seqan3/alphabet/aminoacid/aa27.hpp> // for aa27
#include <seqan3/search/views/kmer_hash.hpp>  // for views::kmer_hash, ungapped

int main()
{
    // One entry per user bin; fill in your own sequences here.
    std::vector<std::vector<seqan3::aa27>> protein_sequences{...};

    // Any adapter whose values are convertible to uint64_t works, e.g.:
    auto hash_me = seqan3::views::kmer_hash(seqan3::ungapped{5u});

    auto get_user_bin_data = [&](size_t const user_bin_id, seqan::hibf::insert_iterator it)
    {
        for (uint64_t hash : protein_sequences[user_bin_id] | hash_me)
            it = hash;
    };

    seqan::hibf::config config{.input_fn = get_user_bin_data,       // required
                               .number_of_user_bins = 3u,           // required
                               .number_of_hash_functions = 2u,      // recommended
                               .maximum_false_positive_rate = 0.05, // what suits your data?
                               .threads = 1u};

    seqan::hibf::hierarchical_interleaved_bloom_filter hibf{config};

    // Now we can search for some query; suppose we already have the hash values:
    std::vector<uint64_t> query1{3u, 9u, 12u, 14u};

    auto agent = hibf.membership_agent();
    auto & result1 = agent.membership_for(query1, 2u); // Note that we bind the result with a `&` to avoid copies!

    for (uint64_t hit_user_bin : result1)
        std::cout << hit_user_bin << ' ';
    std::cout << '\n';
}

lutfia95 commented 1 year ago

Hey Svenja,

thank you very much for providing the example! I have a couple of questions in between:

* Regarding the hashing adapter: can I use the usual one (e.g. `auto hash_me = seqan3::views::kmer_hash(seqan3::ungapped{kmer_size});`)?
* Should the sequences be inserted into the `std::vector`, or can I insert them during another loop in the user bin? Storing them all inside the `std::vector` might be RAM-costly in case of many sequences.
* If not, can I actually use the HIBF in a case such as having a vector of vectors? E.g. `std::vector<std::vector<std::vector<seqan3::aa27>>> protein_sequences{{"AAH"_aa27, "AHY"_aa27}, {"LNA"_aa27, "NAL"_aa27, "ALN"_aa27}};`, where `{"AAH"_aa27, "AHY"_aa27}` is the first user bin and `{"LNA"_aa27, "NAL"_aa27, "ALN"_aa27}` is the second. These are the k-mers, already processed and ready to be inserted into the filter.

Best, Ahmad

smehringer commented 1 year ago

> Regarding the hashing adapter: can I use the usual one (e.g. `auto hash_me = seqan3::views::kmer_hash(seqan3::ungapped{kmer_size});`)?

Yes. Whatever worked for you before with the IBF. The only hard requirement we have is that the hash values are convertible to `uint64_t` (which is the case for `kmer_hash`).

> Should the sequences be inserted into the `std::vector`, or can I insert them during another loop in the user bin? Storing them all inside the `std::vector` might be RAM-costly in case of many sequences.

No! You do not need to store them in a vector. Good question, though; I'll keep it in mind for the documentation. Basically, you can do whatever you want inside the lambda. You just need to know which user bin / protein sequence `user_bin_id` refers to, and then assign whatever values you deem correct (convertible to `uint64_t`) into the HIBF via `it = value` / `it = hash`. You could even read a file within the body of the lambda. If you need help, you can explain your use case to me.

> If not, can I actually use the HIBF in a case such as having a vector of vectors? E.g. `std::vector<std::vector<std::vector<seqan3::aa27>>> protein_sequences{{"AAH"_aa27, "AHY"_aa27}, {"LNA"_aa27, "NAL"_aa27, "ALN"_aa27}};`, where `{"AAH"_aa27, "AHY"_aa27}` is the first user bin and `{"LNA"_aa27, "NAL"_aa27, "ALN"_aa27}` is the second. These are the k-mers, already processed and ready to be inserted into the filter.

See above. It just needs some thinking about the lambda design.


Best, Svenja

eseiler commented 1 year ago

I moved this issue to the hibf repo. I reopened it, so we can discuss your use case until we find a solution that works.

We also added a verbose snippet to the README, which might help with understanding the general idea. (The verbose snippet will be swapped for a shorter one once we have documentation to link to for detailed explanations.)

And we are in the process of adding proper documentation :)

lutfia95 commented 1 year ago

@smehringer @eseiler thank you very much, I will test it this evening and let you know if I need further support.

Best, Ahmad

lutfia95 commented 1 year ago

Hey @smehringer and @eseiler,

Here is a short example of my use case. Inside the lambda function I generate the k-mers together with their modified versions; the iterator then receives the hash value of each k-mer (either original or modified). The snippet worked well for a small FASTA file (a couple of protein sequences), and I now want to test it in different scenarios. It would be great if you could give me more recommendations on how to improve the build/insert part.

using traits_type = seqan3::sequence_file_input_default_traits_aa;
seqan3::sequence_file_input<traits_type> fin{inputFastaFile};
std::vector<seqan3::aa27_vector> user_bins;

for (auto & record : fin)
{
    user_bins.push_back(record.sequence());
}

auto my_input = [&](size_t const user_bin_id, seqan::hibf::insert_iterator it)
{
    auto const & seq = user_bins[user_bin_id];
    std::string sequence = "";

    for (auto const & c : seq)
        sequence += seqan3::to_char(c);

    if (sequence.length() < kmerSize) // guard against unsigned underflow below
        return;

    const unsigned numberOfKMers = (sequence.length() - kmerSize + 1);

    for (unsigned i = 0; i < numberOfKMers; i++)
    {

        const std::string kmer = sequence.substr(i, kmerSize);
        TotalAcceptedPM possibleMutations = this->mutationConf(kmer, threshold);

        for (const auto & mutatedKmer : possibleMutations.kmers)
        {
            // convert std::string to seqan3::aa27_vector
            seqan3::aa27_vector kmerHolder{};
            for (const char & c : mutatedKmer.first)
            {

                seqan3::aa27 aminoAcid = {};
                aminoAcid.assign_char(c);
                kmerHolder.push_back(aminoAcid);
            }
            auto hashValue = this->hashKmer(kmerHolder, kmerSize);
            it = hashValue;
        }
    }
};

seqan::hibf::config config{.input_fn = my_input,                    // required
                           .number_of_user_bins = user_bins.size(), // required
                           .number_of_hash_functions = 2,
                           .maximum_false_positive_rate = 0.01, // recommended to adapt
                           .threads = threads,                  // recommended to adapt
                           .sketch_bits = 12,
                           .tmax = 0, // triggers default computation
                           .alpha = 1.2,
                           .max_rearrangement_ratio = 0.5,
                           .disable_estimate_union = false,
                           .disable_rearrangement = false};

seqan::hibf::hierarchical_interleaved_bloom_filter hibf{config};

Best, Ahmad

eseiler commented 1 year ago

From the HIBF side, this looks good.

From the application side, you might want to reduce the number of copies (aa27 -> string -> aa27), but this depends on what your functions look like.
For example,

if mutationConf takes a std::string_view, you could use `const std::string_view kmer = std::string_view{sequence}.substr(i, kmerSize);`, which doesn't copy.


for (const auto & mutatedKmer : possibleMutations.kmers)
{
    // convert std::string to seqan3::aa27_vector
    seqan3::aa27_vector kmerHolder{};
    for (const char & c : mutatedKmer.first)
    {

        seqan3::aa27 aminoAcid = {};
        aminoAcid.assign_char(c);
        kmerHolder.push_back(aminoAcid);
    }
    auto hashValue = this->hashKmer(kmerHolder, kmerSize);
    it = hashValue;
}

could maybe use views (https://docs.seqan.de/seqan3/3.3.0/cookbook.html#cookbook_convert_alphabet_range)

for (const auto & mutatedKmer : possibleMutations.kmers)
{
    auto sequence_view = mutatedKmer.first | std::views::transform(
       [](const char in)
       {
           return static_cast<seqan3::aa27>(in);
       });
    auto hashValue = this->hashKmer(sequence_view, kmerSize);
    it = hashValue;
}

which would also avoid copies, but requires hashKmer to also accept a view.


Depending on what else you do, it might also be possible to work entirely with either aa27 or char so that you don't have to convert at all. seqan3::sequence_file_input can be told to, e.g., parse the input as char while still checking that the characters are valid aa27.

lutfia95 commented 1 year ago

Hey @eseiler,

thank you very much, I already changed to string_view. Is the multithreading part of the HIBF only for the insertion/construction, or also for the lambda function? I will run the tests on the complete protein sequence sets today.

Best, Ahmad

eseiler commented 1 year ago

The config::input_fn may be called in parallel (by config::threads many threads), and it will be called twice for each user bin: once for the sketch/layout computation and once for the insertion into the HIBF.

lutfia95 commented 1 year ago

Hey @eseiler,

I think the memory usage in my case was very high due to a wrong parameter selection for the HIBF configuration. I have already tested some scenarios: `k:3` generates 164761541 k-mers from 683337 sequences (refSeqViral); for `k:4` it is 164078204 k-mers, etc. The memory usage for `k:3` was about 78 GB with two hash functions. Another test with only the IBF showed good results regarding memory usage. It can't be that the IBF uses less memory than the HIBF, as my user bins differ in size. I will test other parameters for the HIBF configuration and let you know how the chosen parameters affect the build/query time.

lutfia95 commented 1 year ago

Hey,

I was able to solve the problem by splitting the input reference sequences into parts: instead of building one HIBF over a reference set with 683337 samples, I can use 6 HIBFs. I will close the issue. Best.

smehringer commented 1 year ago

Hi @lutfia95,

thanks for letting us know. So the main issue was that the HIBF took too much memory, and that is why you split up the reference? We might be working on a partitioned HIBF, so I'm asking out of interest.

> It can't be that the IBF uses less memory than the HIBF, as my user bins differ in size.

I would agree, but I cannot say for sure.

From my experiments, tmax has the highest impact on the size of the HIBF. What was your tmax?

Svenja

lutfia95 commented 1 year ago

Hi @smehringer,

memory usage was really high with rearrangement and union estimation enabled (disable_estimate_union, disable_rearrangement); this problem was solved by disabling both. The second problem is the layout computation time: with the set of 683337 user bins, the layout computation took very long. I didn't really have much time to inspect the complete implementation. I was able to solve this problem by splitting the reference set into parts; first I tested different scenarios:

1. 100000 user bins: 30 threads, 71.861 minutes with tmax = 320.
2. 200000 user bins: 30 threads, 6 hours with tmax = 448.
3. 300000 user bins: 30 threads, 11 hours with tmax = 576.
4. 350000 user bins: 30 threads, 20 hours with tmax = 640.

Increasing the number of user bins will of course increase tmax, which in turn increases the layout computation time (the sketches are fast). I will run the main tests in a couple of days and send you all tested scenarios.

Ahmad