seqan / product_backlog

This repository is used as product backlog for all SeqAn relevant backlog items. This is intended to organise the work for the team.
2 stars 1 forks source link

[IBF] seqan3::binning_directory #36

Open rrahn opened 4 years ago

rrahn commented 4 years ago

Description

Build an IBF over the k-mers of the technical bins. Given this IBF allow to search for a query and report all bins that cover at least one k-mer of the query.

Additional resources: https://trello.com/c/Q8HFn4RO/64-protocol-of-the-meeting-09032020-missing https://docs.google.com/document/d/1LxDRXm4kLMuFYyRBnml4uZttOUIzu3eCsj78jfODz2g

Acceptance Criteria

Tasks

- [ ] task 1 - [ ] task 2 - [ ] task 3 ### Definition of Done - [ ] Implementation and design approved - [ ] Unit tests pass - [ ] Test coverage = 100% - [ ] Microbenchmarks added and/or affected microbenchmarks < 5% performance drop - [ ] API documentation added - [ ] Tutorial/teaching material added - [ ] Test suite compiles in less than 30 seconds (on travis) - [ ] Changelog entry added
eseiler commented 4 years ago

https://github.com/eseiler/minimizer_ibf/tree/seqan3

marehr commented 4 years ago

Some questions in the core meeting:

eseiler commented 4 years ago

Drawing.pdf

Usage Examples ### Construction ```cpp // (1) std::vector bin_files{...}; seqan3::binning_directory bd{bin_files, seqan3::bin_count{4u}, seqan3::bin_size{128u}, seqan3::hash_function_count{3}, seqan3::views::kmer_hash(seqan3::shape{seqan3::ungapped{3}})}; // (2) std::vector bin_files{...}; seqan3::interleaved_bloom_filter ibf{seqan3::bin_count{4u}, seqan3::bin_size{128u}, seqan3::hash_function_count{3}}; seqan3::binning_directory bd{bin_files, ibf, seqan3::views::kmer_hash(seqan3::shape{seqan3::ungapped{3}})}; // -> What about compression? // Passing parameters (1): Handle compression internally - construct uncompressed, then compress. // Passing IBF (2): Extract parameters, then same as above. // Possible template parameters: // sequence_file_input type // compression - can be deduced from ctor for (2) // view type - can be deduced from ctor for (1) and (2) ``` ### Query ```cpp std::vector counts_per_bin = bd.count(query); // Can we use search_result here? // size_t should be a template parameter? (thinking of SIMD stuff) ``` We actually write "report all bins that contains at least one kmer". Should this be part of the binning directory? ### Other * It should offer the same(?) interface as the IBF(`emplace`, `increase_bin_number_to`, ...) since the binning_directory. * After `increase_bin_number_to`, one might want to call `emplace` with a filepath to add all kmers to a bin. * We also may need a `clear` for a specific bin (missing in IBF). * Maybe also store the `file->bin` mapping.
Q/A * **Is this a data structure?** Yes, it encapsulates the IBF and handles the sequence I/O and kmer transformation. * **What should this do?** It takes the technical bins, e.g. 1 file for each bin, builds the IBF over these bins and answers bin membership for a query (count of found kmers of query per bin). * **How does the interface looks like?** See usage examples. * **What is the use case?** It is used to build the IBF over technical bins and return count vectors for a query. E.g., for ganon, this data structure already offers everything it needs. * **Is this private / public?** Public! * **We had some pictures, can we add them here?** Currently in the trello together with all the other meeting stuff. * **How does this play with "logical" bins?** We have technical and user bins. Technical bins is handled here, user bins is partitioning and not part of this.
Some rough code ```cpp // ----------------------------------------------------------------------------------------------------- // Copyright (c) 2006-2020, Knut Reinert & Freie Universität Berlin // Copyright (c) 2016-2020, Knut Reinert & MPI für molekulare Genetik // This file may be used, modified and/or redistributed under the terms of the 3-clause BSD-License // shipped with this file and also available at: https://github.com/seqan/seqan3/blob/master/LICENSE.md // ----------------------------------------------------------------------------------------------------- /*!\file * \author Enrico Seiler * \brief Provides seqan3::binning_directory. */ #pragma once #include namespace seqan3 { struct binning_directory_traits { using sequence_trait = sequence_file_input_default_traits_dna, using compression_trait = data_layout::uncompressed; using view_trait = ?; } template class binning_directory { private: static constexpr data_layout data_layout_mode = traits::compression_trait; interleaved_bloom_filter ibf; using sequence_file_input_t = sequence_file_input>; public: binning_directory() = default; binning_directory(binning_directory const &) = default; binning_directory & operator=(binning_directory const &) = default; binning_directory(binning_directory &&) = default; binning_directory & operator=(binning_directory &&) = default; ~binning_directory() = default; // Either we construct the IBF outside and pass it here, or pass all the parameters for the IBF... binning_directory(std::vector const & technical_bins interleaved_bloom_filter const & ibf_, kmer_transform && kmer_transform_view) : ibf(ibf_) { assert(ibf.bin_count <= technical_bins.size()); for (size_t i = 0; i < technical_bins.size(); ++i) { sequence_file_input_t fin{technical_bins[i]}; for (auto & [sequence] : fin) { for (auto && hash : sequence | traits::view_trait()) // How to construct view? ibf.emplace(hash, seqan3::bin_index{i}); } } } // search result? template alphabet convertible to sequence_file_input_t::alphabet std::vector search(t const & sequence) { std::vector result(ibf.bin_count(), 0); // Should offer same interface as IBF // I.e. either it returns reference to internal, mutable, result buffer or accepts input of a vector to write to // (for parallel stuff) for (auto && hash : sequence | traits::view_trait()) { auto & res ibf.bulk_contains(hash); size_t bin{0}; for (size_t batch = 0; batch < ((ibf.bin_count() + 63) >> 6); ++batch) { size_t tmp = res.get_int(batch * 64); if (tmp ^ (1ULL<<63)) { while (tmp > 0) { uint8_t step = sdsl::bits::lo(tmp); bin += step++; tmp >>= step; ++result[bin++]; } } else { ++result[bin + 63]; } } } return result; } }; } ```
eseiler commented 4 years ago

See https://docs.google.com/document/d/1LxDRXm4kLMuFYyRBnml4uZttOUIzu3eCsj78jfODz2g

marehr commented 4 years ago

2020-06-18:

We discussed whether to make the template parameter for (un-) compressed easy to set.

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/range/views/kmer_hash.hpp>
#include <seqan3/search/dream_index/technical_binning_directory.hpp>

using seqan3::operator""_dna4;

int main()
{
    seqan3::ibf_config cfg{seqan3::bin_count{8u},
                           seqan3::bin_size{1ULL<<16},
                           seqan3::hash_function_count{2u}};

    std::vector<std::vector<seqan3::dna4>> technical_bins{"ACTGACTGACTGATC"_dna4,
                                                          "GTGACTGACTGACTCG"_dna4,
                                                          "AAAAAAACGATCGACA"_dna4};

    auto v = seqan3::views::kmer_hash(seqan3::ungapped{5u});
    seqan3::technical_binning_directory tmp{technical_bins, std::move(v), cfg};

                                        // :(
    seqan3::technical_binning_directory<decltype(v), seqan3::data_layout::compressed> tbd{tmp};

    auto & result = tbd.bulk_contains(0);
    seqan3::debug_stream << result << '\n'; // [0,0,1,0,0,0,0,0]
}

Possible Solutions:

Resolution:

smehringer commented 1 year ago

Thought dump:

Maybe a seqan3::binning_directory is the wrong solution. Want we want is a convenient way to build the seqan3::interleaved_bloom_filter from sequences given a configuration.

Maybe a seqan3::ibf_factory is a better way of providing this. It will get a configuration and construct an ibf from it.