[ ] 2. split single large database sequences into multiple parts (https://github.com/eaasna/sliding-window/pull/27) and handle them in parallel (parallelize over all segments of one or more database sequences, in particular a single very large database sequence that could not be parallelized otherwise)
[ ] this will give, for each query_id, a set of database_id's
[ ] we need to convert that to a set of query_id's for each database_id (this is solved by the shopping cart queue)
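The inversion step above can be sketched as a plain map inversion; this is only an illustration of the grouping the shopping cart queue performs, not the actual queue implementation (the function name `invert` and the use of `std::map`/`std::set` are assumptions):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Sketch: turn a query_id -> {database_id} mapping into
// database_id -> {query_id}, so work can be grouped per database.
std::map<uint64_t, std::set<uint64_t>>
invert(std::map<uint64_t, std::set<uint64_t>> const & query_to_db)
{
    std::map<uint64_t, std::set<uint64_t>> db_to_query;
    for (auto const & [query_id, db_ids] : query_to_db)
        for (uint64_t db_id : db_ids)
            db_to_query[db_id].insert(query_id); // group queries by database
    return db_to_query;
}
```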
[ ] 4. change the "taxonomy" of what a database sequence is (it can be a database segment, i.e. database ID + start position + end position); for example, |AAAA|AAAA|A has 3 slightly overlapping segments of the same database sequence
(one large reference (e.g. 1MB) database sequence against a lot of small query sequences (short reads, e.g. 100B))
Note: this is a combination of Problems 2 and 3
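A minimal sketch of the segmentation described in item 4, emitting (database ID, start, end) triples with a fixed overlap between neighbouring segments (the struct and function names here are assumptions, not the repo's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A database segment: database ID plus half-open [start, end) positions.
struct segment
{
    std::size_t db_id;
    std::size_t start;
    std::size_t end;
};

// Split a sequence of length `len` into segments of at most `seg_len`,
// with `overlap` positions shared between neighbouring segments.
std::vector<segment> make_segments(std::size_t db_id, std::size_t len,
                                   std::size_t seg_len, std::size_t overlap)
{
    assert(seg_len > overlap); // otherwise the window would not advance
    std::vector<segment> segments;
    std::size_t const step = seg_len - overlap;
    for (std::size_t start = 0; start < len; start += step)
    {
        std::size_t end = std::min(start + seg_len, len);
        segments.push_back({db_id, start, end});
        if (end == len) // last segment reaches the end of the sequence
            break;
    }
    return segments;
}
```

With the toy parameters length 9, segment length 4, overlap 1, this reproduces the |AAAA|AAAA|A example: three segments covering [0,4), [3,7), [6,9).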
After that Long Term:
[ ] split query sequence into segments (slightly overlapping) and ask the IBF which segments would produce an eps-match; then build the SWIFT filter only for the interesting query segments.
(one large reference (e.g. 1MB) database sequence against a single large query sequence (e.g. 1MB))
Mind Example:
AAAAAAAAAAAAAAAA|AAAAA]AAAAAAAA|AAAA]AAA
DatabaseSegment1 -> Database1
DatabaseSegment2 -> Database1 (IBF could just say this one)
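The prefiltering idea above (keep only query segments the IBF flags) can be sketched as a filter over segments; the struct, the function name `prefilter`, and the `ibf_reports_match` predicate are stand-ins for the real IBF query, not the actual interface:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A query segment with half-open [start, end) positions.
struct query_segment
{
    std::size_t start;
    std::size_t end;
};

// Keep only the segments for which the IBF reports a potential eps-match;
// the SWIFT filter would then be built for these segments only.
std::vector<query_segment>
prefilter(std::vector<query_segment> const & segments,
          std::function<bool(query_segment const &)> const & ibf_reports_match)
{
    std::vector<query_segment> interesting;
    for (auto const & seg : segments)
        if (ibf_reports_match(seg)) // stand-in for the IBF membership query
            interesting.push_back(seg);
    return interesting;
}
```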
Things to check:
sha256sum
verified, i.e. make datasets
Things to do:
raptor build
we might want to have dream_stellar build
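For the sha256sum check above, a hedged example of how dataset verification could look (the file names `dataset.fa` and `checksums.sha256` are assumptions for illustration):

```shell
# Record a checksum for a dataset, then verify it with sha256sum's check mode.
echo "ACGT" > dataset.fa
sha256sum dataset.fa > checksums.sha256
sha256sum -c checksums.sha256
```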