seqan / raptor

A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences.
https://docs.seqan.de/raptor

[UPDATE] Missing update on generate reads refseq #432

Closed smehringer closed 1 month ago

smehringer commented 1 month ago

I think I never updated the changes I made to generate reads refseq.

The changes introduced here currently read in a file with per-bin weights and then generate a number of reads per bin such that large user bins receive more reads than small ones.
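The weighting step described above could be sketched roughly as follows. This is a minimal illustration, not the actual script: the weights-file format (a mapping of bin id to weight) and all names are assumptions.

```python
# Sketch (not the actual script): distribute a total read budget across
# user bins proportionally to per-bin weights, e.g. bin sizes.
# The bin-id -> weight mapping is an assumed representation of the weights file.

def reads_per_bin(weights, total_reads):
    """Map each bin id to its share of `total_reads`, proportional to its weight."""
    total_weight = sum(weights.values())
    return {
        bin_id: round(total_reads * weight / total_weight)
        for bin_id, weight in weights.items()
    }

# Larger bins receive proportionally more reads:
print(reads_per_bin({"bin_0": 100, "bin_1": 300, "bin_2": 600}, 1000))
# -> {'bin_0': 100, 'bin_1': 300, 'bin_2': 600}
```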

Followup:

seqan-actions commented 1 month ago

Documentation preview available at https://docs.seqan.de/preview/seqan/raptor/432

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.94%. Comparing base (e73a40c) to head (50252b1). Report is 2 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #432   +/-   ##
=======================================
  Coverage   99.94%   99.94%
=======================================
  Files          51       51
  Lines        1676     1676
  Branches        1        1
=======================================
  Hits         1675     1675
  Misses          1        1
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

smehringer commented 1 month ago

@eseiler I'm currently running the script on the new refseq dataset.

Is there a reason why we write 10 million files with one read each instead of a single file containing all reads?

eseiler commented 1 month ago

> @eseiler I'm currently running the script on the new refseq dataset.
>
> Is there a reason why we write 10 million files with one read each instead of a single file containing all reads?

Not really

smehringer commented 1 month ago

If it's fine with you, I would change this. I'm sure IT is unhappy with the 10 million files I just created :D

eseiler commented 1 month ago

> If it's fine with you, I would change this. I'm sure IT is unhappy with the 10 million files I just created :D

Alright :)
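The change agreed on above amounts to writing all simulated reads into one multi-record FASTA file instead of one file per read. A minimal sketch, where the read ids and output file name are illustrative and not the actual script's layout:

```python
# Sketch: write all simulated reads into one multi-record FASTA file
# instead of one file per read. Read ids and the file name are
# assumptions, not the actual script's output layout.

def write_reads(reads, path):
    """Write (read_id, sequence) pairs as a single FASTA file."""
    with open(path, "w") as out:
        for read_id, seq in reads:
            out.write(f">{read_id}\n{seq}\n")

write_reads([("read_0", "ACGT"), ("read_1", "TTGA")], "all_reads.fasta")
print(open("all_reads.fasta").read())
```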

eseiler commented 1 month ago

How are the weights generated?

eseiler commented 1 month ago

One major thing; otherwise, thanks for all the work. This is a big improvement. I did not check every single change, as git did not display the diff nicely. It looks fine, and I will check the read generation against the file sizes soon.

I wrote up the major differences in the commit message. I think the only big difference is that I now skip references/records that are shorter than the read length.
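The record filter described above could look roughly like this. Parsing is simplified to plain (name, sequence) pairs; the real script would read FASTA records, and all names here are assumptions.

```python
# Sketch of the filter mentioned above: skip references/records that are
# shorter than the read length, since no read can be sampled from them.
# Record parsing is simplified; the real script would use a FASTA parser.

def usable_records(records, read_length):
    """Yield only (name, sequence) pairs long enough to sample a read from."""
    for name, seq in records:
        if len(seq) >= read_length:
            yield name, seq

records = [("ref_long", "ACGT" * 50), ("ref_short", "ACG")]
print([name for name, _ in usable_records(records, read_length=100)])
# -> ['ref_long']
```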

smehringer commented 1 month ago

@eseiler will you alter/add the HLL version? I can also do it if you want.

eseiler commented 1 month ago

> @eseiler will you alter/add the HLL version? I can also do it if you want.

I can do it.

As a follow-up, we can add an "evenly" mode. Then we don't need separate generate_reads and generate_reads_refseq executables; generating reads for the simulated dataset is basically just using the same weight for each bin.

We could add the evenly mode in this PR, but because the scripts need to be adapted (and I have better versions of the script on another branch), switching over to a single executable should be a separate PR.
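The proposed "evenly" mode could be sketched like this: the weighted path uses the weights from the file, while "evenly" assigns every bin the same weight, so each bin ends up with the same number of reads. Function and mode names are assumptions, not the actual CLI.

```python
# Illustrative sketch of the proposed "evenly" mode. The weighted path
# uses per-bin weights from a file; "evenly" gives every bin the same
# weight. All names here are hypothetical, not the actual executables.

def make_weights(bin_ids, mode, file_weights=None):
    if mode == "evenly":
        # Same weight everywhere -> same number of reads per bin.
        return {bin_id: 1 for bin_id in bin_ids}
    if mode == "weighted":
        return dict(file_weights)
    raise ValueError(f"unknown mode: {mode}")

print(make_weights(["bin_0", "bin_1"], "evenly"))
# -> {'bin_0': 1, 'bin_1': 1}
```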

eseiler commented 1 month ago

@smehringer I think I'm done