Closed smehringer closed 1 month ago
Documentation preview available at https://docs.seqan.de/preview/seqan/raptor/432
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.94%. Comparing base (e73a40c) to head (50252b1). Report is 2 commits behind head on main.
@eseiler I'm currently running the script on the new refseq dataset.
Is there a reason why we write 10 million files with one read each instead of a single file containing all reads?
Not really
If it's fine with you, I would change this. I'm sure IT is unhappy with the 10 million files I just created :D
Alright :)
How are the weights generated?
One major thing. Otherwise thanks for all the work. This is a big improvement. I did not check every single change as git did not show them nicely. It looks fine and I will check the read generation with the file sizes soon.
I wrote the major differences in the commit message. I think the only big difference is that I skip references/records that are shorter than the read length
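To illustrate the skipping rule, here is a minimal sketch. It is not the code from this PR: the function name `usable_records` and the use of plain `std::string` sequences are assumptions made for the example. It only shows the idea that a record shorter than the read length cannot yield a full read and is therefore dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): keeps only those records that are
// long enough to sample at least one read of length `read_length`.
std::vector<std::string> usable_records(std::vector<std::string> const & records,
                                        std::size_t const read_length)
{
    assert(read_length > 0u);

    std::vector<std::string> result;
    for (std::string const & record : records)
    {
        // A record shorter than the read length cannot contain a full read.
        if (record.size() >= read_length)
            result.push_back(record);
    }
    return result;
}
```

With a read length of 4, the records `ACGT`, `AC` and `ACGTACGT` would be reduced to `ACGT` and `ACGTACGT`.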
@eseiler will you alter/add the HLL version? I can also do it if you want.
I can do it.
As a follow-up, we can add an "evenly" mode. Then we don't need separate generate_reads and generate_reads_refseq executables.
Generating reads for the simulated dataset is basically just using the same weight for each bin.
We could add the evenly mode in this PR, but because the scripts need to be adapted (and I have better versions for the script on another branch), switching over to a single executable should be a separate PR.
@smehringer I think I'm done
I don't think I ever wrote up the changes I made to generate_reads_refseq.
The changes introduced here currently read in a file with weights and then generate a number of reads per bin such that large user bins get more reads than small ones.
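As a rough sketch of that idea (not the actual implementation in this PR), the per-bin read counts could be derived from the weights as below. The function name `reads_per_bin`, the integer weights, and the way leftover reads are handed out are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the PR): splits `total_reads` across bins in
// proportion to `weights`. Assumes `total_reads * weight` fits into 64 bits.
std::vector<std::uint64_t> reads_per_bin(std::vector<std::uint64_t> const & weights,
                                         std::uint64_t const total_reads)
{
    assert(!weights.empty());
    std::uint64_t const weight_sum = std::accumulate(weights.begin(), weights.end(), std::uint64_t{});
    assert(weight_sum > 0u);

    std::vector<std::uint64_t> result(weights.size());
    std::uint64_t assigned{};
    for (std::size_t i = 0; i < weights.size(); ++i)
    {
        // Integer division rounds down, so a few reads may be left over.
        result[i] = total_reads * weights[i] / weight_sum;
        assigned += result[i];
    }

    // Hand the leftover reads to the first bins with a non-zero weight.
    // There are always fewer leftover reads than such bins, so one pass suffices.
    std::uint64_t remainder = total_reads - assigned;
    for (std::size_t i = 0; i < result.size() && remainder > 0u; ++i)
    {
        if (weights[i] > 0u)
        {
            ++result[i];
            --remainder;
        }
    }
    return result;
}
```

Under these assumptions, weights of 1, 1 and 2 with 8 reads in total give 2, 2 and 4 reads. Passing the same weight for every bin spreads the reads as evenly as possible.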
Followup: