rrwick / Badread

a long read simulator that can imitate many types of read problems
GNU General Public License v3.0
171 stars 22 forks source link

--qualitity does not seem to take an absolute value #28

Closed hasindu2008 closed 9 months ago

hasindu2008 commented 9 months ago

First of all, I should thank you for writing a tool that was surprisingly easy to setup and with detailed documentation. I am facing a potential minor bug or I could be doing something silly.

Describe the bug I installed the latest badreads from GitHub. --qualitity 3000 option does not seem to work.

To reproduce badread simulate --reference /genome/nCoV-2019.reference.fasta --quantity 3000 --length 10800,13000 --error_model nanopore2020 --qscore_model nanopore2020 --identity 90,98,5 > reads.fastq

Error message No error, just only 3 reads are created instead of 3000.

Simulating: 3 reads  4,791 bp  100.0%

System info

Additional context When I use a coverage like 5X, it seem to work as intended.

rrwick commented 9 months ago

Hi Hasindu,

I apologise - the documentation isn't very clear about this. When giving an absolute quantity value (not using X), you are specifying the bases to be generated, not the read count. In your case, you got 3 reads because the third read brought the total over 3000 bp.

There's no direct way to specify read count, so you'll need to do the maths from your mean fragment length. Using your example: 10800 bp/read × 3000 reads = 32.4M. This should give you approximately 3000 reads. If you need exactly 3000 reads, you should round up a bit (e.g. 35M) and then truncate your output file.

However, I see that you're using a COVID-19 genome, which is very small (compared to the bacterial genomes I'm used to) and linear. This will mess with the maths, because Badread's length-distribution will be capped (you can't generate a 50 kbp read from a 30 kbp genome). This means many of your reads will span the full length of your reference sequence, and your mean read length will actually be less than the 10800 bp you asked for. So the above maths will probably give you more than the expected 3000 reads.

Hope that helps! Let me know if I can clarify anything.

Ryan

hasindu2008 commented 9 months ago

Thanks, ultimately I will be using the chr22 of the human genome. The COVID genome was just to play with. I will manually calculate the value for the quantity, which I assumed to be the read count, but now I know it is the total number of bases. Highly appreciate your fast response.

rrwick commented 9 months ago

No problem!

And since a human chromosome is pretty large, you may want to run multiple instances of Badread in parallel. Since Badread is pure Python, it's not very fast, but each read is independent so you can run many instances at once and then concatenate the results.

hasindu2008 commented 9 months ago

Actually, it wasn't too bad on a single thread - took 4-5 hours or so - much better than some other simulation tools that this Conda had to install half of the internet for over 2 days, and eventually gave some bizarre errors during runtime, making me just give up :D Badread is a pretty neatly written tool, I wish every Python bioinformatics tool was written like this to be easily set up!

rrwick commented 9 months ago

Glad it's working for you!