nh13 / DWGSIM

Whole Genome Simulator for Next-Generation Sequencing
GNU General Public License v2.0

How is read depth distributed across the genome? #82

Closed samsonweiner closed 1 year ago

samsonweiner commented 1 year ago

Hello, I'm hoping you can help me understand how coverage is distributed across the genome. Some sequencing technologies have non-uniform coverage, and I'm investigating how well this can be simulated.

If I call dwgsim with 1,000,000 reads over the standard human reference genome, will these reads be distributed relatively evenly? I'm aware that dwgsim can also be called with the -C parameter. Suppose I set -C to 10. That indicates a mean depth of 10 reads at each base, but what is the deviation? That is, will some regions have significantly more or less than 10x coverage? Is there a way to control it?

Thank you in advance for your help.

nh13 commented 1 year ago

The -C option is used to compute the number of read pairs to simulate. If your genome (or region) is 1000 bases, you want 10x coverage, and your reads are 2x150b, then you will get 1000 * 10 / (150 + 150) read pairs output (not including random reads). From there, read pairs are simulated by randomly drawing a position in the genome (uniform distribution).
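To make the arithmetic and the uniform draw concrete, here is a rough Python sketch. It is not dwgsim's code: the function names are made up for illustration, truncating the pair count to an integer is an assumption, and reads are placed independently rather than as pairs with an insert size. It reproduces the 1000 * 10 / (150 + 150) calculation, then measures how much per-base depth varies when start positions are drawn uniformly.

```python
import random

def num_read_pairs(genome_length, coverage, read1_length, read2_length):
    """Read pairs needed for a target mean coverage.

    Truncating to an integer is an assumption; dwgsim may round differently.
    """
    return int(genome_length * coverage / (read1_length + read2_length))

def uniform_depth(genome_length, n_reads, read_length, seed=0):
    """Per-base depth when read starts are drawn uniformly at random.

    Simplification: each read is placed independently (no pairing).
    """
    rng = random.Random(seed)
    diff = [0] * (genome_length + 1)  # difference array: +1 at start, -1 past end
    for _ in range(n_reads):
        start = rng.randint(0, genome_length - read_length)  # inclusive bounds
        diff[start] += 1
        diff[start + read_length] -= 1
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth

# The example from the comment above: 1000 bases, 10x, 2x150b reads.
pairs_small = num_read_pairs(1000, 10, 150, 150)

# A larger toy genome to look at the spread around the mean.
genome_length = 100_000
n_reads = 2 * num_read_pairs(genome_length, 10, 150, 150)
depth = uniform_depth(genome_length, n_reads, 150)
mean_depth = sum(depth) / len(depth)
variance = sum((d - mean_depth) ** 2 for d in depth) / len(depth)
print(pairs_small, round(mean_depth, 2), round(variance, 2), min(depth), max(depth))
```

With uniform start positions, the depth at a given base is approximately Poisson distributed, so at 10x the standard deviation should be roughly sqrt(10), or about 3. Individual positions will deviate from 10, but there is no systematic regional bias.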

There is no attempt to do anything more complicated, such as modeling the coverage profile of existing sequencing technologies. Perhaps you could simulate as much coverage as you need, then post-process (filter/remove reads) in the regions that should have reduced coverage?
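The post-processing idea could look something like the sketch below. This is only an illustration: `downsample_reads`, the `(name, start)` read tuples, and the `(start, end, keep_fraction)` region format are all hypothetical, not part of dwgsim. In practice you would parse positions from the simulated FASTQ read names or from an alignment, and you would drop both mates of a pair together.

```python
import random

def downsample_reads(reads, regions, seed=0):
    """Keep each read with a probability that depends on where it starts.

    reads:   list of (name, start) tuples (hypothetical format)
    regions: list of (start, end, keep_fraction), end exclusive;
             reads outside every region are always kept
    """
    rng = random.Random(seed)
    kept = []
    for name, start in reads:
        keep_fraction = 1.0
        for region_start, region_end, fraction in regions:
            if region_start <= start < region_end:
                keep_fraction = fraction
                break
        # rng.random() is in [0, 1), so 0.0 drops everything and 1.0 keeps everything
        if rng.random() < keep_fraction:
            kept.append((name, start))
    return kept

# Toy usage: remove all reads starting in [0, 50), thin [50, 80) to about half.
reads = [(f"read{i}", i) for i in range(100)]
regions = [(0, 50, 0.0), (50, 80, 0.5)]
kept = downsample_reads(reads, regions)
print(len(kept))
```

Because this approach can only remove reads, the initial simulation needs to be run at the highest coverage you want anywhere in the genome.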

samsonweiner commented 1 year ago

Thank you for the explanation! This is exactly what I needed to know.

I'll go ahead and close the issue.