ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
47 stars 14 forks source link

Coverage not working as intended #54

Closed PratyushTandale closed 4 months ago

PratyushTandale commented 2 years ago

Hello dev, I have been using your tool to work on some indel simulation and i see that the -c parameter is a hit or miss when it comes to simulation.

Commands that i tried

python ./NEAT/gen_reads.py -r hs37d5.fa -tr target.bed -R 151 -c 25 -E 0.0027 -M 0 -v dup_50_100_NGS120.vcf --pe 255 84 -p 2 --bam --vcf -o neat-50-100-dup-NGS120

so in this case i get coverage around 25 but it drops lower than 25 coverage and doesnt generate the variants that i have provided in the vcf file.

In order to force the coverage i used the --force-coverage parameter shown in the help document --force-coverage [debug] ignore fancy models, force coverage to be constant (default: False)

python ./NEAT/gen_reads.py -r hs37d5.fa -tr target.bed -R 151 -c 25 --force-coverage -E 0.0027 -M 0 -v dup_50_100_NGS120.vcf --pe 255 84 -p 2 --bam --vcf -o neat-50-100-dup-NGS120

after using this command i get the variants i want generated nicely but the coverage in some regions just jump 20-50times more which is also not helpful as the coverage is going way above the required value i want to test at.

Is there a way to get the coverage at a certain level containing the variant with good reads.

joshfactorial commented 2 years ago

Okay, I've been doing some investigation into this. A couple of things I'm noticing:

I did not see coverage jumping to 20-50x my input coverage. With a -c 25, I saw a max of 30. To my way of thinking, we would want the following behavior:

  1. insert all valid input variants (i.e., we would still skip variants that had an incorrect REF or whose position was out of range), regardless of targeted bed
  2. force-coverage produces a constant coverage set of reads
  3. no reads outside the targeted coverage bed, if --discard-offtarget set

Items 2 and 3 will probably have to wait for version 4.0, because I'm not sure the code will allow for it as written. Item 1 I can work on now. Would you agree with those points?

joshfactorial commented 2 years ago

I see the problem with force coverage: it was never fully implemented.

joshfactorial commented 4 months ago

I have improved the coverage calculations and procedure, in version 4.2. For now, we have eliminated GC bias and simplified the targeted regions bed process. Check it out and post any further ideas/comments/solutions you have!