ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
37 stars 12 forks source link

FP SNPs in "golden" vcf? #73

Closed georgiiprovisor closed 1 month ago

georgiiprovisor commented 1 year ago

Describe the bug I generate reads with the following command:

"python /gpfs/home1/gozhegov/NEAT/gen_reads.py -r /gpfs/work3/0/qtholstg/hg38_res/Ref/Ref_simplified.fa -c 50 -R 147 -v {input.vcf} -tr  /gpfs/work3/0/qtholstg/hg38_res/intervals/Agilent_V6_hg38.uniq.bed -o {params.out_prefix} --pe 300 80 --bam --vcf"

In the golden VCF, I see a lot of SNPs that are not present in golden BAM (these regions are not covered at all) nor in the vcf that was used for producing reads.

Is it a bug or I'm using the tool wrong?

Screenshot from 2023-03-12 18-05-29

joshfactorial commented 1 year ago

By default, NEAT will still produce some SNPs as background mutations and errors outside of targeted regions, but those should also be reflected in the BAM. I'll have to dig in deeper to see what might be going on. Are you using the master branch or a tagged version of NEAT for this?

georgiiprovisor commented 1 year ago

I used the master branch.

There are too many SNPs outside golden bam, here is the statistic:

TotalGoldenVariants,1262106
FilteredGoldenVariants,0
MergedGoldenVariants,158
RedundantGoldenVariants,77
TotalWorkflowVariants,144001
FilteredWorkflowVariants,0
MergedWorkflowVariants,29
RedundantWorkflowVariants,0
PerfectMatches,143686
FN_variants,1118420
FP_variants,375
EquivalentVariants,1005

Seems like 10x more variants than are relly here.

joshfactorial commented 1 year ago

Okay, let me see if I can replicate.

georgiiprovisor commented 1 year ago

If it helps, I filtred "golden" vcf by capture kit. Leave only SNPs in the targeted area. And now it looks more realistic:

Total Golden Variants:   68287  [ 0 filtered, 9 merged, 2 redundant ]
Total Workflow Variants: 63435  [ 0 filtered, 10 merged, 0 redundant ]

Perfect Matches: 63342 (92.76%)
FN variants:     4945 (7.24%)
FP variants:     99

Number of equivalent variants denoted differently between the two vcfs: 520
joshfactorial commented 1 year ago

Okay. I'll see if there is a bug in the code causing this, or if it was a 'feature' as sometimes happens.