ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
38 stars 12 forks source link

Number of variants generated from mutational model lower than expected #51

Closed aldosr closed 3 months ago

aldosr commented 2 years ago

Hi, I'm generating tumor reads using the same code described in #49 and bypassing the error using a simple try-except. Anyway, the number of mutations inserted (153) are considerably lower than the original input tumor dataset (~56.000), from which the mutational model is generated. Looking into the code, if I'm not wrong, the number of variants, generated with the poisson distribution, are lowered.

Do you think the number of mutations generated is correct or there is a bug in the mutational model / artificial reads generation? Thanks a lot.

joshfactorial commented 2 years ago

We had an issue with the poisson list previously, where it skewed the number of variants too low, but I pushed a fix a week or two ago. Which version of the code are you using?

joshfactorial commented 2 years ago

I did find an error in the way the poisson code worked. I'm working on a fix now.

joshfactorial commented 2 years ago

It turns out that the poisson list function was working as expected. What you're seeing in init_poisson is the calculation of the subset of mutations in the current window, not the calculation for the entire chromosome. When you ran gen_mut_model on your samples, it should have given an output such as:

p(snp)   = 0.3
p(indel) = 0.1
overall average mut rate: 0.00623
total variants processed: 23984

If you can find that, please post your numbers. It would be helpful to compare to the number of records in the input vcf, so please supply that. If you do not have the above output, try running:

pickle.load(gzip.open('models/errorModel.pickle.gz', 'r'))[4]

using your generated error model as input, and that should give the average mutation rate calculated by the function. Seeing the calculated rate and the # processed v # supplied may help us identify any errors in that script.

Finally, is the mutation count you gave based on the count in the golden vcf output? Or calculated by some other means?