ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

Quality scores not mimicking real data #65

Closed jlmarlo closed 3 months ago

jlmarlo commented 1 year ago

Hello! I've been using this program to simulate whole genome sequencing data for horses to evaluate different variant calling and variant filtration methods. So far it's been really easy to use and understand. However, I've been having issues with the Sequencing Error Model. Regardless of what options I use, what sample fastqs I provide, or even if I use the default error models that come in the program files, I can't get any variation in the quality scores. The code that I'm using to generate the error models looks like this:

python genSeqErrorModel.py \
       -i /scratch.global/marlo072/simulateddata/M11445_R1_001.fastq.gz \
       -o /scratch.global/marlo072/simulateddata/M11445.SeqErrorModel \
       -i2 /scratch.global/marlo072/simulateddata/M11445_R2_001.fastq.gz

Then I use the generated model with the following command:

python gen_reads.py \
       -r /scratch.global/marlo072/simulateddata/goldenPath.Ec_build-3.0_wMSY.fa \
       -R 130 \
       -o /scratch.global/marlo072/simulateddata/S005 \
       -c 10 \
       -e /scratch.global/marlo072/simulateddata/M11445.pickle.gz \
       -m /scratch.global/marlo072/simulateddata/M11445.MutModel.pickle.gz \
       --pe-model  /scratch.global/marlo072/simulateddata/M11445.fraglenModel.pickle.gz \
       --gc-model /scratch.global/marlo072/simulateddata/M11445GCModel.pickle.gz \
       --vcf \
       --bam

Another thing that may be worth noting is that the name of the sequencing error model appears different in the two commands. The output name I gave for the error model, "M11445.SeqErrorModel," gets truncated to just "M11445" when genSeqErrorModel.py is run, so the corresponding output file is "M11445.pickle.gz". I've seen this happen any time there's a period in the output name.

When running FastQC on the resulting fastqs I get a distribution that centers around the correct average from the example FastQC report, but does not follow any sort of pattern throughout the read length. (FastQC per-base quality screenshot attached.)

Thank you for all of the great work you've done with this tool and I look forward to using it more in the future.

joshfactorial commented 1 year ago

Which version of the code are you using? The name thing is a consequence of pathlib, which treats periods as delimiters between names and suffixes. I'll see if I can work around that.
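For reference, the truncation described above matches pathlib's stem/suffix split (this is an illustrative snippet, not NEAT's actual code):

```python
from pathlib import Path

# pathlib treats the last dot-separated piece of a filename as a suffix,
# so "M11445.SeqErrorModel" is split into a stem and a "suffix".
p = Path("M11445.SeqErrorModel")
print(p.stem)    # M11445
print(p.suffix)  # .SeqErrorModel
```

Any code that builds the output filename from `p.stem` will therefore drop everything after the first-from-the-right period, which is consistent with "M11445.SeqErrorModel" becoming "M11445.pickle.gz".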

joshfactorial commented 1 year ago

I did intentionally work on simplifying the quality score model a bit, but I may have overshot my target.

jlmarlo commented 1 year ago

I believe I'm using version 3.3. The documentation says it's version 3.2, but I cloned everything after the 3.3 update was published

joshfactorial commented 1 year ago

Oh okay, then yeah. Let me try to confirm that I can reproduce the behavior you're seeing.

jlmarlo commented 1 year ago

Hello. I just wanted to check up on this issue and see if you were able to replicate the problem or if it is stemming from an error I am making myself.

joshfactorial commented 1 year ago

It most likely is the code, but we've been pushing hard on getting version 4.0 ready. I plan to address this in that release, which I'm hoping will be ready by the end of the year. When you run FastQC on your initial input fastqs, do they look more like you'd expect, with lower quality reads toward the ends?

jlmarlo commented 1 year ago

Yes, they look like typical FastQC reports.

joshfactorial commented 1 year ago

Okay, thank you. An earlier release may produce better fastqs for the moment. I can make a ticket to fix this for version 3.

joshfactorial commented 1 year ago

We investigated this and found that it was an error we introduced. For now, you can use version 2.0, which runs in Python 2. You should still be able to install Python 2 using Conda. I believe the only package required for normal use is numpy. We will be introducing a fix for this as soon as we can.
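As a sketch of that workaround, a Python 2 Conda environment with numpy might be created like this (the environment name "neat2" is just an example; adjust to your setup):

```shell
# Create an isolated Python 2.7 environment with numpy for running NEAT 2.0.
conda create -n neat2 python=2.7 numpy

# Activate it before invoking the version 2.0 scripts.
conda activate neat2
```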

jlmarlo commented 1 year ago

Thank you for looking into this and giving an alternative for now. I look forward to the newest version of the tool and the new features you're working on. It's one of the most inclusive and easiest genome simulating tools I've worked with and I can't wait for the future of it!

joshfactorial commented 1 year ago

Apologies for the delays. I'm finally getting to this. Could you send me the quality score model that the code generated? Thanks!

jlmarlo commented 1 year ago

Here is the error model, which, from my understanding, also contains the quality score model. A4416ErrorModel.pickle.gz

joshfactorial commented 1 year ago

I found a couple of issues in our implementation. One is that it was averaging read1 and read2, which was no doubt flattening the profile a bit. I also needed to preserve some other data. I have a fix in mind that I'm currently implementing. I'm going to fix this in version 4.0 for sure, then retrofit that to version 3 if I have time.
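To illustrate why averaging the two mates can wash out the profile (this is a toy sketch with made-up numbers, not NEAT's actual implementation):

```python
import numpy as np

# Hypothetical per-position mean quality profiles for the two mates.
# R1 and R2 of a paired-end run typically have different shapes.
r1 = np.array([38.0, 37.0, 35.0, 32.0, 28.0])  # R1: high, degrading late
r2 = np.array([34.0, 30.0, 27.0, 24.0, 20.0])  # R2: lower, degrading early

# Averaging collapses both mates onto one intermediate curve.
combined = (r1 + r2) / 2
print(combined)  # [36.  33.5 31.  28.  24. ]
```

If both simulated mates then draw from the same averaged curve, the mate-specific shape is lost, which would contribute to the flat, patternless FastQC profiles reported above.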