BQ of generated reads do not match model specification

cntrier commented 5 years ago

Hi,

I've created a read model with the following script:

mitty create-read-model synth-illumina \ 100.pkl \ --read-length 100 \ --mean-template-length 250 \ --std-template-length 20 \ --bq0 30 \ --k 200 \ --sigma 5

And when I check the read model in Mitty with the following: mitty describe-read-model 100.pkl 100.png

It looks as expected: 100 But when I generate reads using the model with the following code: k=HG00632 i=100

mitty -v4 generate-reads GRCh38.p12.fa \ ./final_vcfs/${k}_all.vcf.gz \ ${k} all_mergedsorted.bed \ ${i}.pkl \ 40 \ 7 \ ${k}${i}reads-test.1.fq \ ${k}${i}-lq.txt \ --fastq2 ${k}_${i}reads-test.2.fq \ 2> vcf-${i}${k}.log

The generated reads have a flat BQ of 9 when I check them with FastQC:

And when I run the god-aligner to create a bam file, I can see in IGV that the reads are a mess. I've tried running different individuals, different read lengths but get the same pattern.

Have I misunderstood something with the read model generation?

Thank you very much for any help you can provide on the matter.

ghost commented 5 years ago

@cntrier thank you for this comprehensive report! Will make it easy to replicate and hopefully isolate and fix.

ghost commented 5 years ago

@cntrier if you use the parameters given in the examples in the Readme, can you replicate the results shown in the Readme? This information is useful to me as a means of understanding if the bug has always been there, or is a recently introduced bug. Thanks!

cntrier commented 5 years ago

Thank you for getting back to me.

I have now downloaded the demo data and ran the same parameters as in the README and encountered the same problem.

I filtered variants and generated a synthetic read model with the code from the README and I have the same output as the README with mitty describe-read-model function: When I generated reads using the synthetic read model with the parameters from the README, again the reads had a flat BQ of 9.

Here is the code:

FASTA=human_g1k_v37.fa.gz
SAMPLEVCF=1kg.20.22.vcf.gz
SAMPLENAME=HG00119
REGION_BED=region.bed
FILTVCF=HG00119-filt.vcf.gz
READMODEL=1kg-pcr-free.pkl
COVERAGE=30
READ_GEN_SEED=7
FASTQ_PREFIX=HG00119-reads
READ_CORRUPT_SEED=8

docker run --rm -v /Users/cassandra/Desktop/Ous_mitty:/VCF:rw ous_nbs/mitty /opt/conda/envs/mymitty/bin/mitty -v4 filter-variants \
  /VCF/${SAMPLEVCF} \
  ${SAMPLENAME} \
  /VCF/${REGION_BED} \
  -  \
  2> vcf-filter.log | bgzip -c > ${FILTVCF}

tabix -p vcf ${FILTVCF}

MODELNAME=synthetic-model.pkl

docker run --rm -v /Users/cassandra/Desktop/Ous_mitty:/VCF:rw ous_nbs/mitty /opt/conda/envs/mymitty/bin/mitty create-read-model synth-illumina \
  /VCF/${MODELNAME} \
  --read-length 121 \
  --mean-template-length 400 \
  --std-template-length 20 \
  --bq0 30 \
  --k 200 \
  --sigma 5 \
  --comment 'Model created for the demo'

docker run --rm -v /Users/cassandra/Desktop/Ous_mitty:/VCF:rw ous_nbs/mitty /opt/conda/envs/mymitty/bin/mitty describe-read-model /VCF/${MODELNAME} /VCF/${MODELNAME}.png

docker run --rm -v /Users/cassandra/Desktop/Ous_mitty:/VCF:rw ous_nbs/mitty /opt/conda/envs/mymitty/bin/mitty -v4 generate-reads \
  /VCF/${FASTA} \
  /VCF/${FILTVCF} \
  ${SAMPLENAME} \
  /VCF/${REGION_BED} \
  /VCF/${MODELNAME} \
  ${COVERAGE} \
  ${READ_GEN_SEED} \
  /VCF/${FASTQ_PREFIX}1.fq \
  /VCF/${FASTQ_PREFIX}-lq.txt \
  --fastq2 /VCF/${FASTQ_PREFIX}2.fq \
  --threads 2

Thank you for your time!

ghost commented 5 years ago

@cntrier thank you so much for this report - this is super helpful! I will fix this bug.

sbg / Mitty

BQ of generated reads do not match model specification #11