ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
38 stars 12 forks source link

Unable to simulate HOM (1/1) variants? Reads/VCF only show HET (0/1) #55

Closed blajoie closed 2 years ago

blajoie commented 2 years ago

Version NEAT-genReads V3.2

Hi - Just recently starting using this tool, but noticed that the reads / vcf / bam only support HET calls even when using a input VCF file containing many genotypes. Ploidy is set to 2, so I assume two haplotypes are being constructed from which to sample reads from, but still we are only seeing HET support in the reads. Any ideas?

Steps to reproduce below.

truth_vcf

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NEAT
phix 363     .       T       C       222     .       .       GT:PL   0/1:255,0,255
phix 466     .       T       A       222     .       .       GT:PL   1/1:255,0,255
phix 469     .       T       C       222     .       .       GT:PL   1/1:255,0,255
phix 513     .       G       A       222     .       .       GT:PL   1/1:255,0,255
phix 528     .       G       T       222     .       .       GT:PL   0/1:255,0,255

Running NEAT (phix as an example)

$ python ~/git/NEAT/gen_reads.py -r genome.fa -R 150 -o neat-sim -c 35.0 -E 0.0 -M 0.0 --pe 350 70 -d --bam --vcf -p 2 --force-coverage -v truth.vcf
Using default sequencing error model.
Warning: Quality scores no longer exactly representative of error probability. Error model scaled by 0.000 to match desired rate...
Warning: Read length of error model (101) does not match -R value (150), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=350 std=70
found index genome.fa.fai
--------------------------------
reading input VCF...

found 5 valid variants in input vcf.
 * 0 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 0 variants skipped due to multiple variants found per position
--------------------------------
reading elemphix_padded...
0.002 (sec)
found 5 valid variants for elemphix_padded in input VCF...
--------------------------------
sampling reads...
[PROCESSING WINDOW: ((0, 5566), [0]), next: (5216, 5566), isLastTime: True]
Read sampling completed in 0 (sec)
Writing output VCF...
NEAT finished the simulation in 0.8293910026550293

Followed by naive pileup + bcftools call to check GT status

$ samtools view -h neat-sim_golden.bam | bcftools mpileup --threads 1 --no-BAQ -Q 0 --ff UNMAP,SECONDARY,QCFAIL -Ov -f genome.fa -a 'AD,ADF,ADR,DP,SP,INFO/AD,INFO/ADF,INFO/ADR,FORMAT/SP' - | bcftools call --threads 1 -m -A | gre
p -v 0/0
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.11+htslib-1.11
##bcftoolsCommand=mpileup --threads 1 --no-BAQ -Q 0 --ff UNMAP,SECONDARY,QCFAIL -Ov -f genome.fa -a AD,ADF,ADR,DP,SP,INFO/AD,INFO/ADF,INFO/ADR,FORMAT/SP -
##reference=file://genome.fa
##contig=<ID=phix,length=5566>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand (high-quality bases)">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand (high-quality bases)">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths (high-quality bases)">
##INFO=<ID=ADF,Number=R,Type=Integer,Description="Total allelic depths on the forward strand (high-quality bases)">
##INFO=<ID=ADR,Number=R,Type=Integer,Description="Total allelic depths on the reverse strand (high-quality bases)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.11+htslib-1.11
##bcftools_callCommand=call --threads 1 -m -A; Date=Tue May 24 08:53:40 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NEAT
phix 363     .       T       C       185     .       DP=28;ADF=6,10;ADR=8,4;AD=14,14;VDB=0.201504;SGB=-0.686358;RPB=0.800964;MQB=1;MQSB=1;BQB=0.869704;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=6,8,10,4;MQ=60     GT:PL:DP:SP:ADF:ADR:AD  0/1:218,0,255:28:6:6,10:8,4:14,14
phix 466     .       T       A       200     .       DP=32;ADF=9,5;ADR=10,8;AD=19,13;VDB=0.244124;SGB=-0.683931;RPB=0.981766;MQB=1;MQSB=1;BQB=0.549964;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=9,10,5,8;MQ=60     GT:PL:DP:SP:ADF:ADR:AD  0/1:233,0,255:32:1:9,5:10,8:19,13
phix 469     .       T       C       181     .       DP=33;ADF=9,5;ADR=10,9;AD=19,14;VDB=0.383333;SGB=-0.686358;RPB=0.869806;MQB=1;MQSB=1;BQB=0.260949;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=9,10,5,9;MQ=60     GT:PL:DP:SP:ADF:ADR:AD  0/1:214,0,255:33:1:9,5:10,9:19,14
phix 513     .       G       A       222     .       DP=39;ADF=12,7;ADR=9,11;AD=21,18;VDB=0.866788;SGB=-0.691153;RPB=0.930232;MQB=1;MQSB=1;BQB=0.334958;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=12,9,7,11;MQ=60   GT:PL:DP:SP:ADF:ADR:AD  0/1:255,0,255:39:5:12,7:9,11:21,18
phix 528     .       G       T       222     .       DP=41;ADF=13,9;ADR=8,11;AD=21,20;VDB=0.931105;SGB=-0.692067;RPB=0.966558;MQB=1;MQSB=1;BQB=0.983471;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=13,8,9,11;MQ=60   GT:PL:DP:SP:ADF:ADR:AD  0/1:255,0,255:41:5:13,9:8,11:21,20

As you can see, all 5 come through as 0/1 HET. Is it possible to simulate reads across either GT as defined in the VCF? Any thoughts what could be going wrong here?

Cheers

blajoie commented 2 years ago

I just tried the v3.2 tagged release (instead of master) and it indeed can produce HOM https://github.com/ncsa/NEAT/releases/tag/3.2

python NEAT-3.2/gen_reads.py -r genome.fa -R 150 -o neat-sim -c 35.0 -E 0.0 -M 0.0 --pe 350 70 -d --bam --vcf -p 2 --force-coverage -v truth.vcf

mpileup + call variants

$ samtools view -h neat-sim_golden.bam | bcftools mpileup --threads 1 --no-BAQ -Q 0 --ff UNMAP,SECONDARY,QCFAIL -Ov -f genome.fa -a 'AD,ADF,ADR,DP,SP,INFO/AD,INFO/ADF,INFO/ADR,FORMAT/SP' - | bcftools call --threads 1 -m -A | grep -v 0/0

phix 363     .       T       C       222     .       DP=38;ADF=11,12;ADR=9,6;AD=20,18;VDB=0.712919;SGB=-0.691153;RPB=0.789777;MQB=1;MQSB=1;BQB=0.757384;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=11,9,12,6;MQ=60   GT:PL:DP:SP:ADF:ADR:AD  0/1:255,0,255:38:3:11,12:9,6:20,18
phix 466     .       T       A       225     .       DP=38;ADF=0,17;ADR=0,21;AD=0,38;VDB=0.184623;SGB=-0.693143;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,17,21;MQ=60  GT:PL:DP:SP:ADF:ADR:AD  1/1:255,114,0:38:0:0,17:0,21:0,38
phix 469     .       T       C       225     .       DP=38;ADF=0,17;ADR=0,21;AD=0,38;VDB=0.150404;SGB=-0.693143;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,17,21;MQ=60  GT:PL:DP:SP:ADF:ADR:AD  1/1:255,114,0:38:0:0,17:0,21:0,38
phix 513     .       G       A       225     .       DP=40;ADF=0,17;ADR=0,23;AD=0,40;VDB=0.246379;SGB=-0.693145;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,17,23;MQ=60  GT:PL:DP:SP:ADF:ADR:AD  1/1:255,120,0:40:0:0,17:0,23:0,40
phix 528     .       G       T       222     .       DP=41;ADF=9,10;ADR=7,15;AD=16,25;VDB=0.662778;SGB=-0.692914;RPB=0.957706;MQB=1;MQSB=1;BQB=0.394977;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=9,7,10,15;MQ=60   GT:PL:DP:SP:ADF:ADR:AD  0/1:255,0,250:41:5:9,10:7,15:16,25
joshfactorial commented 2 years ago

Okay, so it appears I need to merge the tagged version into master, in that case.

joshfactorial commented 2 years ago

This is one area we are working on improving in version 4.0 as well. Let me know if version 3.2 is working as expected.

joshfactorial commented 2 years ago

I found a small bug in master. I fixed that and now it outputs HOM variants when HOM variants were in the input vcf.

blajoie commented 2 years ago

Great - thanks for that @joshfactorial !