ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
38 stars 12 forks source link

Error simulating reads using input VCF for human Y chromosome #28

Closed angelat1020 closed 2 years ago

angelat1020 commented 2 years ago

Hello, I am using NEAT to simulate paired-end reads using the human reference genome GRCh38 and variants from a VCF from a 1000 Genomes individual for chromosomes Y. When running NEAT, I get the following error:

# Chromosome Y (ploidy set to 1)

~/miniconda3/envs/NEAT_env/bin/python3.8 ~/packages/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa -R 101 -o NA07048_NEAT_simulated_chrY_10x_haploid --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf -M 0 -p 1

Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.fai
--------------------------------
reading input VCF...

found 1273 valid variants in input vcf.
 * 167631 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 0 variants skipped due to multiple variants found per position
--------------------------------
reading chrY...

1379.071 (sec)
found 1273 valid variants for chrY in input VCF...
--------------------------------
sampling reads...
[Traceback (most recent call last):
  File "/home/amtarave/packages/NEAT/gen_reads.py", line 902, in <module>
    main()
  File "/home/amtarave/packages/NEAT/gen_reads.py", line 549, in main
    print(f'PROCESSING WINDOW: {(start, end), [buffer_added]}, '
UnboundLocalError: local variable 'buffer_added' referenced before assignment

I am not sure what is causing this UnboundLocalError: local variable 'buffer_added' referenced before assignment error.

Would you be able to assist me with troubleshooting this?

I am attaching here the reference genome (chromosome Y only) and the VCF I used in the above command.

20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz

GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz

Thank you.

Best, Angela

angelat1020 commented 2 years ago

Hi, I would just like to follow up on this.

I cloned the most resent version of this repository this morning and re-ran my command, however I am now getting a different error. Also, in both of these scenarios, there are output files but they are empty.

~/miniconda3/envs/NEAT_env/bin/python3.8 /home/amtarave/packages/NEAT_2021-09-14/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa -R 101 -o NA07048_NEAT_simulated_chrY_10x_haploid_NEAT_2021-09-14 --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf -M 0 -p 1
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.fai
--------------------------------
reading input VCF...

found 1273 valid variants in input vcf.
 * 167631 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 0 variants skipped due to multiple variants found per position
--------------------------------
reading chrY... 
1689.526 (sec)
found 1273 valid variants for chrY in input VCF...
--------------------------------
sampling reads...
[Traceback (most recent call last):
  File "/home/amtarave/packages/NEAT_2021-09-14/NEAT/gen_reads.py", line 902, in <module>
    main()
  File "/home/amtarave/packages/NEAT_2021-09-14/NEAT/gen_reads.py", line 610, in main
    [cig for cig in sequences.all_cigar[1] if len(cig) != 100]:
IndexError: list index out of range

I also tried NEAT on chromosome 21 and it ran successfully, so I am not sure if maybe its a ploidy issue.

~/miniconda3/envs/NEAT_env/bin/python3.8 ~/packages/NEAT_2021-09-14/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chr21.fa -R 101 -o NA07048_NEAT_simulated_chr21_10x_diploid_NEAT_2021-09-14 --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr21.recalibrated_variants_NA07048.vcf -M 0
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chr21.fa.fai
--------------------------------
reading input VCF...

found 51171 valid variants in input vcf.
 * 1502635 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 0 variants skipped due to multiple variants found per position
--------------------------------
reading chr21...
281.601 (sec)
found 51171 valid variants for chr21 in input VCF...
--------------------------------
sampling reads...
[----------]
Read sampling completed in 3825 (sec)
Writing output VCF...

Would you be able to help get this to run successfully and output simulated FASTQ files for the Y chromosome?

Thank you.

joshfactorial commented 2 years ago

First thing I would check is if the naming convention is the same between your reference and the VCF. Beyond that it could be a problem with ploidy, as we haven't tested that extensively yet. If that doesn't solve the problem, I will need your input files (assuming they are sharable). If we can't share them, I'll experiment with a similar setup.

angelat1020 commented 2 years ago

The naming in the reference and VCF is the same (chrY).

head GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa
>chrY
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

grep -v "#" 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf | head
chrY    2781554 .       G       A       70.37   VQSRTrancheSNP99.80to100.00     AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=1;AC_AMR_unrel=1;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0.00446429;AF_AMR_unrel=0.00588235;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=4;FS=0;MQ=51.43;MQ0=0;NEGATIVE_TRAIN_SITE;QD=17.59;SOR=3.258;VQSLOD=-19.63;culprit=SOR       GT:AD:DP:GQ:PL  0:4,0:4:99:0,104
chrY    2781622 .       C       T       261.37  VQSRTrancheSNP99.80to100.00     AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=1;AC_SAS_unrel=1;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0.0031348;AF_SAS_unrel=0.00384615;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=4;FS=0;MQ=60;MQ0=0;QD=34.04;SOR=4.174;VQSLOD=-32.73;culprit=SOR       GT:AD:DP:GQ:PL  0:4,0:4:99:0,104
chrY    2781625 .       A       G       261.37  VQSRTrancheSNP99.80to100.00     AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=1;AC_SAS_unrel=1;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0.0031348;AF_SAS_unrel=0.00384615;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=4;FS=0;MQ=60;MQ0=0;QD=29.82;SOR=4.174;VQSLOD=-31.64;culprit=SOR       GT:AD:DP:GQ:PL  0:4,0:4:99:0,104
chrY    2781639 .       A       G       508.37  VQSRTrancheSNP99.80to100.00     AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=1;AC_AMR_unrel=1;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0.00446429;AF_AMR_unrel=0.00588235;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=4;FS=0;MQ=58.11;MQ0=0;QD=29.9;SOR=3.383;VQSLOD=-21.19;culprit=SOR    GT:AD:DP:GQ:PL  0:4,0:4:99:0,104
chrY    2781749 .       G       A       441.46  VQSRTrancheSNP99.80to100.00     AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=1;AC_AMR_unrel=1;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0.00452489;AF_AMR_unrel=0.00595238;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=446;AN_AFR_unrel=310;AN_AMR=221;AN_AMR_unrel=168;AN_EAS=289;AN_EAS_unrel=241;AN_EUR=298;AN_EUR_unrel=235;AN_SAS=312;AN_SAS_unrel=255;DP=10;FS=0;MQ=16.14;MQ0=0;QD=33.49;SOR=0.859;VQSLOD=-21.68;culprit=MQ   GT:AD:DP:GQ:PL  0:10,0:10:90:0,90
chrY    2781758 .       C       T       865.09  VQSRTrancheSNP99.80to100.00     AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=1;AC_AMR_unrel=1;AC_EAS=1;AC_EAS_unrel=1;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0.00502513;AF_AMR_unrel=0.00657895;AF_EAS=0.003663;AF_EAS_unrel=0.00440529;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=421;AN_AFR_unrel=292;AN_AMR=199;AN_AMR_unrel=152;AN_EAS=273;AN_EAS_unrel=227;AN_EUR=273;AN_EUR_unrel=212;AN_SAS=301;AN_SAS_unrel=250;DP=10;FS=0;MQ=9.36;MQ0=0;QD=26.86;SOR=1.051;VQSLOD=-25.28;culprit=MQ    GT:AD:DP:GQ:PL  0:10,0:10:90:0,90
chrY    2781815 .       G       A       260.37  PASS    AC=0;AC_AFR=1;AC_AFR_unrel=1;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0.00218818;AF_AFR_unrel=0.0031348;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=5;FS=0;MQ=60;MQ0=0;NEGATIVE_TRAIN_SITE;QD=28.93;SOR=1.402;VQSLOD=-6.317;culprit=DP    GT:AD:DP:GQ:PL  0:5,0:5:99:0,174
chrY    2781824 .       C       T       1015.35 PASS    AC=0;AC_AFR=2;AC_AFR_unrel=1;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0.00437637;AF_AFR_unrel=0.0031348;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=5;FS=0;MQ=59.51;MQ0=0;NEGATIVE_TRAIN_SITE;QD=30.77;SOR=1.542;VQSLOD=-7.015;culprit=DP GT:AD:DP:GQ:PL  0:5,0:5:99:0,174
chrY    2782061 .       C       A       940.35  PASS    AC=0;AC_AFR=2;AC_AFR_unrel=2;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0.00437637;AF_AFR_unrel=0.00626959;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=5;FS=0;MQ=60;MQ0=0;NEGATIVE_TRAIN_SITE;QD=31.35;SOR=1.536;VQSLOD=-6.307;culprit=DP   GT:AD:DP:GQ:PL  0:5,0:5:99:0,174
chrY    2782660 .       G       A       1783.10 PASS    AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=3;AC_EUR_unrel=3;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0.00983607;AF_EUR_unrel=0.0125;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=5;FS=0;MQ=60;MQ0=0;NEGATIVE_TRAIN_SITE;QD=32.42;SOR=1.371;VQSLOD=-6.652;culprit=DP       GT:AD:DP:GQ:PL  0:5,0:5:99:0,174

These are publicly available files, so I can share the input files here. I have zipped both the reference and VCF and attached them here. Please let me know if you cannot download them.

Reference fasta: GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz

Chr Y VCF: 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz

joshfactorial commented 2 years ago

Okay, I think I figured it out. The cigar string is due to the fact that the code assumed a ploidy = 2. I commented out that code. I think that code was for debugging in the first place, so I removed the offending lines. I also fixed the vcf function so that it can read in gzipped files. It ran to completion without that. I pushed those changes to the master branch.

joshfactorial commented 2 years ago

Let me know if that doesn't work.

angelat1020 commented 2 years ago

Great! I re-ran this and it worked with the unzipped VCF (but not with the gzipped VCF).

#-----------------#
# With zipped VCF
#-----------------#
~/miniconda3/envs/NEAT_env/bin/python3.8 /home/amtarave/packages/NEAT_2021-09-15/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz -R 101 -o NA07048_NEAT_simulated_chrY_10x_haploid_NEAT_2021-09-15 --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz -M 0 -p 1
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.fai
--------------------------------
reading input VCF...

ERROR: VCF has no header?
20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz

#--------------------#
# With unzipped VCF #
#--------------------#
~/miniconda3/envs/NEAT_env/bin/python3.8 /home/amtarave/packages/NEAT_2021-09-15/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz -R 101 -o NA07048_NEAT_simulated_chrY_10x_haploid_NEAT_2021-09-15 --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf -M 0 -p 1
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.fai
--------------------------------
reading input VCF...

found 1273 valid variants in input vcf.
 * 167631 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 0 variants skipped due to multiple variants found per position
--------------------------------
reading chrY...
988.526 (sec)
found 1273 valid variants for chrY in input VCF...
--------------------------------
sampling reads...
[----------]
Read sampling completed in 1284 (sec)
Writing output VCF...

Thank you for your help with this!