Closed angelat1020 closed 2 years ago
Hi, I would just like to follow up on this.
I cloned the most resent version of this repository this morning and re-ran my command, however I am now getting a different error. Also, in both of these scenarios, there are output files but they are empty.
~/miniconda3/envs/NEAT_env/bin/python3.8 /home/amtarave/packages/NEAT_2021-09-14/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa -R 101 -o NA07048_NEAT_simulated_chrY_10x_haploid_NEAT_2021-09-14 --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf -M 0 -p 1
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.fai
--------------------------------
reading input VCF...
found 1273 valid variants in input vcf.
* 167631 variants skipped: (qual filtered / ref genotypes / invalid syntax)
* 0 variants skipped due to multiple variants found per position
--------------------------------
reading chrY...
1689.526 (sec)
found 1273 valid variants for chrY in input VCF...
--------------------------------
sampling reads...
[Traceback (most recent call last):
File "/home/amtarave/packages/NEAT_2021-09-14/NEAT/gen_reads.py", line 902, in <module>
main()
File "/home/amtarave/packages/NEAT_2021-09-14/NEAT/gen_reads.py", line 610, in main
[cig for cig in sequences.all_cigar[1] if len(cig) != 100]:
IndexError: list index out of range
I also tried NEAT on chromosome 21 and it ran successfully, so I am not sure if maybe its a ploidy issue.
~/miniconda3/envs/NEAT_env/bin/python3.8 ~/packages/NEAT_2021-09-14/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chr21.fa -R 101 -o NA07048_NEAT_simulated_chr21_10x_diploid_NEAT_2021-09-14 --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr21.recalibrated_variants_NA07048.vcf -M 0
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chr21.fa.fai
--------------------------------
reading input VCF...
found 51171 valid variants in input vcf.
* 1502635 variants skipped: (qual filtered / ref genotypes / invalid syntax)
* 0 variants skipped due to multiple variants found per position
--------------------------------
reading chr21...
281.601 (sec)
found 51171 valid variants for chr21 in input VCF...
--------------------------------
sampling reads...
[----------]
Read sampling completed in 3825 (sec)
Writing output VCF...
Would you be able to help get this to run successfully and output simulated FASTQ files for the Y chromosome?
Thank you.
First thing I would check is if the naming convention is the same between your reference and the VCF. Beyond that it could be a problem with ploidy, as we haven't tested that extensively yet. If that doesn't solve the problem, I will need your input files (assuming they are sharable). If we can't share them, I'll experiment with a similar setup.
The naming in the reference and VCF is the same (chrY).
head GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa
>chrY
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
grep -v "#" 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf | head
chrY 2781554 . G A 70.37 VQSRTrancheSNP99.80to100.00 AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=1;AC_AMR_unrel=1;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0.00446429;AF_AMR_unrel=0.00588235;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=4;FS=0;MQ=51.43;MQ0=0;NEGATIVE_TRAIN_SITE;QD=17.59;SOR=3.258;VQSLOD=-19.63;culprit=SOR GT:AD:DP:GQ:PL 0:4,0:4:99:0,104
chrY 2781622 . C T 261.37 VQSRTrancheSNP99.80to100.00 AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=1;AC_SAS_unrel=1;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0.0031348;AF_SAS_unrel=0.00384615;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=4;FS=0;MQ=60;MQ0=0;QD=34.04;SOR=4.174;VQSLOD=-32.73;culprit=SOR GT:AD:DP:GQ:PL 0:4,0:4:99:0,104
chrY 2781625 . A G 261.37 VQSRTrancheSNP99.80to100.00 AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=1;AC_SAS_unrel=1;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0.0031348;AF_SAS_unrel=0.00384615;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=4;FS=0;MQ=60;MQ0=0;QD=29.82;SOR=4.174;VQSLOD=-31.64;culprit=SOR GT:AD:DP:GQ:PL 0:4,0:4:99:0,104
chrY 2781639 . A G 508.37 VQSRTrancheSNP99.80to100.00 AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=1;AC_AMR_unrel=1;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0.00446429;AF_AMR_unrel=0.00588235;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=4;FS=0;MQ=58.11;MQ0=0;QD=29.9;SOR=3.383;VQSLOD=-21.19;culprit=SOR GT:AD:DP:GQ:PL 0:4,0:4:99:0,104
chrY 2781749 . G A 441.46 VQSRTrancheSNP99.80to100.00 AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=1;AC_AMR_unrel=1;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0.00452489;AF_AMR_unrel=0.00595238;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=446;AN_AFR_unrel=310;AN_AMR=221;AN_AMR_unrel=168;AN_EAS=289;AN_EAS_unrel=241;AN_EUR=298;AN_EUR_unrel=235;AN_SAS=312;AN_SAS_unrel=255;DP=10;FS=0;MQ=16.14;MQ0=0;QD=33.49;SOR=0.859;VQSLOD=-21.68;culprit=MQ GT:AD:DP:GQ:PL 0:10,0:10:90:0,90
chrY 2781758 . C T 865.09 VQSRTrancheSNP99.80to100.00 AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=1;AC_AMR_unrel=1;AC_EAS=1;AC_EAS_unrel=1;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0.00502513;AF_AMR_unrel=0.00657895;AF_EAS=0.003663;AF_EAS_unrel=0.00440529;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=421;AN_AFR_unrel=292;AN_AMR=199;AN_AMR_unrel=152;AN_EAS=273;AN_EAS_unrel=227;AN_EUR=273;AN_EUR_unrel=212;AN_SAS=301;AN_SAS_unrel=250;DP=10;FS=0;MQ=9.36;MQ0=0;QD=26.86;SOR=1.051;VQSLOD=-25.28;culprit=MQ GT:AD:DP:GQ:PL 0:10,0:10:90:0,90
chrY 2781815 . G A 260.37 PASS AC=0;AC_AFR=1;AC_AFR_unrel=1;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0.00218818;AF_AFR_unrel=0.0031348;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=5;FS=0;MQ=60;MQ0=0;NEGATIVE_TRAIN_SITE;QD=28.93;SOR=1.402;VQSLOD=-6.317;culprit=DP GT:AD:DP:GQ:PL 0:5,0:5:99:0,174
chrY 2781824 . C T 1015.35 PASS AC=0;AC_AFR=2;AC_AFR_unrel=1;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0.00437637;AF_AFR_unrel=0.0031348;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=5;FS=0;MQ=59.51;MQ0=0;NEGATIVE_TRAIN_SITE;QD=30.77;SOR=1.542;VQSLOD=-7.015;culprit=DP GT:AD:DP:GQ:PL 0:5,0:5:99:0,174
chrY 2782061 . C A 940.35 PASS AC=0;AC_AFR=2;AC_AFR_unrel=2;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=0;AC_EUR_unrel=0;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0.00437637;AF_AFR_unrel=0.00626959;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0;AF_EUR_unrel=0;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=5;FS=0;MQ=60;MQ0=0;NEGATIVE_TRAIN_SITE;QD=31.35;SOR=1.536;VQSLOD=-6.307;culprit=DP GT:AD:DP:GQ:PL 0:5,0:5:99:0,174
chrY 2782660 . G A 1783.10 PASS AC=0;AC_AFR=0;AC_AFR_unrel=0;AC_AMR=0;AC_AMR_unrel=0;AC_EAS=0;AC_EAS_unrel=0;AC_EUR=3;AC_EUR_unrel=3;AC_SAS=0;AC_SAS_unrel=0;AF=0.00;AF_AFR=0;AF_AFR_unrel=0;AF_AMR=0;AF_AMR_unrel=0;AF_EAS=0;AF_EAS_unrel=0;AF_EUR=0.00983607;AF_EUR_unrel=0.0125;AF_SAS=0;AF_SAS_unrel=0;AN=1;AN_AFR=457;AN_AFR_unrel=319;AN_AMR=224;AN_AMR_unrel=170;AN_EAS=292;AN_EAS_unrel=244;AN_EUR=305;AN_EUR_unrel=240;AN_SAS=319;AN_SAS_unrel=260;DP=5;FS=0;MQ=60;MQ0=0;NEGATIVE_TRAIN_SITE;QD=32.42;SOR=1.371;VQSLOD=-6.652;culprit=DP GT:AD:DP:GQ:PL 0:5,0:5:99:0,174
These are publicly available files, so I can share the input files here. I have zipped both the reference and VCF and attached them here. Please let me know if you cannot download them.
Reference fasta: GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz
Chr Y VCF: 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz
Okay, I think I figured it out. The cigar string is due to the fact that the code assumed a ploidy = 2. I commented out that code. I think that code was for debugging in the first place, so I removed the offending lines. I also fixed the vcf function so that it can read in gzipped files. It ran to completion without that. I pushed those changes to the master branch.
Let me know if that doesn't work.
Great! I re-ran this and it worked with the unzipped VCF (but not with the gzipped VCF).
#-----------------#
# With zipped VCF
#-----------------#
~/miniconda3/envs/NEAT_env/bin/python3.8 /home/amtarave/packages/NEAT_2021-09-15/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz -R 101 -o NA07048_NEAT_simulated_chrY_10x_haploid_NEAT_2021-09-15 --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz -M 0 -p 1
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.fai
--------------------------------
reading input VCF...
ERROR: VCF has no header?
20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz
#--------------------#
# With unzipped VCF #
#--------------------#
~/miniconda3/envs/NEAT_env/bin/python3.8 /home/amtarave/packages/NEAT_2021-09-15/NEAT/gen_reads.py -r GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz -R 101 -o NA07048_NEAT_simulated_chrY_10x_haploid_NEAT_2021-09-15 --vcf --pe 300 30 -v 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf -M 0 -p 1
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.fai
--------------------------------
reading input VCF...
found 1273 valid variants in input vcf.
* 167631 variants skipped: (qual filtered / ref genotypes / invalid syntax)
* 0 variants skipped due to multiple variants found per position
--------------------------------
reading chrY...
988.526 (sec)
found 1273 valid variants for chrY in input VCF...
--------------------------------
sampling reads...
[----------]
Read sampling completed in 1284 (sec)
Writing output VCF...
Thank you for your help with this!
Hello, I am using NEAT to simulate paired-end reads using the human reference genome GRCh38 and variants from a VCF from a 1000 Genomes individual for chromosomes Y. When running NEAT, I get the following error:
I am not sure what is causing this
UnboundLocalError: local variable 'buffer_added' referenced before assignment
error.Would you be able to assist me with troubleshooting this?
I am attaching here the reference genome (chromosome Y only) and the VCF I used in the above command.
20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants_NA07048.vcf.gz
GRCh38_full_analysis_set_plus_decoy_hla_chrY.fa.gz
Thank you.
Best, Angela