ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
37 stars 12 forks source link

Problem with vcf_compare_OLD.py #72

Closed georgiiprovisor closed 1 year ago

georgiiprovisor commented 1 year ago

Hello, I tried to use the tool for comparing "golden" vcf with produced by pipeline.

Command was

python NEAT/utilities/vcf_compare_OLD.py -r ../../hg38_res/Ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -g FEMALE_reads/NL_VUMC_KG-002592_gol
den.vcf.gz -w COMPARISON/PER_SAMPLE/GLnexus_on_Haplotypecallerdefault/NL_VUMC_KG-002592_WES.vcf.gz  -o COMPARISON/GLnexus_on_HC_vs_golden/NL_VUMC_KG-002592

And it ends with an error.

SUCCESS: deleting duplicate, unzipped vcf file (NL_VUMC_KG-002592_WES.vcf)
SUCCESS: saving false negative venn plot here (COMPARISON/GLnexus_on_HC_vs_golden/NL_VUMC_KG-002592_FNvenn.pdf)                                                                                           
Traceback (most recent call last):
  File "/gpfs/work3/0/qtholstg/Georgii_tests/NEAT_test/NEAT/utilities/vcf_compare_OLD.py", line 874, in <module>                                                                                          
    main()
  File "/gpfs/work3/0/qtholstg/Georgii_tests/NEAT_test/NEAT/utilities/vcf_compare_OLD.py", line 812, in main                                                                                              
    mpl.savefig(ouf)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/pyplot.py", line 954, in savefig                                                                                
    res = fig.savefig(*args, **kwargs)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/figure.py", line 3274, in savefig                                                                               
    self.canvas.print_figure(fname, **kwargs)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2338, in print_figure                                                                   
    result = print_method(
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2204, in <lambda>                                                                       
    print_method = functools.wraps(meth)(lambda *args, **kwargs: meth(
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/backends/backend_pdf.py", line 2808, in print_pdf                                                               
    file = PdfFile(filename, metadata=metadata)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/backends/backend_pdf.py", line 713, in __init__                                                                 
    fh, opened = cbook.to_filehandle(filename, "wb", return_opened=True)
  File "/home/gozhegov/miniconda3/envs/preprocess/lib/python3.10/site-packages/matplotlib/cbook/__init__.py", line 492, in to_filehandle                                                                  
    fh = open(fname, flag, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'COMPARISON/GLnexus_on_HC_vs_golden/NL_VUMC_KG-002592_FNvenn.pdf' 
joshfactorial commented 1 year ago

Does the folder COMPARISON/GLnexus_on_HC_vs_golden currently exist?

georgiiprovisor commented 1 year ago

Oh, it was typo in path. I create new dir with that name and it works.

But the output is still uninformative at all:

TotalGoldenVariants,0 FilteredGoldenVariants,0 MergedGoldenVariants,0 RedundantGoldenVariants,0 TotalWorkflowVariants,0 FilteredWorkflowVariants,0 MergedWorkflowVariants,0 RedundantWorkflowVariants,0 EquivalentVariants,0

And the head of "golden" vcf is here

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    13029   .       T       C       .       PASS    WP=1/0
chr1    13273   .       G       C       .       PASS    WP=1/1
chr1    13417   .       C       CGAGA   .       PASS    WP=1/1
chr1    13649   .       G       C       .       PASS    WP=0/1
chr1    14653   .       C       T       .       PASS    WP=0/1
chr1    14666   .       G       A       .       PASS    WP=0/1
chr1    14677   .       G       A       .       PASS    WP=0/1
chr1    14907   .       A       G       .       PASS    WP=0/1
chr1    14930   .       A       G       .       PASS    WP=0/1

And here is head of compared vcf:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NL_VUMC_KG-002592
chr1    13273   chr1_13273_G_C  G       C       70      .       AC=1;AF=0.500;AN=2;AQ=70;DP=2   GT:AD:DP:GQ:PL:RNC      0/1:0,2:2:0:70,6,0
chr1    13417   chr1_13417_C_CGAGA      C       CGAGA   65      .       AC=1;AF=0.500;AN=2;AQ=65;DP=2   GT:AD:DP:GQ:PL:RNC      0/1:0,2:2:0:65,6,0
chr1    69552   chr1_69552_G_C  G       C       98      .       AC=2;AF=1.00;AN=2;AQ=98;DP=10   GT:AD:DP:GQ:PL:RNC      1/1:0,10:10:25:98,30,0
chr1    139169  chr1_139169_T_C T       C       39      .       AC=1;AF=0.500;AN=2;AQ=39;DP=6   GT:AD:DP:GQ:PL:RNC      0/1:3,3:6:4:39,0,26
chr1    139213  chr1_139213_A_G A       G       92      .       AC=1;AF=0.500;AN=2;AQ=92;DP=7   GT:AD:DP:GQ:PL:RNC      0/1:3,4:7:10:45,0,26
chr1    139233  chr1_139233_C_A C       A       98      .       AC=1;AF=0.500;AN=2;AQ=98;DP=9   GT:AD:DP:GQ:PL:RNC      0/1:4,5:9:10:45,0,26
chr1    139257  chr1_139257_T_C T       C       69      .       AC=1;AF=0.500;AN=2;AQ=69;DP=10  GT:AD:DP:GQ:PL:RNC      0/1:4,6:10:27:69,0,25
chr1    494500  chr1_494500_C_T C       T       73      .       AC=1;AF=0.500;AN=2;AQ=73;DP=6   GT:AD:DP:GQ:PL:RNC      0/1:1,5:6:24:73,0,21
chr1    786333  chr1_786333_G_A G       A       69      .       AC=1;AF=0.500;AN=2;AQ=69;DP=13  GT:AD:DP:GQ:PL:RNC      0/1:6,7:13:30:69,0,29

I tried with "--incl-fail" option, but result is the same.

joshfactorial commented 1 year ago

This older code is very sensitive to chromosome names, as I haven't been able to do much with this section yet. Check that the reference uses the same "chr1" chromosome naming scheme. It may be necessary to simplify the chromosome names in the reference down to just "chr1", "chr2" etc.

georgiiprovisor commented 1 year ago

So, remove everything after 'chr1' in

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38

?

Thank you for fast responses! I will try to do it.

joshfactorial commented 1 year ago

Exactly.

georgiiprovisor commented 1 year ago

It works with simplified chr names, thank you!