sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License
66 stars 13 forks source link

VCF contig error #130

Closed evyamor closed 5 months ago

evyamor commented 6 months ago

Hello, when using an outside prepared VCF file ( .vcf.gz & .tbi ) and executing 'run-ngs-pipeline' I keep getting this type of error:

Traceback (most recent call last): File "/home/ubuntu/.local/bin/pypgx", line 8, in sys.exit(main()) File "/home/ubuntu/.local/lib/python3.8/site-packages/pypgx/main.py", line 33, in main commands[args.command].main(args) File "/home/ubuntu/.local/lib/python3.8/site-packages/pypgx/cli/run_ngs_pipeline.py", line 159, in main pipeline.run_ngs_pipeline( File "/home/ubuntu/.local/lib/python3.8/site-packages/pypgx/api/pipeline.py", line 230, in run_ngs_pipeline imported_variants = utils.import_variants(gene, variants, File "/home/ubuntu/.local/lib/python3.8/site-packages/pypgx/api/utils.py", line 1041, in import_variants vf = pyvcf.VcfFrame.from_file(vcf, regions=region) File "/home/ubuntu/.local/lib/python3.8/site-packages/fuc/api/pyvcf.py", line 2356, in from_file s = slice(fn, regions) File "/home/ubuntu/.local/lib/python3.8/site-packages/fuc/api/pyvcf.py", line 1394, in slice for record in vcf.fetch(chrom, start, end): File "pysam/libcbcf.pyx", line 4466, in pysam.libcbcf.VariantFile.fetch File "pysam/libchtslib.pyx", line 683, in pysam.libchtslib.HTSFile.parse_region ValueError: invalid contig 12

For all contigs. Note, all entries have 'chr' prefix, e.g. chr12 And the contigs themselves looks like

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig= ...........

Both in a VCF prepared with DRAGEN, and for comparison purposes also a generated VCF from a BAM file with PYPGX. I'm trying to understand what's causing this error and how to fix this issue In both VCFs the contigs and entries are written the same way.

Thanks for your time, Evyatar

sbslee commented 6 months ago

Hi @evyamor,

Could you please send me the input VCF file and the exact command line you used to replicate the error at sbstevenlee@gmail.com? It is difficult to solve the problem without seeing the data.

evyamor commented 6 months ago

Hey, Thank you for the response. I sent the VCF file and the CLI to your email.

Much appreciated, Thank you, Evyatar

sbslee commented 5 months ago

@evyamor,

Thank you so much for providing the requested information! It really helped me identify the issue. The problem was caused by the HLA contigs present in your input VCF. During the pre-processing step, PyPGx first checks whether the input VCF has the chr string or not (e.g., chr2 vs. 2). Your VCF file happens to have variants from the HLA contigs as the top records:

...
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  KU_01539
HLA-DRB1*03:01:01:01    7926    .   C   G   4.45    .   AC=1;AF=0.5;AN=2;DP=21;FS=2.172;MQ=98.91;MQRankSum=-3.344;QD=0.96;ReadPosRankSum=2.106;SOR=1.609;FractionInformativeReads=1 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB    0/1:16,5:0.238:21:7,0:9,5:4:37,0,24:4.4539,1.9373,28.946:0,34.77,37.77:7,9,1,4:8,8,4,1
...
chr1    19776   .   A   G   27.21   .   AC=2;AF=1;AN=2;DP=1;FS=0;MQ=10;MQRankSum=0;QD=7.78;ReadPosRankSum=0;SOR=1.609;FractionInformativeReads=1    GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB    1/1:0,1:1:1:0,1:0,0:24:65,29,0:27.21,26.131,0.018882:0,34.77,37.77:0,0,0,1:0,0,0,1
...

When PyPGx saw this, it automatically determined that the input VCF file does not have the chr string, when clearly your VCF file does. Therefore, I updated the fuc.pyvcf.has_chr_prefix method (see the commit) in the 0.38.0-dev branch to ignore the HLA contigs when determining whether the input VCF has the chr string. I confirm that after this fix everything works as expected (see the PyPGx results for CYP2C9 I sent you via email, as an example).

At this point, you have two options: You can either use the development version of the fuc package (i.e., 0.38.0-dev) or you can remove the HLA variants from your VCF file. Both should work.

Thank you again for reporting the issue. Please let me know if you have any questions.

Best regards, Steven

evyamor commented 5 months ago

Thank you very much for checking these files and fixing the necessary functions. I will remove the HLA contigs from my current VCF file as suggested. Your assistance has been invaluable and is most appreciated. Thank you again.

Best regards, Evyatar

sbslee commented 5 months ago

@evyamor,

FYI, the 0.38.0 version of the fuc package has just been released. See this PR for details. You don't need to remove the HLA contigs from the VCF file anymore.