griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License
144 stars 59 forks source link

Annotation of Gene Expression for pVACseq #991

Closed axelgschwind closed 1 year ago

axelgschwind commented 1 year ago

Installation Type

Standalone

pVACtools Version / Docker Image

pvactools 3.1.3 / pvactools 4.0.0

Python Version

python 3.8

Operating System

Ubuntu 20.04.5 LTS

Describe the bug

Hi!

I annotated a somatic VCF file as described in the documentary. I added gene expression information in terms of gene ids. One SNV can affect multiple genes. Thus, the gene expression should be annotated as comma-separated values, as described in the documentation.

However, if one SNV has multiple genes and I annotate multiple comma-separated gene expression values, I get an error as copied into the log output.

I traced the problem to the file "input_file_converter.py". The algorithm of the method "execute(self) tries to unpack the expression values into a tuple of variables. However, it does not go into a loop (which would be necessary in the case of multiple gene_ids, as can also be seen in the case of transcript-based annotation):

for (tag, key, comparison_fields) in zip(['TX', 'GX'], ['transcript_expression', 'gene_expression'], [[transcript_name], [ensembl_gene_id, gene_name]]):
    if tag in self.vcf_reader.header.format_ids():
        if tag in genotype.data:
            expressions = genotype.data[tag]
            if isinstance(expressions, list):
                for expression in expressions:
                    (item, value) = expression.split('|')
                    for comparison_field in comparison_fields:
                        if item == comparison_field:
                            output_row[key] = value
            elif expressions is not None:
                (item, value) = expressions.split('|') ### THE ERROR OCCURS HERE ###
                for comparison_field in comparison_fields:
                    if item == comparison_field:
                        output_row[key] = value

My current workaround is to just use the first coding gene as a reference for the expression. However, I think it is desirable to fix the code or the documentation.

I attached a minimum not-working input VCF.

Cheers,

Axel

How to reproduce this bug

pvacseq run bug_report.vcf.gz DNSRR4289715_01 HLA-A*02:01,HLA-A*11:01,HLA-B*44:02,HLA-B*55:01,HLA-C*03:04,HLA-C*05:01 all bug_report -e1 8,9,10 -e2 15 --normal-sample-name DNSRR4289714_01

Input files

bug_report.vcf.gz

Log output

Executing MHC Class I predictions Converting .vcf to TSV TSV file already exists. Skipping. Converting VCF to TSV /mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/lib/python3.8/site-packages/vcfpy/parser.py:251: CannotConvertValue: 0/1 cannot be converted to Float, keeping as string. warnings.warn( Traceback (most recent call last): File "/mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/bin/pvacseq", line 8, in sys.exit(main()) File "/mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/lib/python3.8/site-packages/pvactools/tools/pvacseq/main.py", line 116, in main args[0].func.main(args[1]) File "/mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/lib/python3.8/site-packages/pvactools/tools/pvacseq/run.py", line 133, in main pipeline.execute() File "/mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/lib/python3.8/site-packages/pvactools/lib/pipeline.py", line 424, in execute self.generate_combined_fasta(self.fasta_file_path()) File "/mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/lib/python3.8/site-packages/pvactools/lib/pipeline.py", line 139, in generate_combined_fasta generate_combined_fasta.main(params) File "/mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/lib/python3.8/site-packages/pvactools/tools/pvacseq/generate_protein_fasta.py", line 161, in main proximal_variants_tsv = convert_vcf(args.input_vcf, temp_dir, args.sample_name, args.phased_proximal_variants_vcf, args.flanking_sequence_length, args.pass_only) File "/mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/lib/python3.8/site-packages/pvactools/tools/pvacseq/generate_protein_fasta.py", line 90, in convert_vcf converter.execute() File "/mnt/storage2/users/ahgscha1/phd/tools/pVACseq/.env/lib/python3.8/site-packages/pvactools/lib/input_file_converter.py", line 421, in execute (item, value) = expressions.split('|') ValueError: too many values to unpack (expected 2) make: *** [Makefile:18: create_bug_report] Error 1

Output files

No response

susannasiebert commented 1 year ago

Hi @axelgschwind, Thank you for the bug report. This seems to actually be a problem with the GX header definition in your VCF. It is set to Number=1, instead of Number=.. The latter would indicate to the VCF parser that this field is a list and if isinstance(expressions, list) would evaluate to True. I manually edited your VCF to fix this header and confirmed that the error no longer occurs.

Did you use the vcf-expression-annotator to create this file? I checked the source code for the vcf-expression-annotator and the header is supposed to be written as Number=. so I'm not sure why it's using the wrong format in your case. Can you please confirm which version of vatools you were using? Also, if you could attach the input files you provided for the vcf-expression-annotator and the exact command you ran, that would be great so I can try to replicate this on my end.

axelgschwind commented 1 year ago

Hi @susannasiebert ,

thank you very much for your help! I had written a small script annotating my VCFs. The script caused this small error in the VCF header. But with your help I could fix the error and solve the problem.

Therefore, I'm closing this issue now.

Best,

Axel