Issue with Pathogen Profiler combine_vcf_variants.py script

taranewman commented 5 months ago

Hello,

I came across an error with the Pathogen Profiler combine_vcf_variants.py script that seems to occur in approximately half of my samples with TBProfiler v6.2.0. The same samples previously ran successfully using v4.3.0. The samples causing this error don't appear to have a clear lineage/QC pattern.

An SRA sample that produces this error is SRR10869015

If line 171 is commented out, then everything appears to run fine.

System specifications: conda, Linux HPC, SLURM

Command Failed:
/bin/bash -c set -o pipefail; bcftools view -c 1 -a <>targets_for_profile.vcf.gz | bcftools view -v snps | combine_vcf_variants.py --ref tbprofiler//tbdb.fasta --gff tbdb.gff --bam <>.bam |  snpEff ann -dataDir snpeff-5.2-0/data -noLog -noStats Mycobacterium_tuberculosis_h37rv -  | bcftools sort -Oz -o <>.vcf.gz && bcftools index <>vcf.gz
stderr:
Writing to ...
Traceback (most recent call last):
  File "<path to conda env> bin/combine_vcf_variants.py", line 171, in <module>
    variant.info.update({'AF':count/dp})
  File "pysam/libcbcf.pyx", line 2798, in pysam.libcbcf.VariantRecordInfo.update
  File "pysam/libcbcf.pyx", line 2621, in pysam.libcbcf.VariantRecordInfo.__setitem__
  File "pysam/libcbcf.pyx", line 698, in pysam.libcbcf.bcf_info_set_value
KeyError: 'unknown INFO: AF'
[E::bcf_hdr_read] Input is not detected as bcf or vcf format
Could not read VCF/BCF headers from -
Cleaning

jodyphelan commented 5 months ago

Hi @taranewman ,

Thanks for reporting this. I'll take a look now.

jodyphelan commented 5 months ago

@taranewman , can you give me the command you used to download the reads and run tb-profiler? I ran it through without any error on my system but maybe there might be some differences in our commands

taranewman commented 5 months ago

Hi @jodyphelan ,

Thanks so much for taking a look.

These are the commands:

Download the reads: curl -L http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR108/015/SRR10869015/SRR10869015_1.fastq.gz -o SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_1.fastq.gz

curl -L http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR108/015/SRR10869015/SRR10869015_2.fastq.gz -o SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_2.fastq.gz

Run fastp fastp --cut_tail -i SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_1.fastq.gz -I SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_2.fastq.gz -o SRR10869015_trimmed_R1.fastq.gz -O SRR10869015_trimmed_R2.fastq.gz

Run tb-profiler:

tb-profiler profile --threads 8 --platform illumina --mapper bwa --caller bcftools --depth 10 --af 0.1 --read1 SRR10869015_trimmed_R1.fastq.gz --read2 SRR10869015_trimmed_R1.fastq.gz --prefix SRR10869015 --csv --call_whole_genome

Please let me know if you need any other info!

jodyphelan commented 5 months ago

Thanks, running those commands still seems to work for me to it might be the versions of one of the packages? If you are using conda can you export the environemtent with conda list --explicit and attach the file here?

taranewman commented 5 months ago

tbprofiler-conda-env.txt

Thanks for testing that out! Attached is the conda environment.

This has bcftools v1.20. I've tried downgrading this environment to use bcftools v1.12 which was the version we were using prior to the update but didn't have luck there.

jodyphelan commented 5 months ago

Oddly enough it still seems to run fine for me

Could you run it with --no_clean and send the targets_for_profile.vcf.gz file?

taranewman commented 5 months ago

targets_for_profile.vcf.gz

Good to know, thank you for your time checking this!

I am running this within a nextflow wrapper so I'll look further on my side if there is something with nextflow that could be causing this.

taranewman commented 5 months ago

Hi again,

I manually ran

bcftools view -c 1 -a targets_for_profile.vcf.gz | bcftools view -v snps

to produce BCFTOOLS_INPUT_FOR_SCRIPT.vcf.gz

I then used this vcf file as input into the combine_vcf_variants.py script within a jupyter notebook.

I'm not sure why the error only occurred when running with the nextflow wrapper, but it looks like 'AF' is missing from the variant info here:

Do you think adding a line if 'AF' in variant.info: , similar to the 'DP4' line, could be appropriate here?

taranewman commented 5 months ago

One thing I did notice was that the vcf file produced when running tbprofiler on this SRA sample outside of nextflow was that the VCF file was empty

Whereas the file produced when running the same command within nextflow had many more SNPs:

In this issue here it seems the AF column may need to be calculated first by the AN and AC columns? https://github.com/samtools/bcftools/issues/1060

jodyphelan commented 5 months ago

Hi again, just double checking - are you using bcftools to variant call or the freebayes default? If you are using bcftools then it might explain the error as the AF attribute isn't generated and as a consequence when the combine_vcf_variants.py script tries to add it, it fails due to it not being defined in the vcf header. A fix which should work is to add the definition if it doesnt exist.

if "AF" not in vcf.header.info.keys():
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
    vcf.header.add_line('##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">')

Seems to work on my end.

Bcftools also seems to be giving an empty vcf for me so I'll investigate that.

taranewman commented 5 months ago

Thanks Jody! Yes, I am using bcftools.

jodyphelan commented 5 months ago

Thanks, I think we're getting closer.

When bcftools doesn't call any variants then the combine_vcf_variants.py script doesn't have any variants to analyse and doesn't cause any error. Which might explain why you sometimes get the error and sometimes not.

Why bcftools isn't working I'm not entirely sure yet. I noticed in your commands you seem to be using the forward read twice, is this a typo?

tb-profiler profile --threads 8 --platform illumina --mapper bwa --caller bcftools --depth 10 --af 0.1 --read1 SRR10869015_trimmed_R1.fastq.gz  --read2 SRR10869015_trimmed_R1.fastq.gz  --prefix SRR10869015  --csv  --call_whole_genome

This might explain why bcftools doesn't call any variants.

taranewman commented 5 months ago

Oops so sorry! Yes that was a typo, thanks for noticing that!

When I fix the command to use the reverse read, I'm getting the same variants in the vcf file and AF error I got when running within the nextflow wrapper. Mystery solved :)

jodyphelan commented 5 months ago

Great, I'll release the patch this week

taranewman commented 4 months ago

Great, I'll release the patch this week

Thank you Jody! Will this be a pathogen-profiler v4.1.1 patch release?

jodyphelan commented 4 months ago

I've made a release as v4.2.0 as there were are few bigger things I changed in pathogen-profiler. This will be paired with tb-profiler v6.2.1. Should be available on conda tomorrow.

jodyphelan / TBProfiler

Issue with Pathogen Profiler combine_vcf_variants.py script #345