Closed taranewman closed 5 months ago
Hi @taranewman ,
Thanks for reporting this. I'll take a look now.
@taranewman , can you give me the commands you used to download the reads and run tb-profiler? I ran it through without any error on my system, but there may be some differences in our commands.
Hi @jodyphelan ,
Thanks so much for taking a look.
These are the commands:
Download the reads:
curl -L http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR108/015/SRR10869015/SRR10869015_1.fastq.gz -o SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_1.fastq.gz
curl -L http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR108/015/SRR10869015/SRR10869015_2.fastq.gz -o SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_2.fastq.gz
Run fastp
fastp --cut_tail -i SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_1.fastq.gz -I SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_2.fastq.gz -o SRR10869015_trimmed_R1.fastq.gz -O SRR10869015_trimmed_R2.fastq.gz
Run tb-profiler:
tb-profiler profile --threads 8 --platform illumina --mapper bwa --caller bcftools --depth 10 --af 0.1 --read1 SRR10869015_trimmed_R1.fastq.gz --read2 SRR10869015_trimmed_R1.fastq.gz --prefix SRR10869015 --csv --call_whole_genome
Please let me know if you need any other info!
Thanks, running those commands still seems to work for me, so it might be the version of one of the packages?
If you are using conda, can you export the environment with conda list --explicit and attach the file here?
Thanks for testing that out! Attached is the conda environment.
This has bcftools v1.20. I've tried downgrading this environment to use bcftools v1.12 which was the version we were using prior to the update but didn't have luck there.
Oddly enough it still seems to run fine for me
Could you run it with --no_clean and send the targets_for_profile.vcf.gz file?
Good to know, thank you for your time checking this!
I am running this within a nextflow wrapper so I'll look further on my side if there is something with nextflow that could be causing this.
Hi again,
I manually ran
bcftools view -c 1 -a targets_for_profile.vcf.gz | bcftools view -v snps
to produce BCFTOOLS_INPUT_FOR_SCRIPT.vcf.gz
I then used this vcf file as input into the combine_vcf_variants.py script within a jupyter notebook.
I'm not sure why the error only occurred when running with the nextflow wrapper, but it looks like 'AF' is missing from the variant info here:
Do you think adding a line if 'AF' in variant.info: (similar to the 'DP4' line) could be appropriate here?
One thing I did notice was that the VCF file produced when running tbprofiler on this SRA sample outside of nextflow was empty
Whereas the file produced when running the same command within nextflow had many more SNPs:
In this issue it seems the AF column may first need to be calculated from the AN and AC columns? https://github.com/samtools/bcftools/issues/1060
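For reference, deriving AF from AC and AN as raised in the question above could be sketched as follows. This is a minimal illustration, not code from pathogen-profiler: the function name allele_frequencies is hypothetical, and whether AC/AN is the frequency that combine_vcf_variants.py expects is not confirmed in this thread.

```python
def allele_frequencies(ac, an):
    """Return AC/AN for each ALT allele, or None when AN is zero or missing.

    ac: list of allele counts, one per ALT allele (VCF INFO/AC)
    an: total number of called alleles (VCF INFO/AN)
    """
    if not an:
        # Nothing was genotyped at this site, so no frequency can be derived.
        return None
    return [count / an for count in ac]


if __name__ == "__main__":
    print(allele_frequencies([1, 2], 4))
```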
Hi again, just double checking: are you using bcftools to call variants, or the freebayes default?
If you are using bcftools then it might explain the error, as the AF attribute isn't generated and, as a consequence, when the combine_vcf_variants.py script tries to add it, it fails because AF is not defined in the VCF header. A fix which should work is to add the definition if it doesn't exist.
if "AF" not in vcf.header.info.keys():
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
    vcf.header.add_line('##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">')
Seems to work on my end.
Bcftools also seems to be giving an empty vcf for me so I'll investigate that.
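The same "define AF only if it is missing" idea can be illustrated without any VCF library by working on the header as a list of text lines. This is only a sketch under that assumption: ensure_af_header is a hypothetical helper, not part of pathogen-profiler, and the actual fix operates on the parsed header object as shown in the snippet above.

```python
AF_HEADER = (
    '##INFO=<ID=AF,Number=A,Type=Float,'
    'Description="Estimated allele frequency in the range (0,1]">'
)


def ensure_af_header(header_lines):
    """Return a copy of the header lines that contains an AF INFO definition."""
    out = list(header_lines)
    if any(line.startswith("##INFO=<ID=AF,") for line in out):
        # AF is already defined (e.g. by a caller that emits it), so change nothing.
        return out
    # The #CHROM column line must remain the last header line,
    # so insert the definition just before it (or append if it is absent).
    idx = next(
        (i for i, line in enumerate(out) if line.startswith("#CHROM")),
        len(out),
    )
    out.insert(idx, AF_HEADER)
    return out
```

Calling the helper twice on the same header leaves it unchanged the second time, which mirrors the guard in the suggested fix.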
Thanks Jody! Yes, I am using bcftools.
Thanks, I think we're getting closer.
When bcftools doesn't call any variants, the combine_vcf_variants.py script has no variants to analyse and so doesn't raise an error, which might explain why you sometimes get the error and sometimes not.
Why bcftools isn't working I'm not entirely sure yet. I noticed in your commands you seem to be using the forward read twice, is this a typo?
tb-profiler profile --threads 8 --platform illumina --mapper bwa --caller bcftools --depth 10 --af 0.1 --read1 SRR10869015_trimmed_R1.fastq.gz --read2 SRR10869015_trimmed_R1.fastq.gz --prefix SRR10869015 --csv --call_whole_genome
This might explain why bcftools doesn't call any variants.
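A wrapper script could catch this kind of slip before launching the run. The check below is a hypothetical sketch (check_paired_reads is not a tb-profiler function); it only compares the two paths and does not inspect the file contents.

```python
import os


def check_paired_reads(read1, read2):
    """Raise ValueError if the forward and reverse read paths are the same file."""
    if os.path.abspath(read1) == os.path.abspath(read2):
        raise ValueError(
            f"--read1 and --read2 point to the same file: {read1}"
        )


if __name__ == "__main__":
    # Passes silently because the two paths differ.
    check_paired_reads(
        "SRR10869015_trimmed_R1.fastq.gz",
        "SRR10869015_trimmed_R2.fastq.gz",
    )
```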
Oops so sorry! Yes that was a typo, thanks for noticing that!
When I fix the command to use the reverse read, I'm getting the same variants in the vcf file and AF error I got when running within the nextflow wrapper. Mystery solved :)
Great, I'll release the patch this week
Thank you Jody! Will this be a pathogen-profiler v4.1.1 patch release?
I've made a release as v4.2.0, as there were a few bigger things I changed in pathogen-profiler. This will be paired with tb-profiler v6.2.1. It should be available on conda tomorrow.
Hello,
I came across an error with the Pathogen Profiler combine_vcf_variants.py script that seems to occur in approximately half of my samples with TBProfiler v6.2.0. The same samples previously ran successfully using v4.3.0. The samples causing this error don't appear to have a clear lineage/QC pattern.
An SRA sample that produces this error is SRR10869015
If line 171 is commented out, then everything appears to run fine.
System specifications: conda, Linux HPC, SLURM