mitoNGS / MToolBox

A bioinformatics pipeline to analyze mtDNA from NGS data
http://sourceforge.net/projects/mtoolbox/?source=navbar
GNU General Public License v3.0
90 stars 38 forks source link

"CI_lower;CI_upper" column in .annotation.csv is empty #13

Closed vmorozov closed 8 years ago

vmorozov commented 8 years ago

The "CI_lower;CI_upper" column in .annotation.csv is empty even vcf files have heteroplasmy confidence values

insectopalo commented 8 years ago

The variants_functional_annotation.py script tries to fetch that information during the main_functional_analysis() function. It looks in the VCF dictionary (VCF_dict_tmp) but it's unable to find any of the variants! Hence the blank columns in the ouput file.

Investigating a bit through the code, I see that the keys in the VCF_dict_tmp are position+VARIANT and not position+REFERENCE, which is the string that the function looks for (not finding anything). This could be the reason.

However, I'm also a bit puzzled by why the variants in the VCF output file do not correspond to the variants annotated in the .annotation.csv files. This maybe another reason why the keys are not found within VCF_dict_tmp[sample]: they don't exist. I'd expect to get the annotation for the very same variants reported in the VCF. Perhaps there's something I'm missing.

vmorozov commented 8 years ago

I just can say that it is not limited to the standalone version. "HF","CI_lower;CI_upper" columns are emty in https://mseqdr.org/mtoolbox.php output with example_mito1.bam as input

insectopalo commented 8 years ago

@vmorozov I suggest you to try changing line 89 of file variants_functional_annotation.py from

    ref = ref_seq[snp[0]-1]

to

    ref = snp[1]

Note that if you're using rCRS as a reference sequence, in the annotation file only positions that are variants for that particular reference will have a HF and CI value.

In addition, ensure that your sample id's don't have any dashes '-'.

vmorozov commented 8 years ago

@insectopalo the change you suggested doesn't produce any differences in annotation.csv files. The HF and CI are still empty. I run MToolBox with default options. According to summary....txt it uses the RSRS reference

insectopalo commented 8 years ago

Sorry, it worked for me. But the truth is I have changed more things along the MToolbox pipeline so perhaps there's something else I can't remember now. However, worth to try and ensure that your filenames don't contain dashes, or this could give you problems:

sample_name = contig_file.split('-')[0]
vmorozov commented 8 years ago

I tried to run the MToolBox on on 1000k genome HG00096 low and high covarage samples samtools view -b ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam MT: -o HG00096.bam samtools view -b ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00096/high_coverage_alignment/HG00096.wgs.ILLUMINA.bwa.GBR.high_cov_pcr_free.20140203.bam MT: -o HG00096h.bam

they get same haplogroup. The annotation.csv files have HF and CI values. But prioritized variants are not same....

Find the results attached summary_20160415_111732.txt prioritized_variants.txt

I am puzzled why I don't get heteroplasmy values on my bam files with MToolBox. I can get expected results (same haplogroup, same homoplasmic and heteroplasmic variants for techical replicates) at https://mtdna-server.uibk.ac.at

clody23 commented 8 years ago

Dear MToolBox users,

fist of all we thank you for the interest in our tool, though we know we need to work a bit more on the pipeline code to improve the error handling

1) Regarding the issue of the lack of CI in the annotation file, @vmorozov could you please provide us:

to better understand the issue. We'll anyway go through the code of the variant_functional_annotation.py to see if there is anything wrong.

2) Just few clarifications about the variant annotation in the MToolBox pipeline. Variants that pop up in the annotation.csv file are usually a subset of the variants in the VCF file, where there are no cutoff for the heteroplasmic fraction (HF) and everything which meets the pipeline filters is reported. The variants in the annotation.csv file are generated starting from an assemble consensus FASTA sequence, based on the GSNAP alignments, that the user creates setting a HF threshold. So if the user does not specify this in the command line, 0.8 HF will be the default value that a variant need to meet to be reported in the consensus FASTA sequence. If the variant is a <0.8, a IUPAC ambiguity will be reported in the consensus. Cause no IUPAC is available for insertions/deletions, these type of variants < HF will be lost in the consensus sequence and in turn the annotation fie. Furthermore, the annotation of variants found is always based on RSRS, therefore if the user chose rCRS as reference for the mapping step, the annotation file will also show rCRS position which vary with respect to RSRS, but without reporting for them any HF and CI, since those alleles couldn't be identified in the rCRS-based VCF file. If no CI and HF is showed for ALL the variant positions (and not only the rCRS-RSRS variant positions) in the annotation file, for sure there might be an issue we should dig into.

3) The criteria to sort variants in the annotation.csv file are based on:

In the case where @vmorozov had the same sample, with high-low coverage, I would expect that the sorting order of the variants in the high-coverage annotation.csv file to be slightly different from low-coverage one, cause there should be more variants detected in the high-cov, maybe with very low nucleotide variability. Furthermore the variants from the low-cov should be a subset of the high-cov, if you ran the same parameters on both the datasets. Can you please provide us this info?

Variants reported in the prioritized_variants.txt file are those from the annotation.csv file identified by all the 3 references used in the annotation process, sorted in the same order as of the annotation.csv.

For your information, we are about to release a new version (1.0) which will try to handle more efficiently variables setup and maybe solve also some of these issues.

We again thank you for inputs and we are happy to take your contributions to the MToolBox code ( through pull requests), in case you have some useful improvement to suggest.

Cheers, Claudia

clody23 commented 8 years ago

This issue should be now fixed with the MToolBox version 1.0