reneshbedre / bioinfokit

Bioinformatics data analysis and visualization toolkit
MIT License
336 stars 77 forks

VCF annotations should either be added to the INFO field or the output should be a tab-separated document #10

Closed RamRS closed 4 years ago

RamRS commented 4 years ago

Almost every VCF annotation tool out there adds annotations to the INFO field, outputs a tab-delimited file, or does both. Adding new non-sample columns to a VCF is not annotation, and it breaks the VCF specification, which requires that every column after the 8 fixed fields (the FORMAT column and the per-sample columns) carry genotype information.

Please write your annotations as a tab-delimited output, or add them to the INFO field. Otherwise, the VCF is not usable downstream.
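To make the INFO-field option concrete, here is a minimal Python sketch of appending an annotation to the INFO column of a single VCF data line and declaring it in the header. This is not bioinfokit's code: the function names, the `GENE` key, and the example record are hypothetical and only illustrate the idea.

```python
def add_info_annotation(vcf_line: str, key: str, value: str) -> str:
    """Append key=value to the INFO column (8th field) of one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]
    # "." marks an empty INFO field, so replace it instead of appending to it.
    fields[7] = f"{key}={value}" if info in (".", "") else f"{info};{key}={value}"
    return "\t".join(fields)


def info_header(key: str, description: str) -> str:
    """Build the ##INFO meta line that declares the new key (assumed to be a single string)."""
    return f'##INFO=<ID={key},Number=1,Type=String,Description="{description}">'


# Hypothetical record: 8 fixed fields, then FORMAT and one sample column.
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20\tGT\t0/1"
annotated = add_info_annotation(line, "GENE", "geneA")

print(info_header("GENE", "Overlapping gene (hypothetical key)"))
print(annotated)
```

The column count stays the same, so the FORMAT and sample columns are untouched and downstream tools still see a valid record. A real implementation would also need to escape or reject values containing whitespace, semicolons, or equals signs, which this sketch does not do.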

reneshbedre commented 4 years ago

Hi @RamRS,

Thank you for your recommendations. Yes, the annotation information should be added to the INFO field, and I know other tools do this (e.g. SnpEff). I used to do the same, but several of our collaborators found it difficult to filter the data by the genomic annotation parameters. Therefore, I decided to add the annotations as tab-delimited columns after the sample columns. In a future release, I will provide another option to add the annotation to the INFO field.

The annotations are provided as a tab-delimited output. Please let me know if you are unable to get it.

Again, thank you for your recommendations. It will help to improve bioinfokit.

RamRS commented 4 years ago

In your documentation, the output is in VCF format, not tab-delimited text.

We can either have a custom tab-delimited annotation file or an annotated VCF file. Annotations can only be added to INFO fields if VCF format is to be maintained. People can use bcftools query to extract annotations in a tabular format (and use the -i/-e option to filter variants of interest), or you can output a .txt/.tsv file like ANNOVAR, VEP, snpEff, etc. It's not a VCF file if it contains these custom columns. The idea is to produce a pipeline-friendly VCF file for downstream processing, and a TSV file for other users who wish to eyeball the annotations directly.
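As a rough illustration of the "VCF for pipelines, TSV for people" split, the Python sketch below flattens selected INFO keys from a VCF into tab-delimited rows. It is a simplified stand-in for what bcftools query does, not a replacement for it. The function names, the `GENE` key, and the in-memory example VCF are assumptions made for this example.

```python
import csv
import io
import sys


def parse_info(info: str) -> dict:
    """Split an INFO string such as 'DP=20;GENE=geneA;DB' into a dict."""
    out = {}
    if info == ".":
        return out
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else "True"  # flag keys carry no value
    return out


def vcf_to_rows(lines, keys):
    """Yield [CHROM, POS, REF, ALT, <keys...>] for every data line."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta lines and the #CHROM header
        f = line.rstrip("\n").split("\t")
        info = parse_info(f[7])
        yield [f[0], f[1], f[3], f[4]] + [info.get(k, ".") for k in keys]


# Hypothetical two-record VCF held in memory.
vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20;GENE=geneA\n"
    "chr1\t200\t.\tC\tT\t50\tPASS\tDP=5\n"
)
keys = ["GENE"]
rows = list(vcf_to_rows(vcf, keys))

writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
writer.writerow(["CHROM", "POS", "REF", "ALT"] + keys)
writer.writerows(rows)
```

Records that lack a requested key get a `.` placeholder, so every row has the same number of columns and the table imports cleanly.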

IMO this is a bug, not an enhancement.

j-andrews7 commented 4 years ago

I would second the recommendation that VCF output be maintained, as several tools already exist for converting VCF to a more readable tab-delimited format. GATK's VariantsToTable tool is one option for performing this task quite simply.

reneshbedre commented 4 years ago

@RamRS and @j-andrews7,

I understand your points and will update it in a future release to add the annotation to the INFO field of the VCF file. I will also provide an option to create a tab- and/or comma-separated file with additional annotation columns. That will make the data easier to handle in Excel and to interpret.

Thank you for your recommendations.

RamRS commented 4 years ago

Yeah, delimited files are best for Excel - you can exclude the ## headers in the output so it is easier for people to import.
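A minimal Python sketch of that header handling, assuming the delimited output starts from VCF-style text: it drops the `##` meta lines and keeps the `#CHROM` row as the column header. The function name and the example lines are hypothetical, and stripping the leading `#` from the header row is a choice made here for spreadsheet import, not something the thread specifies.

```python
def strip_meta_lines(lines):
    """Drop '##' meta lines; keep the '#CHROM' header row without its '#'."""
    for line in lines:
        if line.startswith("##"):
            continue
        yield line[1:] if line.startswith("#") else line


# Hypothetical input lines.
example = [
    "##fileformat=VCFv4.2\n",
    "##source=hypothetical\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20\n",
]
cleaned = list(strip_meta_lines(example))
print("".join(cleaned), end="")
```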

reneshbedre commented 4 years ago

This issue has been fixed in v0.9.4 (the output is now provided as a tab-delimited annotated text file).