bcftools csq - GFF Format

samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html

http://samtools.github.io/bcftools/

Other

681 stars 240 forks source link

bcftools csq - GFF Format #1078

Open fbemm opened 5 years ago

fbemm commented 5 years ago

Hi,

could someone explain me why the feature type for a line in a GFF is not taken from the third GFF filed but bcftools csq expect each gene and transcript with a prefix (e.g., gene: or transcript:)? Inflates GFFs pretty much with redundant information and introduces IDs that are longer than they actually have to be. Guess there is a rational that I just don't get at the moment.

Thx, Felix

Screenshot from 2019-09-03 15-09-07

pd3 commented 5 years ago

GFFs provided by Ensembl use this convention ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/

fbemm commented 5 years ago

That I know but almost no other annotation (tool) does produce such a format. Plus, putting the prefixes there is redundant with the feature column of the GFF format specification. The feature column can probably considered more stable in its definition than a prefix of the attribute ID field. I can provide a patch if this is considered a better way of identifying GFF_TSCRIPT_LINE and GFF_GENE_LINE.

pd3 commented 5 years ago

I am open for this to be changed as long as it continues working with Ensembl files. A more general (and also an easier) solution might be to provide a new script gff2gff to convert between the various flavors of GFF files.

fbemm commented 5 years ago

The latter might be a short term fix but one has to remove gene: or transcript: from the annotated VCF in the end again or has to live with the inflated IDs. I just checked the human GFF3:

Testing for a GFF_GENE_LINE would need to be TRUE if the third field of an (Ensembl) GFF contains:

gene
ncRNA_gene
pseudogene

grep "ID=gene:" Homo_sapiens.GRCh38.97.chromosome.1.gff3 | cut -f3 | sort | uniq -c

Testing for a GFF_TSCRIPT_LINE would need to be TRUE if the third field of an (Ensembl) GFF contains:

lnc_RNA
miRNA
mRNA
ncRNA
pseudogenic_transcript
rRNA
scRNA
snoRNA
snRNA
unconfirmed_transcript

Best to stay in sync with SequenceOntology (which Ensembl promotes):

Gene --> http://www.sequenceontology.org/browser/current_release/term/SO:0000704 Transcript --> http://www.sequenceontology.org/browser/current_release/term/SO:0000673

brentp commented 3 years ago

Hi, would be nice if the prefix ("transcript:", "gene:", etc) were optional. This is causing problems in otherwise reasonable GFFs, e.g. : marbl/CHM13#31

pd3 commented 3 years ago

There are too many possible variations a GFF can have, I don't want to burden bcftools csq with that complexity. I will accept a pull request that extends the https://github.com/samtools/bcftools/blob/develop/misc/gff2gff.py script and adds the prefixes when missing.

brentp commented 3 years ago

Petr, thanks for the reply. I'll look into making a PR for the GFF, the latest error is:

Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).

do you have a recommendation for this? There are no phase annotations in the GFF. thanks for any ideas. -Brent

dKlee99 commented 3 years ago

Petr, thanks for the reply. I'll look into making a PR for the GFF, the latest error is:
Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).
do you have a recommendation for this? There are no phase annotations in the GFF. thanks for any ideas. -Brent

@brentp Hi, any updates on this? I'm experiencing the same issue .

Best, DK

pd3 commented 3 years ago

Phase is 8th column of GFF https://www.ncbi.nlm.nih.gov/datasets/docs/reference-docs/file-formats/about-ncbi-gff3. The program detected some inconsistency between expected and observed phase (frame).