Open fbemm opened 5 years ago
GFFs provided by Ensembl use this convention ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/
That I know but almost no other annotation (tool) does produce such a format. Plus, putting the prefixes there is redundant with the feature column of the GFF format specification. The feature column can probably considered more stable in its definition than a prefix of the attribute ID field. I can provide a patch if this is considered a better way of identifying GFF_TSCRIPT_LINE and GFF_GENE_LINE.
I am open for this to be changed as long as it continues working with Ensembl files. A more general (and also an easier) solution might be to provide a new script gff2gff
to convert between the various flavors of GFF files.
The latter might be a short term fix but one has to remove gene:
or transcript:
from the annotated VCF in the end again or has to live with the inflated IDs. I just checked the human GFF3:
Testing for a GFF_GENE_LINE would need to be TRUE if the third field of an (Ensembl) GFF contains:
grep "ID=gene:" Homo_sapiens.GRCh38.97.chromosome.1.gff3 | cut -f3 | sort | uniq -c
Testing for a GFF_TSCRIPT_LINE would need to be TRUE if the third field of an (Ensembl) GFF contains:
Best to stay in sync with SequenceOntology (which Ensembl promotes):
Gene --> http://www.sequenceontology.org/browser/current_release/term/SO:0000704 Transcript --> http://www.sequenceontology.org/browser/current_release/term/SO:0000673
Hi, would be nice if the prefix ("transcript:", "gene:", etc) were optional. This is causing problems in otherwise reasonable GFFs, e.g. : marbl/CHM13#31
There are too many possible variations a GFF can have, I don't want to burden bcftools csq
with that complexity. I will accept a pull request that extends the https://github.com/samtools/bcftools/blob/develop/misc/gff2gff.py
script and adds the prefixes when missing.
Petr, thanks for the reply. I'll look into making a PR for the GFF, the latest error is:
Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).
do you have a recommendation for this? There are no phase annotations in the GFF. thanks for any ideas. -Brent
Petr, thanks for the reply. I'll look into making a PR for the GFF, the latest error is:
Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).
do you have a recommendation for this? There are no phase annotations in the GFF. thanks for any ideas. -Brent
@brentp Hi, any updates on this? I'm experiencing the same issue .
Best, DK
Phase is 8th column of GFF https://www.ncbi.nlm.nih.gov/datasets/docs/reference-docs/file-formats/about-ncbi-gff3. The program detected some inconsistency between expected and observed phase (frame).
Hi,
could someone explain me why the feature type for a line in a GFF is not taken from the third GFF filed but
bcftools csq
expect each gene and transcript with a prefix (e.g.,gene:
ortranscript:
)? Inflates GFFs pretty much with redundant information and introduces IDs that are longer than they actually have to be. Guess there is a rational that I just don't get at the moment.Thx, Felix