Open johnlees opened 7 years ago
The error messages actually do indicate the problem: the "Parent=transcript:" keyword is not found in some records, see an example of a supported file format here: ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.gff3.gz
Maybe there could be a small perl script added for convertin the most common GFF variants to a format supported by bcftools/csq.
Ah I see, thanks a lot. Such a perl script would certainly be useful, but I'd guess it'd be difficult to cover all the variants of GFF files.
I will write one for ensembl bacteria files at some point, and attach it here in case it's useful to others in the future.
That would be great, thanks
@pd3 @johnlees I encountered this problem today too in my attempts to find an alternative to snpEff
.
Clearly "Parent=transcript:" not present
means it wants that tag. There does seem to be information in the source code comments about the expected gene mode:
https://github.com/samtools/bcftools/blob/develop/csq.c
The gff parsing logic
We collect features such by combining gff lines A,B,C as follows:
A .. gene line with a supported biotype
A.ID=~/^gene:/
B .. transcript line referencing A
B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/
C .. corresponding CDS, exon, and UTR lines:
C[3] in {"CDS","exon","three_prime_UTR","five_prime_UTR"} && C.Parent=~/^transcript:B.ID/
For coding biotypes ("protein_coding" or "polymorphic_pseudogene") the
complete chain link C -> B -> A is required. For the rest, link B -> A suffices.
It's even in perl regexp form so I can read it :-P
Here's a CDS from chrY
formatted for brevity
gene 2786855 2787699 . - . ID=gene:ENSG00000184895;Name=SRY;biotype=protein_coding;description=sex determining region Y [Source:HGNC Symbol%3BAcc:HGNC:11311];gen>
mRNA 2786855 2787699 . - . ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_>
3'_UTR 2786855 2786988 . - . Parent=transcript:ENST00000383070
exon 2786855 2787699 . - . Parent=transcript:ENST00000383070;Name=ENSE00001494622;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001494622;ra>
CDS 2786989 2787603 . - 0 ID=CDS:ENSP00000372547;Parent=transcript:ENST00000383070;protein_id=ENSP00000372547
5'_UTR 2787604 2787699 . - . Parent=transcript:ENST00000383070
The documentation excerpt pasted above states that gene, transcript and CDS lines are required, but in your example the transcript is not shown. (By the way, the program must have printed also the line where the substring "Parent=transcript:"
was not found, so you should be able to check where the program got stuck.)
Here is the expected format in more detail:
# The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
# looks up their parent transcript (determined from the "Parent=transcript:" attribute),
# the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
# (the most interesting is "protein_coding").
#
# Attributes required for
# gene lines:
# - ID=gene:<gene_id>
# - biotype=<biotype>
# - Name=<gene_name> [optional]
#
# transcript lines:
# - ID=transcript:<transcript_id>
# - Parent=gene:<gene_id>
# - biotype=<biotype>
#
# other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
# - Parent=transcript:<transcript_id>
#
# Supported biotypes:
# - see the function gff_parse_biotype() in bcftools/csq.c
1 ignored_field gene 21 2148 . - . ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1 ignored_field transcript 21 2148 . - . ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1 ignored_field three_prime_UTR 21 2054 . - . Parent=transcript:TranscriptId
1 ignored_field exon 21 2148 . - . Parent=transcript:TranscriptId
1 ignored_field CDS 21 2148 . - 1 Parent=transcript:TranscriptId
1 ignored_field five_prime_UTR 210 2148 . - . Parent=transcript:TranscriptId
And many more examples can be found in bcftools/test/csq/*/*.gff
.
If you are going to write a gff2gff
conversion script I am happy to host it in bcftools/misc
. This should be straightforward really, looking at your example you'd just need to codify the naming (note the UTR notation) and add a transcript for each CDS.
I finally managed to get it working. What the perl rules example doesn't say is that each of the lines must also have biotype=protein_coding
(or other approved type).
This is what worked:
gene ID=gene:foo;biotype=protein_coding
mRNA ID=transcript:foo;Parent=gene:foo;biotype=protein_coding
CDS ID=CDS:foo;Parent=mRNA:foo;biotype=protein_coding
On my mini example:
Indexed 63 transcripts, 0 exons, 63 CDSs, 0 UTRs
BCSQ=frameshift|ILHGENPG_00058|ILHGENPG_00058|protein_coding|+|434VHASSMLESARSITRSTLVGGHAGGNGGIKND*>434VLLQC*|53660GTACAT>GT
Bacterial GFF files come in 2 flavours
gene
+ CDS
/rRNA
/etc feature (NCBI style)CDS
/rRNA
/etc feature.However this is now changing as more predictor tools and experimental evidence is being used for the UTRs like promoters and terminators etc.
My popular annotation tool Prokka can produce either. I just need to make an option to output in bcftools csq
format.
My script is currently dependent on Bio::Perl
and only works fromgbk2gff
, I still need to write a gff2gff
one, but I am happy to contribute a pull request soon.
@pd3 I notice the {transcript,gene,CDS}:
part of the ID
is not reported in the BCSQ
line as part of the ID
. Forgive my ignorance, but I assume this type:
notation is a semi-formalised naming convention in the model organism community?
This is just a practical decision to avoid the complexity of parsing of the many GFF flavors that exist out there. Ensembl's human GFF is one of the most frequently used, so that became the default.
Note that contrary to your statement above, the biotype attribute does not need to be present for each line, only the gene and transcript lines. The GFF description is now included in the manual page https://github.com/samtools/bcftools/blob/develop/doc/bcftools.txt#L1066-L1098
@pd3 thanks so much for the clarification of the biotype
and for the extra docs.
the htslib teams on github are so helpful and responsive. thank you!
Thanks for the discussion above! It's very helpful! Please allow me to do a quick summary, and point out that spelling and letter case needs to be exact match:
column 3 (type) column 9 (attributes)
xxx ID=gene:xxx;biotype=xxx
xxx ID=transcript:xxx;Parent=gene:xxx;biotype=xxx
exon Parent=transcript:xxx
CDS Parent=transcript:xxx
five_prime_UTR Parent=transcript:xxx
three_prime_UTR Parent=transcript:xxx
Gene line and transcript line are identified by column 9 having ID=gene
or ID=transcript
, so column 3 can be anything (mRNA, lncRNA, etc.).
List of supported biotype here: https://github.com/samtools/bcftools/blob/develop/csq.c
biotype=protein_coding
is most useful.
Thank you for developing these wonderful tools!
@pd3 @tseemann @johnlees did any of you end up with a script to convert GFF3 to bcftools csq compatible Ensembl GFF3?
The only code made publicly available I am aware of is here https://github.com/samtools/bcftools/issues/1208#issuecomment-620642373. I'd be very happy to host a gff2gff
script in bcftools/misc
but cannot commit to developing it myself unfortunately. Even if its functionality is limited, it would still be a worthwhile contribution.
Thank you @pd3! If I roll my own I'll post it here at least.
This is my script. Takes a gff converted from gbk with the BioPerl gbk2gff3.pl script. Outputs something which can be used by bcftools csq. It's probably only useful as a starting point for others with the same problem to work from, but it works from my use case.
https://gist.github.com/flashton2003/b246ce509300a8669d27de7a4eb5c4c9
Sorry, I don't have time to make something more general.
This script has been now added, thank you @flashton2003
You're welcome!
On Mon, Sep 7, 2020 at 6:02 PM Petr Danecek notifications@github.com wrote:
This script has been now added, thank you @flashton2003 https://github.com/flashton2003
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/bcftools/issues/530#issuecomment-688411075, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAT6FEWRFK5ALQCZKA5OABDSET7ZFANCNFSM4C2H3UEA .
I have tried the bcftools csq command with the attached fasta, and three different gff files (one from ncbi, one ncbi modified to add transcripts, and one from ensembl bacteria). In all cases the gff cannot be parsed, and I haven't found the error message informative as to why.
Files to replicate this are attached here: bcftools_csq.zip
The commands I tried with their stderr are in bcftools_csq.stderr