bcftools csq cannot parse bacterial gff3 files

johnlees commented 7 years ago

I have tried the bcftools csq command with the attached fasta, and three different gff files (one from ncbi, one ncbi modified to add transcripts, and one from ensembl bacteria). In all cases the gff cannot be parsed, and I haven't found the error message informative as to why.

Files to replicate this are attached here: bcftools_csq.zip

The commands I tried with their stderr are in bcftools_csq.stderr


Parsing Streptococcus_pneumoniae_atcc_700669.ASM2666v1.32.chromosome.Chromosome.gff3.gz ...
[csq.c:699 gff_parse_id] Could not parse the line: Chromosome   ena gene    186 1547    .   +   .   ID=gene:SPN23F00010;Name=dnaA;biotype=protein_coding;description=chromosomal replication initiator protein;gene_id=SPN23F00010;logic_name=ena

bcftools csq -f Streptococcus_pneumoniae_ATCC_700669_v1.fa -g transcripts.gtf dhh.vcf.gz 2>> bcftools_csq.stderr
Parsing transcripts.gtf ...
[csq.c:692 gff_parse_id] Could not parse the line, "Parent=transcript:" not present: FM211187   protein_coding  exon    186 1547    .   +   0   gene_id ""FM211187.2.1""; transcript_id ""dnaA""; exon_number "1"; 

bcftools csq -f Streptococcus_pneumoniae_ATCC_700669_v1.fa -g Streptococcus_pneumoniae_ATCC_700669_v1.gff dhh.vcf.gz 2>> bcftools_csq.stderr
Parsing Streptococcus_pneumoniae_ATCC_700669_v1.gff ...
ignore: FM211187    EMBL    databank_entry  1   2221315 0.000   +   .   ID="FM211187.1";organism="Streptococcus pneumoniae ATCC 700669";strain="ATCC 700669";mol_type="genomic DNA";db_xref="taxon:561276"
[csq.c:692 gff_parse_id] Could not parse the line, "Parent=transcript:" not present: FM211187   EMBL    CDS 186 1547    0.000   +   0   ID="FM211187.2";transl_table=11;gene="dnaA";locus_tag="SPN23F00010";product="chromosomal replication initiator protein";db_xref="GOA:B8ZJH9";db_xref="InterPro:IPR001957";db_xref="InterPro:IPR003593";db_xref="InterPro:IPR010921";db_xref="InterPro:IPR013159";db_xref="InterPro:IPR013317";db_xref="InterPro:IPR018312";db_xref="InterPro:IPR020591";db_xref="UniProtKB/Swiss-Prot:B8ZJH9";protein_id="CAR67867.1";translation="MKEKQFWNRILEFAQERLTRSMYDFYAIQAELIKVEENVATIFLPRSEMEMVWEKQLKDIIVVAGFEIYDAEITPHYIFTKPQDTTSSQVEEATNLTLYDYSPKLVSIPYSDTGLKEKYTFDNFIQGDGNVWAVSAALAVSEDLALTYNPLFIYGGPGLGKTHLLNAIGNEILKNIPNARVKYIPAESFINDFLDHLRLGEMEKFKKTYRSLDLLLIDDIQSLSGKKVATQEEFFNTFNALHDKQKQIVLTSDRSPKHLEGLEERLVTRFSWGLTQTITPPDFETRIAILQSKTEHLGYNFQSDTLEYLAGQFDSNVRDLEGAINDITLIARVKKIKDITIDIAAEAIRARKQDVSQMLVIPIDKIQTEVGNFYGVSIKEMKGSRRLQNIVLARQVAMYLSRELTDNSLPKIGKEFGGKDHTTVIHAHAKIKSLIDQDDNLRLEIESIKKKIK"```

pd3 commented 7 years ago

The error messages actually do indicate the problem: the "Parent=transcript:" keyword is not found in some records, see an example of a supported file format here: ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.gff3.gz

Maybe there could be a small perl script added for convertin the most common GFF variants to a format supported by bcftools/csq.

johnlees commented 7 years ago

Ah I see, thanks a lot. Such a perl script would certainly be useful, but I'd guess it'd be difficult to cover all the variants of GFF files.

I will write one for ensembl bacteria files at some point, and attach it here in case it's useful to others in the future.

pd3 commented 7 years ago

That would be great, thanks

tseemann commented 6 years ago

@pd3 @johnlees I encountered this problem today too in my attempts to find an alternative to snpEff.

Clearly "Parent=transcript:" not present means it wants that tag. There does seem to be information in the source code comments about the expected gene mode:

https://github.com/samtools/bcftools/blob/develop/csq.c

    The gff parsing logic
        We collect features such by combining gff lines A,B,C as follows:
            A .. gene line with a supported biotype
                    A.ID=~/^gene:/
            B .. transcript line referencing A
                    B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/
            C .. corresponding CDS, exon, and UTR lines:
                    C[3] in {"CDS","exon","three_prime_UTR","five_prime_UTR"} && C.Parent=~/^transcript:B.ID/ 
        For coding biotypes ("protein_coding" or "polymorphic_pseudogene") the
        complete chain link C -> B -> A is required. For the rest, link B -> A suffices.

It's even in perl regexp form so I can read it :-P

Here's a CDS from chrY formatted for brevity

gene    2786855 2787699 .       -       .       ID=gene:ENSG00000184895;Name=SRY;biotype=protein_coding;description=sex determining region Y [Source:HGNC Symbol%3BAcc:HGNC:11311];gen>
mRNA    2786855 2787699 .       -       .       ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_>
3'_UTR  2786855 2786988 .       -       .       Parent=transcript:ENST00000383070
exon    2786855 2787699 .       -       .       Parent=transcript:ENST00000383070;Name=ENSE00001494622;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001494622;ra>
CDS     2786989 2787603 .       -       0       ID=CDS:ENSP00000372547;Parent=transcript:ENST00000383070;protein_id=ENSP00000372547
5'_UTR  2787604 2787699 .       -       .       Parent=transcript:ENST00000383070

pd3 commented 6 years ago

The documentation excerpt pasted above states that gene, transcript and CDS lines are required, but in your example the transcript is not shown. (By the way, the program must have printed also the line where the substring "Parent=transcript:" was not found, so you should be able to check where the program got stuck.)

Here is the expected format in more detail:

# The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
# looks up their parent transcript (determined from the "Parent=transcript:" attribute),
# the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
# (the most interesting is "protein_coding").
#
# Attributes required for 
#   gene lines:
#   - ID=gene:<gene_id> 
#   - biotype=<biotype>
#   - Name=<gene_name>      [optional]
#
#   transcript lines:
#   - ID=transcript:<transcript_id>
#   - Parent=gene:<gene_id>
#   - biotype=<biotype>
#
#   other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
#   - Parent=transcript:<transcript_id>
#
# Supported biotypes:
#   - see the function gff_parse_biotype() in bcftools/csq.c

1   ignored_field  gene            21  2148  .   -   .   ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1   ignored_field  transcript      21  2148  .   -   .   ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1   ignored_field  three_prime_UTR 21  2054  .   -   .   Parent=transcript:TranscriptId
1   ignored_field  exon            21  2148  .   -   .   Parent=transcript:TranscriptId
1   ignored_field  CDS             21  2148  .   -   1   Parent=transcript:TranscriptId
1   ignored_field  five_prime_UTR  210 2148  .   -   .   Parent=transcript:TranscriptId

And many more examples can be found in bcftools/test/csq/*/*.gff.

If you are going to write a gff2gff conversion script I am happy to host it in bcftools/misc. This should be straightforward really, looking at your example you'd just need to codify the naming (note the UTR notation) and add a transcript for each CDS.

tseemann commented 6 years ago

I finally managed to get it working. What the perl rules example doesn't say is that each of the lines must also have biotype=protein_coding (or other approved type).

This is what worked:

gene  ID=gene:foo;biotype=protein_coding
mRNA  ID=transcript:foo;Parent=gene:foo;biotype=protein_coding
CDS   ID=CDS:foo;Parent=mRNA:foo;biotype=protein_coding

On my mini example:

Indexed 63 transcripts, 0 exons, 63 CDSs, 0 UTRs

BCSQ=frameshift|ILHGENPG_00058|ILHGENPG_00058|protein_coding|+|434VHASSMLESARSITRSTLVGGHAGGNGGIKND*>434VLLQC*|53660GTACAT>GT

Bacterial GFF files come in 2 flavours

Every gene has a gene + CDS/rRNA/etc feature (NCBI style)
Every gene only has the CDS/rRNA/etc feature.

However this is now changing as more predictor tools and experimental evidence is being used for the UTRs like promoters and terminators etc.

My popular annotation tool Prokka can produce either. I just need to make an option to output in bcftools csq format.

My script is currently dependent on Bio::Perl and only works fromgbk2gff, I still need to write a gff2gff one, but I am happy to contribute a pull request soon.

tseemann commented 6 years ago

@pd3 I notice the {transcript,gene,CDS}: part of the ID is not reported in the BCSQ line as part of the ID. Forgive my ignorance, but I assume this type: notation is a semi-formalised naming convention in the model organism community?

pd3 commented 6 years ago

This is just a practical decision to avoid the complexity of parsing of the many GFF flavors that exist out there. Ensembl's human GFF is one of the most frequently used, so that became the default.

Note that contrary to your statement above, the biotype attribute does not need to be present for each line, only the gene and transcript lines. The GFF description is now included in the manual page https://github.com/samtools/bcftools/blob/develop/doc/bcftools.txt#L1066-L1098

tseemann commented 6 years ago

@pd3 thanks so much for the clarification of the biotype and for the extra docs. the htslib teams on github are so helpful and responsive. thank you!

danrlu commented 4 years ago

Thanks for the discussion above! It's very helpful! Please allow me to do a quick summary, and point out that spelling and letter case needs to be exact match:

column 3 (type)                    column 9 (attributes)
xxx                                ID=gene:xxx;biotype=xxx
xxx                                ID=transcript:xxx;Parent=gene:xxx;biotype=xxx
exon                               Parent=transcript:xxx
CDS                                Parent=transcript:xxx
five_prime_UTR                     Parent=transcript:xxx
three_prime_UTR                    Parent=transcript:xxx

Gene line and transcript line are identified by column 9 having ID=gene or ID=transcript, so column 3 can be anything (mRNA, lncRNA, etc.).

List of supported biotype here: https://github.com/samtools/bcftools/blob/develop/csq.c biotype=protein_coding is most useful.

Thank you for developing these wonderful tools!

flashton2003 commented 4 years ago

@pd3 @tseemann @johnlees did any of you end up with a script to convert GFF3 to bcftools csq compatible Ensembl GFF3?

pd3 commented 4 years ago

The only code made publicly available I am aware of is here https://github.com/samtools/bcftools/issues/1208#issuecomment-620642373. I'd be very happy to host a gff2gff script in bcftools/misc but cannot commit to developing it myself unfortunately. Even if its functionality is limited, it would still be a worthwhile contribution.

flashton2003 commented 4 years ago

Thank you @pd3! If I roll my own I'll post it here at least.

flashton2003 commented 4 years ago

This is my script. Takes a gff converted from gbk with the BioPerl gbk2gff3.pl script. Outputs something which can be used by bcftools csq. It's probably only useful as a starting point for others with the same problem to work from, but it works from my use case.

https://gist.github.com/flashton2003/b246ce509300a8669d27de7a4eb5c4c9

Sorry, I don't have time to make something more general.

Expected input here, expected output here.

pd3 commented 4 years ago

This script has been now added, thank you @flashton2003

flashton2003 commented 4 years ago

You're welcome!

On Mon, Sep 7, 2020 at 6:02 PM Petr Danecek notifications@github.com wrote:

This script has been now added, thank you @flashton2003 https://github.com/flashton2003

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/bcftools/issues/530#issuecomment-688411075, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAT6FEWRFK5ALQCZKA5OABDSET7ZFANCNFSM4C2H3UEA .

samtools / bcftools

bcftools csq cannot parse bacterial gff3 files #530