mapping-commons / sssom-py

Python toolkit for SSSOM mapping format
https://mapping-commons.github.io/sssom-py/index.html#
MIT License

Add command to expand xrefs section in GFF3 files #483

Open cmungall opened 8 months ago

cmungall commented 8 months ago

I am 97% sure this is out of scope for sssom-py and should be either its own tool or part of a general GFF package. But this seems like a good place to seed the idea.

GFF allows various kinds of annotations in column 9, and many of these are CURIEs. It's often useful to expand them. E.g. a gene annotated with an EC number by prokka could be expanded to a GO annotation using a GO SSSOM file.
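
A rough sketch of what such an expansion could look like, using only the standard library rather than any sssom-py API. The mapping table is a made-up SSSOM-style TSV, the GO identifier in it is a placeholder (not a real mapping), and the function names and the choice of `Ontology_term` as the target attribute are assumptions for illustration only.

```python
import csv
import io

# Hypothetical SSSOM-style TSV. GO:0000000 is a placeholder, NOT a real mapping.
SSSOM_TSV = (
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "EC:3.2.2.21\tskos:exactMatch\tGO:0000000\tsemapv:ManualMappingCuration\n"
)


def load_mappings(handle):
    """Index object_id values by subject_id, skipping '#' metadata lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    index = {}
    for row in csv.DictReader(data_lines, delimiter="\t"):
        index.setdefault(row["subject_id"], []).append(row["object_id"])
    return index


def expand_attributes(column9, index, source_key="ec_number", target_key="Ontology_term"):
    """Return column 9 with mapped CURIEs for source_key appended under target_key."""
    # Keep only well-formed key=value fields; split on the first '=' only.
    attrs = dict(field.split("=", 1) for field in column9.split(";") if "=" in field)
    mapped = []
    for curie in attrs.get(source_key, "").split(","):
        mapped.extend(index.get(curie, []))
    if mapped:
        existing = attrs.get(target_key)
        attrs[target_key] = ",".join(([existing] if existing else []) + mapped)
    return ";".join(f"{key}={value}" for key, value in attrs.items())


index = load_mappings(io.StringIO(SSSOM_TSV))
expanded = expand_attributes("ID=gene1;ec_number=EC:3.2.2.21;pfam=PF00730", index)
print(expanded)
```

Rows without a matching `subject_id` pass through unchanged, so the same function could be run over every feature line of a file.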

cthoyt commented 8 months ago

Can you link to an example of a GFF file please?

cmungall commented 8 months ago

Here are the first few lines of the output of prokka run on a metagenomic sample (downloaded from here in NMDC).

Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     18      563     72.59   -       0       ID=Ga0495479_0000001_18_563;translation_table=11;start_type=ATG;product=5-methylcytosine-specific restriction endonuclease McrA;product_source=COG1403;cog=COG1403;pfam=PF14279
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     692     1357    88.17   -       0       ID=Ga0495479_0000001_692_1357;translation_table=11;start_type=ATG;product=phospholipase/carboxylesterase;product_source=KO:K06999;cath_funfam=3.40.50.1820;cog=COG0400;ko=KO:K06999;pfam=PF02230;superfamily=53474
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     1415    2068    95.20   -       0       ID=Ga0495479_0000001_1415_2068;translation_table=11;start_type=ATG;product=DNA-3-methyladenine glycosylase II;product_source=KO:K01247;cath_funfam=1.10.1670.10,1.10.340.30;cog=COG0122;ko=KO:K01247;ec_number=EC:3.2.2.21;pfam=PF00730;smart=SM00478;superfamily=48150
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     2223    3116    110.08  +       0       ID=Ga0495479_0000001_2223_3116;translation_table=11;start_type=ATG;product=glutamyl-Q tRNA(Asp) synthetase;product_source=KO:K01894;cath_funfam=3.40.50.620;cog=COG0008;ko=KO:K01894;ec_number=EC:6.1.1.-;pfam=PF00749;superfamily=52374
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     3293    4492    183.16  +       0       ID=Ga0495479_0000001_3293_4492;translation_table=11;start_type=ATG;product=CheY-like chemotaxis protein;product_source=COG0784;cath_funfam=1.10.287.130,3.30.565.10,3.40.50.2300;cog=COG0784;pfam=PF00072,PF00512,PF02518;smart=SM00387,SM00388,SM00448;superfamily=47384,55874
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     4632    6602    342.80  -       0       ID=Ga0495479_0000001_4632_6602;translation_table=11;start_type=ATG;product=(2R)-ethylmalonyl-CoA mutase;product_source=KO:K14447;cath_funfam=3.20.20.240,3.40.50.280;cog=COG1884,COG2185;ko=KO:K14447;pfam=PF01642,PF02310;superfamily=51703,52242;tigrfam=TIGR00640,TIGR00641
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     6630    6881    34.32   -       0       ID=Ga0495479_0000001_6630_6881;translation_table=11;start_type=ATG;product=uncharacterized membrane protein YeaQ/YmgE (transglycosylase-associated protein family);product_source=COG2261;cog=COG2261;pfam=PF04226
Ga0495479_0000001       GeneMark.hmm-2 v1.05    CDS     7044    7304    41.09   -       0       ID=Ga0495479_0000001_7044_7304;translation_table=11;start_type=GTG;product=uncharacterized membrane protein YeaQ/YmgE (transglycosylase-associated protein family);product_source=COG2261;cog=COG2261;pfam=PF04226

GFF doesn't have a particularly formal way of ensuring identifiers are unambiguous. In some flavours of GFF you will see bona fide CURIEs; in others the prefix is only implicit from the key (e.g. cog, pfam, ec_number, ...). See this preprint for recommendations on improving this situation.
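
To make the "implicit from the key" point concrete, here is a minimal sketch that turns column 9 values into explicit CURIEs. The key-to-prefix table and the function name are hypothetical, chosen only for illustration; real prefixes would need to come from an agreed registry. The example string is an abbreviated version of the attributes on the third line of the sample above.

```python
# Hypothetical mapping from attribute key to CURIE prefix (an assumption, not a standard).
KEY_PREFIX = {"cog": "COG", "ko": "KO", "ec_number": "EC", "pfam": "PFAM"}


def extract_curies(column9, key_prefix=KEY_PREFIX):
    """Collect CURIEs from the known cross-reference keys in a GFF column 9 string."""
    curies = []
    for field in column9.split(";"):
        key, sep, value = field.partition("=")
        prefix = key_prefix.get(key)
        if not sep or prefix is None:
            continue  # not key=value, or not a key we know how to interpret
        for item in value.split(","):
            # Values such as KO:K01247 already carry their prefix; bare ones do not.
            curies.append(item if item.startswith(prefix + ":") else f"{prefix}:{item}")
    return curies


example = (
    "ID=Ga0495479_0000001_1415_2068;product_source=KO:K01247;cog=COG0122;"
    "ko=KO:K01247;ec_number=EC:3.2.2.21;pfam=PF00730"
)
curies = extract_curies(example)
print(curies)
```

Keys outside the table (here `ID` and `product_source`) are ignored rather than guessed at, which keeps free-text values such as `product` from being mistaken for identifiers.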

Now that I look at the prokka file again, I see that it's not even using the recommended Ontology_term attribute, so this is looking more like some kind of bespoke GFF tool that takes into account multiple idiosyncrasies, definitely outside sssom-py.