microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

document mapping between NMDC-style GFF3 and schema annotation component #184

Open cmungall opened 3 years ago

cmungall commented 3 years ago

The schema has inline docs detailing the mapping, but we should provide a higher-level guide. I will sketch it out in this ticket, and then it can be turned into docs on the site. I'm doing this quickly, so if anything is confusing, it's likely I made a mistake. I will also give examples in YAML, but the JSON cognate should be obvious.

Example

Ga0185794_41    GeneMark.hmm-2 v1.05    CDS     48      1037    56.13   +       0       ID=Ga0185794_41_48_1037;translation_table=11;start_type=ATG;product=5-methylthioadenosine/S-adenosylhomocysteine deaminase;product_source=KO:K12960;cath_funfam=3.20.20.140;cog=COG0402;ko=KO:K12960;ec_number=EC:3.5.4.28,EC:3.5.4.31;pfam=PF01979;superfamily=51338,51556

This GFF line represents the output of structural annotation (the prediction of a CDS on sequence Ga0185794_41). The CDS is given a protein ID, Ga0185794_41_48_1037, skolemized from the reference plus coordinates. Col9 represents the outputs of functional annotation. @scanon @hubin-keio do I have this right?
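As a rough illustration of the column layout described above, here is a minimal sketch that splits one GFF3 line into its eight fixed columns plus the col9 attributes. The function name `parse_gff3_line` and the shape of the returned dict are assumptions made for this example; they are not part of the NMDC schema or of any existing converter.

```python
from urllib.parse import unquote

# Names of the first eight GFF3 columns; col9 holds the attributes.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Split one tab-separated GFF3 feature line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # col9: ';'-separated key=value pairs; a value may hold several ','-separated items.
    attributes = {}
    for pair in fields[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = [unquote(item) for item in value.split(",")]
    record["attributes"] = attributes
    return record


example_line = "\t".join([
    "Ga0185794_41", "GeneMark.hmm-2 v1.05", "CDS", "48", "1037", "56.13", "+", "0",
    "ID=Ga0185794_41_48_1037;translation_table=11;start_type=ATG;"
    "product=5-methylthioadenosine/S-adenosylhomocysteine deaminase;"
    "product_source=KO:K12960;cath_funfam=3.20.20.140;cog=COG0402;ko=KO:K12960;"
    "ec_number=EC:3.5.4.28,EC:3.5.4.31;pfam=PF01979;superfamily=51338,51556",
])
record = parse_gff3_line(example_line)
print(record["attributes"]["ID"], record["attributes"]["ec_number"])
```

The structural part of the line ends up in the fixed columns, and everything that functional annotation produced ends up under `attributes`.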

The core feature would be represented as an instance of https://microbiomedata.github.io/nmdc-metadata/docs/GenomeFeature


so our initial object would look like:

seqid: NMDC:Ga0185794_41 
start: 48
end: 1037
strand: "+"
type: SO:0000316 ## note we use a key to the SO object which is itself an instance of ControlledTerm
encodes: NMDC:Ga0185794_41_48_1037

Note all IDs are prefixed to conform to the NMDC identifiers standards doc

Note the 'encodes' field, to link to a GeneProduct (this will always be a Protein for the current pipeline, but in future we may have ncRNA annotations)

https://microbiomedata.github.io/nmdc-metadata/docs/GeneProduct

Currently the GeneProduct field is fairly bare, but in future additional fields could be added - e.g. AA seq. The GeneProduct is what functional annotations are attached to
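To make the feature/product split concrete, the sketch below builds the two objects as plain Python dicts from already-parsed values. The helper names are hypothetical, and the assumption that a GeneProduct carries only an `id` slot reflects the "fairly bare" description above rather than a checked reading of the schema.

```python
def genome_feature(seqid, start, end, strand, feature_id, so_type="SO:0000316"):
    """Build a GenomeFeature-like dict (hypothetical helper; field names follow the example above)."""
    return {
        "seqid": f"NMDC:{seqid}",          # all IDs carry the NMDC prefix
        "start": start,
        "end": end,
        "strand": strand,
        "type": so_type,                   # key to the SO term; SO:0000316 is CDS
        "encodes": f"NMDC:{feature_id}",   # link to the GeneProduct
    }


def gene_product(feature_id):
    """Build a minimal GeneProduct-like dict; the single 'id' slot is an assumption."""
    return {"id": f"NMDC:{feature_id}"}


feature = genome_feature("Ga0185794_41", 48, 1037, "+", "Ga0185794_41_48_1037")
product = gene_product("Ga0185794_41_48_1037")
print(feature["encodes"] == product["id"])
```

Keeping `encodes` equal to the product's `id` is what lets functional annotations, which hang off the GeneProduct, be traced back to the feature.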

https://microbiomedata.github.io/nmdc-metadata/docs/FunctionalAnnotation

Each ';'-separated section in col9 of the GFF that names a controlled term (e.g. ko, cog, pfam) would correspond to a separate annotation. The annotation links the gene product to the controlled term.

Minimally this looks like:

subject: NMDC:Ga0185794_41_48_1037  ## this is the protein ID
has_function: KO:K12960
was_informed_by: <...provenance here>

For people familiar with the GO annotation system, each entry here represents a line in GAF or GPAD format files.

There would be one entry for each annotation. Above I am only showing the KO annotation
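The "one entry per annotation" rule can be sketched as a loop over the col9 keys that carry controlled terms. The tuple `ANNOTATION_KEYS` and the function name are assumptions for illustration; the provenance slot (`was_informed_by`) is left out because its content is not defined in this ticket.

```python
# col9 keys assumed to carry controlled-term IDs (illustrative list, not from the schema).
ANNOTATION_KEYS = ("ko", "cog", "pfam", "cath_funfam", "ec_number",
                   "tigrfam", "smart", "superfamily")


def functional_annotations(protein_id, attributes):
    """Emit one FunctionalAnnotation-like dict per controlled term found in col9."""
    annotations = []
    for key in ANNOTATION_KEYS:
        for term in attributes.get(key, []):
            annotations.append({"subject": protein_id, "has_function": term})
    return annotations


# Attributes of the example line, already split into key -> list of values.
attributes = {
    "ID": ["Ga0185794_41_48_1037"],
    "translation_table": ["11"],
    "ko": ["KO:K12960"],
    "cog": ["COG0402"],
    "pfam": ["PF01979"],
    "cath_funfam": ["3.20.20.140"],
    "ec_number": ["EC:3.5.4.28", "EC:3.5.4.31"],
    "superfamily": ["51338", "51556"],
}
annotations = functional_annotations("NMDC:Ga0185794_41_48_1037", attributes)
for annotation in annotations:
    print(annotation)
```

Keys such as `ID` and `translation_table` are skipped because they describe the feature itself, not a function of the gene product.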

Note the string KO:K12960 is a key to a controlled term object. An example would be:

id: KEGG.KO:K12960  ## note we follow identifiers.org standards here for unique IDs
name: mtaD
description: 5-methylthioadenosine/S-adenosylhomocysteine deaminase

This is a minimal representation of the KEGG KO object. It can be linked to other controlled term objects. For example, it can include mappings to other systems (EC), parent/child hierarchies, links to pathways, etc., and these can all be traced to compounds (chemical entities). However, we will only support simple KO search for GSP, so I will not detail that here. Please refer to #176 for using pathway knowledge to implement more advanced search.

Annotations to other systems (e.g. Pfam) are handled analogously. Please see the id_prefixes field in the schema to see the canonical ID prefix for each system.

TODO: document how the feature connects to the metaG/T output

to be discussed

hubin-keio commented 3 years ago

Thanks so much, @cmungall . This is really helpful. Yes, col 9 is the annotation column. I have looked into the Python source code that is generated from the schema file and created a separate repository (microbiomedata/pynmdc) so that people can try out the code easily. You, @wdduncan @scanon have been added to this repo, among a few others.

What I need help with is how to:

  1. Translate 'CDS' to 'SO:0000316,' which is a ControlledTermValue?
  2. Translate other features in annotation to ControlledTermValue?
  3. Define "encodes" in "GeneProduct"

Thanks!

hubin-keio commented 3 years ago

I was able to get the string "SO:0000316" from the SO ontology, but I still need to see an example of how to pass the correct mapping object to "type." Thanks, @cmungall @wdduncan.

cmungall commented 3 years ago

@hubin-keio - all you need to do is fill in the SO ID for the type. Let me know if you want help mapping other types (will we have other types than CDS?)

For 3, 'encodes' is the relationship between a feature like a CDS and the protein
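For question 1, a lookup table is probably enough. The sketch below maps GFF3 type strings to Sequence Ontology IDs; only the CDS entry comes from this ticket. The tRNA and rRNA entries are added as plausible future types, and the function name is hypothetical.

```python
# GFF3 column-3 type -> Sequence Ontology ID.
# Only CDS appears in the current pipeline output; tRNA/rRNA are illustrative extras.
SO_TYPE_BY_GFF_TYPE = {
    "CDS": "SO:0000316",
    "tRNA": "SO:0000253",
    "rRNA": "SO:0000252",
}


def so_type_for(gff_type):
    """Return the SO ID for a GFF3 feature type, failing loudly on unmapped types."""
    try:
        return SO_TYPE_BY_GFF_TYPE[gff_type]
    except KeyError:
        raise ValueError(f"no SO mapping configured for GFF type {gff_type!r}") from None


print(so_type_for("CDS"))
```

Raising on an unknown type, rather than passing the raw string through, keeps unmapped feature types from silently entering the JSON.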

hubin-keio commented 3 years ago

Thanks, @cmungall. I tried it using the schema (nmdc.py). Below is the code and error message:

    nmdc_gf = schema.GenomeFeature(
        seqid=f'NMDC:{rec.id}',  # record id
        start=feature_start,
        end=feature_end,
        strand=feature_strand,
        type='SO:0000316',
        encodes=f'NMDC:{feature_id}'  # feature id # FIXME
    )

    ERROR: test_GenomeFeature (__main__.testMetadata)

    Traceback (most recent call last):
      File "/home/hu/Projects/NMDC/pynmdc/src/nmdc/tests/test_testadata.py", line 55, in test_GenomeFeature
        encodes=f'NMDC:{feature_id}' # feature id # FIXME
      File "<string>", line 10, in __init__
      File "/home/hu/Projects/NMDC/pynmdc/src/nmdc/metadata/schema.py", line 1561, in __post_init__
        self.type = ControlledTermValue(self.type)
    TypeError: type object argument after ** must be a mapping, not str

cmungall commented 3 years ago

Ah, good catch. It looks like the schema was using a ControlledTermValue class, which is a container for an OntologyClass. I'm fixing it to just use OntologyClass directly. With my PR your code should work.

(the point of the container class is for biosample attributes, where every attribute assignment has specific provenance attached, and also allows storage of unnormalized string forms of structured values)

Aside: I didn't know you were using the python object model - great! But if you like you can just create json objects directly with Python dicts, the choice is yours.
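A minimal sketch of the plain-dict route, using only the standard library. The field values are taken from the example at the top of this ticket; wrapping them in a bare dict (with no surrounding container object) is an assumption about what the final JSON layout will be.

```python
import json

# A GenomeFeature written directly as a Python dict, with no generated classes involved.
feature = {
    "seqid": "NMDC:Ga0185794_41",
    "start": 48,
    "end": 1037,
    "strand": "+",
    "type": "SO:0000316",
    "encodes": "NMDC:Ga0185794_41_48_1037",
}

feature_json = json.dumps(feature, indent=2)
print(feature_json)
```

The trade-off is that nothing checks slot names or types at construction time, so this route relies on validating the emitted JSON against the schema afterwards.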

hubin-keio commented 3 years ago

Thanks. I just checked out branch issue-184-cv and it solved the problem of assigning a string value to "type." How do I assign the other features/properties to GenomeFeature objects, @cmungall? Using the example you provided above, these features are pasted below. I was thinking that using the Python object model generated from the schema may provide better data consistency. I can add separate functions to translate GenomeFeature objects to JSON.

ID => ['Ga0185794_41_48_1037']
translation_table => ['11']
start_type => ['ATG']
product => ['5-methylthioadenosine/S-adenosylhomocysteine deaminase']
product_source => ['KO:K12960']
cath_funfam => ['3.20.20.140']
cog => ['COG0402']
ko => ['KO:K12960']
ec_number => ['EC:3.5.4.28', 'EC:3.5.4.31']
pfam => ['PF01979']
superfamily => ['51338', '51556']
source => ['GeneMark.hmm-2 v1.05']
score => ['56.13']
phase => ['0']

cmungall commented 3 years ago

Currently the schema doesn't support translation_table or start_type. I don't see a use case for these in the immediate future, so I suggest we proceed incrementally: don't include them in the JSON for now, and return to this later.

For the source field, we have a provenance model (prov). If it's OK, I'll return later to describing how to fit the program used for prediction into this. We'll also want to record provenance on each individual functional annotation (whether it comes from prokka, an hmm, ...), but again I suggest returning to this.

For all of the functional annotations, I suggest we use IDs using standardized identifiers.org / n2t.net prefixes. These are included in the yaml and also visible on the html docs.

E.g.

https://microbiomedata.github.io/nmdc-metadata/docs/OrthologyGroup

Identifier prefixes
KEGG.ORTHOLOGY
EGGNOG
PFAM
TIGRFAM
SUPFAM
PANTHER.FAMILY

so rather than KO:K12960 we would use the more standard KEGG.ORTHOLOGY:K12960

Tip for aim3: all identifiers in NMDC should be resolvable via identifiers.org or n2t.net

E.g. http://identifiers.org/KEGG.ORTHOLOGY:K12960

http://identifiers.org/CATH:3.20.20.140
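The prefix rewrite can be sketched as a small lookup keyed on the col9 attribute name. The `ko` and `cath_funfam` rows follow the two examples above; the `pfam` and `tigrfam` rows pair those keys with the listed prefixes of the same name, which is an assumption. Keys with no agreed prefix in this ticket (cog, smart, superfamily, ec_number) are deliberately left out, and both function names are hypothetical.

```python
# col9 key -> identifiers.org-style prefix (partial; see the caveats above).
PREFIX_BY_GFF_KEY = {
    "ko": "KEGG.ORTHOLOGY",
    "cath_funfam": "CATH",
    "pfam": "PFAM",
    "tigrfam": "TIGRFAM",
}


def to_curie(gff_key, value):
    """Rewrite a raw col9 value as PREFIX:local_id, dropping any pipeline-local prefix."""
    prefix = PREFIX_BY_GFF_KEY[gff_key]
    local_id = value.split(":", 1)[-1]  # 'KO:K12960' -> 'K12960'; 'PF01979' unchanged
    return f"{prefix}:{local_id}"


def resolver_url(curie):
    """Build the identifiers.org URL in the form shown above."""
    return f"http://identifiers.org/{curie}"


print(to_curie("ko", "KO:K12960"))
print(resolver_url(to_curie("cath_funfam", "3.20.20.140")))
```

Looking up an unmapped key raises `KeyError`, which makes it obvious when a new annotation source needs a prefix decision.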

cmungall commented 3 years ago

superfamily => ['51338', '51556']

is this correct?

http://supfam.org/SUPERFAMILY/51338

gives 404

is it this? https://registry.identifiers.org/registry/supfam

or this? https://registry.identifiers.org/registry/cath.superfamily

cmungall commented 3 years ago

on the call I volunteered me/@deepakunni3 to help @hubin-keio with the gff->json transform

to help we would need some sample gff3 files. This is what I have from a random img analysis, is this representative?

Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     2       217     9.84    -       0       ID=Ga0185794_01_2_217;translation_table=11;partial=3';start_type=ATG;product=isoaspartyl peptidase/L-asparaginase-like protein (Ntn-hydrolase superfamily);product_source=COG1446;cog=COG1446;pfam=PF01112;superfamily=56235
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     249     1208    52.49   +       0       ID=Ga0185794_01_249_1208;translation_table=11;start_type=TTG;product=hypothetical protein;product_source=Hypo-rule applied;superfamily=56784
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     1388    2383    46.14   +       0       ID=Ga0185794_01_1388_2383;translation_table=11;start_type=ATG;product=large subunit ribosomal protein L3;product_source=KO:K02906;cath_funfam=4.10.960.10;cog=COG0087;ko=KO:K02906;pfam=PF00297;superfamily=50447;tigrfam=TIGR03626
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     2399    3199    47.65   +       0       ID=Ga0185794_01_2399_3199;translation_table=11;start_type=TTG;product=large subunit ribosomal protein L4e;product_source=KO:K02930;cath_funfam=3.40.1370.10;cog=COG0088;ko=KO:K02930;pfam=PF00573;superfamily=52166;tigrfam=TIGR03672
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     3271    3498    11.41   +       0       ID=Ga0185794_01_3271_3498;translation_table=11;start_type=ATG;product=large subunit ribosomal protein L23;product_source=KO:K02892;cath_funfam=3.30.70.330;cog=COG0089;ko=KO:K02892;pfam=PF00276;superfamily=54189;tigrfam=TIGR03636
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     3511    4266    32.58   +       0       ID=Ga0185794_01_3511_4266;translation_table=11;start_type=ATG;product=large subunit ribosomal protein L2;product_source=KO:K02886;cath_funfam=2.30.30.30,2.40.50.140,4.10.950.10;cog=COG0090;ko=KO:K02886;pfam=PF00181,PF03947;smart=SM01382,SM01383;superfamily=50104,50249
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     4402    4995    48.79   -       0       ID=Ga0185794_01_4402_4995;translation_table=11;start_type=GTG;product=hypothetical protein;product_source=Hypo-rule applied;pfam=PF07691;superfamily=49785
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     5123    5308    3.46    -       0       ID=Ga0185794_01_5123_5308;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied;cath_funfam=3.30.565.10;superfamily=53335
Ga0185794_01    Prodigal v2.6.3 CDS     5378    5494    2.3     -       0       ID=Ga0185794_01_5378_5494;translation_table=11;start_type=ATG;product=2-polyprenyl-6-methoxyphenol hydroxylase-like FAD-dependent oxidoreductase;product_source=COG0654;cath_funfam=3.50.50.60;cog=COG0654;superfamily=51905
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     6206    7969    83.73   +       0       ID=Ga0185794_01_6206_7969;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied;smart=SM00933
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     8028    9782    88.16   +       0       ID=Ga0185794_01_8028_9782;translation_table=11;start_type=TTG;product=DNA helicase HerA-like ATPase;product_source=COG0433;cath_funfam=3.40.50.300;cog=COG0433;ko=KO:K06915;pfam=PF01935;superfamily=52540
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     9902    10324   15.37   +       0       ID=Ga0185794_01_9902_10324;translation_table=11;start_type=TTG;product=small subunit ribosomal protein S19;product_source=KO:K02965;cath_funfam=3.30.860.10;cog=COG0185;ko=KO:K02965;pfam=PF00203;superfamily=54570;tigrfam=TIGR01025
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     10406   10894   28.90   +       0       ID=Ga0185794_01_10406_10894;translation_table=11;start_type=TTG;product=nicotinamide-nucleotide adenylyltransferase;product_source=KO:K00952;cath_funfam=3.40.50.620;cog=COG1056;ko=KO:K00952;ec_number=EC:2.7.7.1;pfam=PF01467;superfamily=52374;tigrfam=TIGR01527
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     10910   11443   29.86   -       0       ID=Ga0185794_01_10910_11443;translation_table=11;start_type=ATG;product=O-acetyl-ADP-ribose deacetylase (regulator of RNase III);product_source=COG2110;cath_funfam=3.40.220.10;cog=COG2110;pfam=PF01661;smart=SM00506;superfamily=52949
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     11479   11637   1.91    -       0       ID=Ga0185794_01_11479_11637;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     11886   14015   139.12  +       0       ID=Ga0185794_01_11886_14015;translation_table=11;start_type=ATG;product=ATP-binding cassette subfamily C protein;product_source=KO:K06148;cath_funfam=1.20.1560.10,2.30.29.50,3.40.50.300;cog=COG1132;ko=KO:K06148;pfam=PF00005,PF00664,PF14470;smart=SM00382;superfamily=50729,52540
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     14021   14530   15.91   +       0       ID=Ga0185794_01_14021_14530;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied;pfam=PF08909
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     14729   15001   1.01    -       0       ID=Ga0185794_01_14729_15001;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     14998   15921   65.79   -       0       ID=Ga0185794_01_14998_15921;translation_table=11;start_type=TTG;product=aspartate carbamoyltransferase catalytic subunit;product_source=KO:K00609;cath_funfam=3.40.50.1370;cog=COG0540;ko=KO:K00609;ec_number=EC:2.1.3.2;pfam=PF00185,PF02729;superfamily=53671;tigrfam=TIGR00670
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     16069   16869   43.95   -       0       ID=Ga0185794_01_16069_16869;translation_table=11;start_type=ATG;product=D-amino peptidase;product_source=KO:K16203;cath_funfam=3.30.1360.130,3.40.50.10780;cog=COG2362;ko=KO:K16203;ec_number=EC:3.4.11.-;pfam=PF04951;superfamily=63992
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     17103   18191   67.60   +       0       ID=Ga0185794_01_17103_18191;translation_table=11;start_type=TTG;product=tryptophanyl-tRNA synthetase;product_source=KO:K01867;cath_funfam=1.10.240.10,3.40.50.620;cog=COG0180;ko=KO:K01867;ec_number=EC:6.1.1.2;pfam=PF00579;superfamily=52374;tigrfam=TIGR00233
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     18195   19700   119.58  +       0       ID=Ga0185794_01_18195_19700;translation_table=11;start_type=TTG;product=phenylalanyl-tRNA synthetase alpha chain;product_source=KO:K01889;cath_funfam=3.30.930.10;cog=COG0016;ko=KO:K01889;ec_number=EC:6.1.1.20;pfam=PF01409;superfamily=46785,55681;tigrfam=TIGR00468
Ga0185794_01    GeneMark.hmm-2 v1.05    CDS     19706   21400   105.79  +       0       ID=Ga0185794_01_19706_21400;translation_table=11;start_type=ATG;product=phenylalanyl-tRNA synthetase beta chain;product_source=KO:K01890;cath_funfam=3.30.56.10,3.30.930.10,3.50.40.10;cog=COG0072;ko=KO:K01890;ec_number=EC:6.1.1.20;pfam=PF03483,PF03484;smart=SM00873;superfamily=55681,56037;tigrfam=TIGR00471

dehays commented 3 years ago

@cmungall Here are a few of the 138661 lines of 1781_1000325_functional_annotation.gff (from /global/project/projectdirs/m3408/ficus/pipeline_products/1781_100325/annotation/ ). This is from one of the current Stegen metaG annotation workflow outputs.

1781_100325_scf_1000_c1 GeneMark.hmm-2 v1.05    CDS     3       1700    192.32  +       0       ID=1781_100325_scf_1000_c1_3_1700;translation_table=11;partial=5',3';product=PAS domain S-box-containing protein;product_source=TIGR00229;cath_funfam=3.30.450.20;cog=COG2202;pfam=GA,PA,PAS_;smart=55781,55785;superfamily=SM00086,SM00091;tigrfam=TIGR00229
1781_100325_scf_1001_c1 GeneMark.hmm-2 v1.05    CDS     82      573     13.74   +       0       ID=1781_100325_scf_1001_c1_82_573;translation_table=11;product=uncharacterized membrane protein;product_source=COG2237;cog=COG2237
1781_100325_scf_1001_c1 GeneMark.hmm-2 v1.05    CDS     859     1671    32.29   -       0       ID=1781_100325_scf_1001_c1_859_1671;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied
1781_100325_scf_1002_c1 GeneMark.hmm-2 v1.05    CDS     1       99      0.46    +       0       ID=1781_100325_scf_1002_c1_1_99;translation_table=11;partial=5';product=large subunit ribosomal protein L18;product_source=KO:K02881;ko=KO:K02881
1781_100325_scf_1002_c1 GeneMark.hmm-2 v1.05    CDS     96      731     37.60   +       0       ID=1781_100325_scf_1002_c1_96_731;translation_table=11;product=small subunit ribosomal protein S5;product_source=KO:K02988;cath_funfam=3.30.160.20,3.30.230.10;cog=COG0098;ko=KO:K02988;pfam=Ribosomal_S,Ribosomal_S5_;smart=54211,54768;tigrfam=TIGR01020
1781_100325_scf_1002_c1 GeneMark.hmm-2 v1.05    CDS     765     1205    25.99   +       0       ID=1781_100325_scf_1002_c1_765_1205;translation_table=11;product=large subunit ribosomal protein L30;product_source=KO:K02907;cath_funfam=3.30.1390.20;cog=COG1841;ko=KO:K02907;pfam=Ribosomal_L3;smart=55129;tigrfam=TIGR01309
1781_100325_scf_1002_c1 GeneMark.hmm-2 v1.05    CDS     1207    1629    20.09   +       0       ID=1781_100325_scf_1002_c1_1207_1629;translation_table=11;product=large subunit ribosomal protein L15;product_source=KO:K02876;cath_funfam=3.100.10.10,4.10.990.10;cog=COG0200;ko=KO:K02876;pfam=Ribosomal_L27;smart=52080
1781_100325_scf_1002_c1 Prodigal v2.6.3 CDS     1626    1700    -4.7    +       0       ID=1781_100325_scf_1002_c1_1626_1700;translation_table=11;partial=3';product=hypothetical protein;product_source=Hypo-rule applied
1781_100325_scf_1003_c1 GeneMark.hmm-2 v1.05    CDS     1       543     44.68   -       0       ID=1781_100325_scf_1003_c1_1_543;translation_table=11;partial=3';product=drug/metabolite transporter (DMT)-like permease;product_source=COG0697;cog=COG0697;pfam=Eam;smart=103481
1781_100325_scf_1003_c1 GeneMark.hmm-2 v1.05    CDS     597     1697    97.45   +       0       ID=1781_100325_scf_1003_c1_597_1697;translation_table=11;partial=3';product=phosphoribosylformylglycinamidine synthase;product_source=KO:K01952;cath_funfam=3.30.1330.10,3.90.650.10;cog=COG0046;ko=KO:K01952;ec_number=EC:6.3.5.3;pfam=AIR,AIRS_;smart=55326,56042;tigrfam=TIGR01736
1781_100325_scf_1004_c1 GeneMark.hmm-2 v1.05    CDS     1       255     8.85    -       0       ID=1781_100325_scf_1004_c1_1_255;translation_table=11;partial=3';product=predicted RNA-binding protein with TRAM domain;product_source=COG3269;cath_funfam=2.40.50.140;cog=COG3269;pfam=TRA;smart=50249
1781_100325_scf_1004_c1 GeneMark.hmm-2 v1.05    CDS     313     1146    46.17   -       0       ID=1781_100325_scf_1004_c1_313_1146;translation_table=11;product=aspartate dehydrogenase;product_source=KO:K06989;cath_funfam=3.30.360.10,3.40.50.720;cog=COG1712;ko=KO:K06989;ec_number=EC:1.4.1.21;pfam=DUF10,DapB_,NAD_binding_;smart=51735,55347;tigrfam=TIGR03855
1781_100325_scf_1004_c1 GeneMark.hmm-2 v1.05    CDS     1251    1433    2.26    -       0       ID=1781_100325_scf_1004_c1_1251_1433;translation_table=11;product=small subunit ribosomal protein S30e;product_source=KO:K02983;cog=COG4919;ko=KO:K02983;pfam=Ribosomal_S3
1781_100325_scf_1005_c1 GeneMark.hmm-2 v1.05    CDS     2       829     71.94   -       0       ID=1781_100325_scf_1005_c1_2_829;translation_table=11;partial=3';product=DNA polymerase-4;product_source=KO:K02346;cath_funfam=1.10.150.20,3.30.70.270;cog=COG0389;ko=KO:K02346;ec_number=EC:2.7.7.7;pfam=IM;smart=56672;superfamily=SM00278
1781_100325_scf_1005_c1 GeneMark.hmm-2 v1.05    CDS     877     1473    59.07   -       0       ID=1781_100325_scf_1005_c1_877_1473;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied;cath_funfam=1.20.5.100;smart=51735
1781_100325_scf_1006_c1 GeneMark.hmm-2 v1.05    CDS     2       730     86.59   -       0       ID=1781_100325_scf_1006_c1_2_730;translation_table=11;partial=3';product=cysteinyl-tRNA synthetase;product_source=KO:K01883;cath_funfam=3.40.50.620;cog=COG0215;ko=KO:K01883;ec_number=EC:6.1.1.16;pfam=tRNA-synt_1;smart=52374
1781_100325_scf_1006_c1 GeneMark.hmm-2 v1.05    CDS     761     1696    110.06  -       0       ID=1781_100325_scf_1006_c1_761_1696;translation_table=11;partial=5';product=ATP-dependent Zn protease;product_source=COG0465;cath_funfam=1.10.8.60;cog=COG0465;pfam=Peptidase_M4;smart=140990
1781_100325_scf_1007_c1 GeneMark.hmm-2 v1.05    CDS     264     1550    115.96  -       0       ID=1781_100325_scf_1007_c1_264_1550;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied;pfam=DDE_Tnp_,DDE_Tnp_1_;smart=53098
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05    CDS     9       113     5.74    +       0       ID=1781_100325_scf_1008_c1_9_113;translation_table=11;product=elongation factor P;product_source=KO:K02356;cath_funfam=2.40.50.140;ko=KO:K02356;pfam=Elong-fact-P_;smart=50249;superfamily=SM00841
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05    CDS     118     546     41.51   +       0       ID=1781_100325_scf_1008_c1_118_546;translation_table=11;product=N utilization substance protein B;product_source=KO:K03625;cath_funfam=1.10.940.10;cog=COG0781;ko=KO:K03625;pfam=Nus;smart=48013;tigrfam=TIGR01951
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05    CDS     554     760     24.98   -       0       ID=1781_100325_scf_1008_c1_554_760;translation_table=11;product=sec-independent protein translocase protein TatA;product_source=KO:K03116;cog=COG1826;ko=KO:K03116;pfam=MttA_Hcf10;tigrfam=TIGR01411
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05    CDS     996     1562    81.24   +       0       ID=1781_100325_scf_1008_c1_996_1562;translation_table=11;product=pyrimidine operon attenuation protein/uracil phosphoribosyltransferase;product_source=KO:K02825;cath_funfam=3.40.50.2020;cog=COG2065;ko=KO:K02825;ec_number=EC:2.4.2.9;pfam=Pribosyltra;smart=53271
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05    CDS     1559    1696    8.04    +       0       ID=1781_100325_scf_1008_c1_1559_1696;translation_table=11;partial=3';product=aspartate carbamoyltransferase catalytic subunit;product_source=KO:K00609;cath_funfam=3.40.50.1370;cog=COG0540;ko=KO:K00609;ec_number=EC:2.1.3.2;smart=53671
1781_100325_scf_1009_c1 GeneMark.hmm-2 v1.05    CDS     97      582     43.72   +       0       ID=1781_100325_scf_1009_c1_97_582;translation_table=11;product=HEAT repeat protein;product_source=COG1413;cath_funfam=1.25.10.20;cog=COG1413;pfam=HEAT_;smart=48371
1781_100325_scf_1009_c1 Prodigal v2.6.3 CDS     579     749     8.4     -       0       ID=1781_100325_scf_1009_c1_579_749;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied;smart=57802
1781_100325_scf_1009_c1 GeneMark.hmm-2 v1.05    CDS     746     1693    36.99   -       0       ID=1781_100325_scf_1009_c1_746_1693;translation_table=11;product=integrase/recombinase XerD;product_source=KO:K04763;cath_funfam=1.10.150.130,1.10.443.10;cog=COG4974;ko=KO:K04763;pfam=Phage_int_SAM_,Phage_integras;smart=56349
1781_100325_scf_100_c1  GeneMark.hmm-2 v1.05    CDS     2       397     16.15   +       0       ID=1781_100325_scf_100_c1_2_397;translation_table=11;partial=5';product=two-component system nitrogen regulation response regulator GlnG;product_source=KO:K07712;cath_funfam=3.40.50.2300;cog=COG3437;ko=KO:K07712;pfam=Response_re;smart=52172
1781_100325_scf_100_c1  GeneMark.hmm-2 v1.05    CDS     670     2514    85.48   +       0       ID=1781_100325_scf_100_c1_670_2514;translation_table=11;product=signal transduction histidine kinase;product_source=COG0642;cath_funfam=1.10.287.130,2.60.15.10,3.30.565.10;cog=COG0642;pfam=HATPase_,HisK,dCache_;smart=103190,55021,55874;superfamily=SM00387,SM00388
1781_100325_scf_100_c1  GeneMark.hmm-2 v1.05    CDS     2522    3418    30.63   +       0       ID=1781_100325_scf_100_c1_2522_3418;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied;smart=81342
1781_100325_scf_100_c1  GeneMark.hmm-2 v1.05    CDS     3598    3972    11.31   -       0       ID=1781_100325_scf_100_c1_3598_3972;translation_table=11;partial=5';product=DNA-binding beta-propeller fold protein YncE;product_source=COG3391;cath_funfam=2.120.10.30;cog=COG3391;pfam=DUF512,NH;smart=101898
1781_100325_scf_1010_c1 GeneMark.hmm-2 v1.05    CDS     3       815     59.75   -       0       ID=1781_100325_scf_1010_c1_3_815;translation_table=11;partial=3';product=spermidine/putrescine transport system permease protein;product_source=KO:K11070;cath_funfam=1.10.3720.10;cog=COG1177;ko=KO:K11070;pfam=BPD_transp_;smart=161098
1781_100325_scf_1010_c1 GeneMark.hmm-2 v1.05    CDS     812     1696    92.29   -       0       ID=1781_100325_scf_1010_c1_812_1696;translation_table=11;partial=5';product=spermidine/putrescine transport system permease protein;product_source=KO:K11071;cath_funfam=1.10.3720.10;cog=COG1176;ko=KO:K11071;pfam=BPD_transp_;smart=161098

cmungall commented 3 years ago

@hubin-keio has provided examples here: https://github.com/microbiomedata/pynmdc/tree/main/src/nmdc/test_data

hubin-keio commented 3 years ago

Can you (@cmungall) provide a complete JSON version of the original example (Ga0185794_41)? I have pulled 1000 lines of a gff file from an early run of the annotation workflow and it is available here: https://github.com/microbiomedata/pynmdc/tree/main/src/nmdc/test_data/MetaG_annotation

I am still working on the converter. The unfinished version is here: https://github.com/microbiomedata/pynmdc

I would like to see a standard JSON output example before I finalize the converter. Thanks.

deepakunni3 commented 3 years ago

@hubin-keio Quick observations:

For some examples of the JSON, see here:

cmungall commented 3 years ago

I added Deepak's examples to the repo in the examples folder: https://github.com/microbiomedata/nmdc-metadata/tree/master/examples

(note we also validate against all examples in this folder as unit tests and within github/travis CI)

hubin-keio commented 3 years ago

Thanks for the comments. @deepakunni3, is your parser working? I have committed the last planned update before GSP this morning.

hubin-keio commented 3 years ago

The "was_generated_by": "N/A" field is still there in your examples. Maybe you want to remove it in your code?

deepakunni3 commented 3 years ago

Yes, the "N/A" was a placeholder to remind us that this information is missing and needs to be incorporated. Will remove from the script.

hubin-keio commented 3 years ago

In the discussions in the Aim1_standards channel it was mentioned on 1/9: "yes, never use values like "N/A", always make it an explicit json null, or simply omit the key altogether." But I am fine with your parser solution as long as it is okay with Aim 1 and Aim 3. Please post the location of your parser in the Aim 2 channel once it is done so that we can process the GFFs. Aim 3 needs the JSONs ready by this Friday (1/15).

deepakunni3 commented 3 years ago

I am not sure what the expectation here is between your pynmdc converter vs my GFF3 converter.

Perhaps we can talk more on the technical call today.

Regarding the "N/A", thanks for clarifying. That makes sense. I can replace that with null
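Both options from the channel discussion (explicit null, or omitting the key) are easy to apply as a post-processing step. This is a sketch only: the set of strings treated as placeholders and the function names are assumptions, not an agreed convention.

```python
import json

# Strings treated as "missing value" placeholders (illustrative list).
PLACEHOLDER_STRINGS = {"N/A", "n/a", "NA", ""}


def is_placeholder(value):
    """True only for placeholder strings; lists, dicts and numbers are never placeholders."""
    return isinstance(value, str) and value.strip() in PLACEHOLDER_STRINGS


def drop_placeholders(record):
    """Option 1: omit keys whose value is a placeholder."""
    return {key: value for key, value in record.items() if not is_placeholder(value)}


def null_placeholders(record):
    """Option 2: keep the key but emit an explicit JSON null."""
    return {key: (None if is_placeholder(value) else value) for key, value in record.items()}


raw = {
    "subject": "NMDC:Ga0185794_41_48_1037",
    "has_function": "KO:K12960",
    "was_generated_by": "N/A",
}
print(json.dumps(drop_placeholders(raw)))
print(json.dumps(null_placeholders(raw)))
```

Python's `None` serializes to JSON `null`, so the second option needs no special handling in the writer.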

cmungall commented 3 years ago

the was_generated_by field should link to the MetagenomeAnnotation activity

we will better document this in the schema

(this answers @scanon's Q on the tech sync call)