Open cmungall opened 3 years ago
Thanks so much, @cmungall. This is really helpful. Yes, col 9 is the annotation column. I have looked into the Python source code that is generated from the schema file and created a separate repository (microbiomedata/pynmdc) so that people can try out the code easily. You, @wdduncan, and @scanon have been added to this repo, among a few others.
What I need help with is how to:
Thanks!
I was able to get the string "SO:0000316" from the SO-Ontologies but I still need to see an example of how to parse the correct mapping object to "type." Thanks, @cmungall @wdduncan.
@hubin-keio - all you need to do is fill in the SO ID for the type. Let me know if you want help mapping other types (will we have other types than CDS?)
For 3, 'encodes' is the relationship between a feature like a CDS and the protein
Thanks, @cmungall. I tried it using the schema (nmdc.py). Below is the code and error message:
nmdc_gf = schema.GenomeFeature(
    seqid=f'NMDC:{rec.id}',  # record id
    start=feature_start,
    end=feature_end,
    strand=feature_strand,
    type='SO:0000316',
    encodes=f'NMDC:{feature_id}'  # feature id  # FIXME
)
Traceback (most recent call last):
File "/home/hu/Projects/NMDC/pynmdc/src/nmdc/tests/test_testadata.py", line 55, in test_GenomeFeature
encodes=f'NMDC:{feature_id}' # feature id # FIXME
File "
Ah good catch, it looks like the schema was using a ControlledTermValue class, which is a container for an OntologyClass. I'm fixing it to just use OntologyClass directly. With my PR your code should work.
(the point of the container class is for biosample attributes, where every attribute assignment has specific provenance attached, and also allows storage of unnormalized string forms of structured values)
Aside: I didn't know you were using the python object model - great! But if you like you can just create json objects directly with Python dicts, the choice is yours.
Thanks. I just checked out branch issue-184-cv and it solved the problem of assigning a string value to "type." For the other properties, how do I assign other features/properties to GenomeFeature objects, @cmungall? Using the example you provided above, these features are pasted below. I was thinking that using the Python object model generated from the schema may provide better data consistency. I can add separate functions to translate GenomeFeature objects to JSON.
ID => ['Ga0185794_41_48_1037']
translation_table => ['11']
start_type => ['ATG']
product => ['5-methylthioadenosine/S-adenosylhomocysteine deaminase']
product_source => ['KO:K12960']
cath_funfam => ['3.20.20.140']
cog => ['COG0402']
ko => ['KO:K12960']
ec_number => ['EC:3.5.4.28', 'EC:3.5.4.31']
pfam => ['PF01979']
superfamily => ['51338', '51556']
source => ['GeneMark.hmm-2 v1.05']
score => ['56.13']
phase => ['0']
Currently the schema doesn't support translation_table or start_type. I don't see a use case for these in the immediate future, so I suggest we proceed incrementally: don't include them in the json for now, and return to this later.
For the source field, we have a provenance model (prov). If it's OK I'll return to describing how to fit in the program used for prediction into this. We'll also want to record provenance on each individual functional annotation (whether it comes from prokka, an hmm, ...) but again I suggest returning to this.
For all of the functional annotations, I suggest we use IDs using standardized identifiers.org / n2t.net prefixes. These are included in the yaml and also visible on the html docs.
E.g.
https://microbiomedata.github.io/nmdc-metadata/docs/OrthologyGroup
Identifier prefixes
KEGG.ORTHOLOGY
EGGNOG
PFAM
TIGRFAM
SUPFAM
PANTHER.FAMILY
so rather than KO:K12960 we would use the more standard KEGG.ORTHOLOGY:K12960
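A minimal sketch of that prefix rewrite in Python, in case it helps. Only the KO -> KEGG.ORTHOLOGY mapping is stated in this thread; the `PREFIX_MAP` table and the `normalize_curie` helper are hypothetical names, and any further entries would need to be checked against the id_prefixes in the schema.

```python
# Hypothetical lookup table: short prefix used in the GFF -> registry-style prefix.
# Only KO -> KEGG.ORTHOLOGY comes from this thread; add others after checking the schema.
PREFIX_MAP = {"KO": "KEGG.ORTHOLOGY"}


def normalize_curie(curie: str) -> str:
    """Rewrite the prefix of a 'PREFIX:local_id' string if a mapping is known."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        # No prefix at all (e.g. 'COG0402'): leave the value untouched.
        return curie
    return f"{PREFIX_MAP.get(prefix, prefix)}:{local_id}"


print(normalize_curie("KO:K12960"))
```

Values with an unknown prefix (e.g. EC:3.5.4.28) pass through unchanged, so the table can be extended incrementally.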
Tip for aim3: all identifiers in NMDC should be resolvable via identifiers.org or n2t.net
superfamily => ['51338', '51556']
is this correct?
http://supfam.org/SUPERFAMILY/51338
gives 404
is it this? https://registry.identifiers.org/registry/supfam
or this? https://registry.identifiers.org/registry/cath.superfamily
on the call I volunteered me/@deepakunni3 to help @hubin-keio with the gff->json transform
to help we would need some sample gff3 files. This is what I have from a random img analysis, is this representative?
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 2 217 9.84 - 0 ID=Ga0185794_01_2_217;translation_table=11;partial=3';start_type=ATG;product=isoaspartyl peptidase/L-asparaginase-like protein (Ntn-hydrolase superfamily);product_source=COG1446;cog=COG1446;pfam=PF01112;superfamily=56235
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 249 1208 52.49 + 0 ID=Ga0185794_01_249_1208;translation_table=11;start_type=TTG;product=hypothetical protein;product_source=Hypo-rule applied;superfamily=56784
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 1388 2383 46.14 + 0 ID=Ga0185794_01_1388_2383;translation_table=11;start_type=ATG;product=large subunit ribosomal protein L3;product_source=KO:K02906;cath_funfam=4.10.960.10;cog=COG0087;ko=KO:K02906;pfam=PF00297;superfamily=50447;tigrfam=TIGR03626
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 2399 3199 47.65 + 0 ID=Ga0185794_01_2399_3199;translation_table=11;start_type=TTG;product=large subunit ribosomal protein L4e;product_source=KO:K02930;cath_funfam=3.40.1370.10;cog=COG0088;ko=KO:K02930;pfam=PF00573;superfamily=52166;tigrfam=TIGR03672
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 3271 3498 11.41 + 0 ID=Ga0185794_01_3271_3498;translation_table=11;start_type=ATG;product=large subunit ribosomal protein L23;product_source=KO:K02892;cath_funfam=3.30.70.330;cog=COG0089;ko=KO:K02892;pfam=PF00276;superfamily=54189;tigrfam=TIGR03636
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 3511 4266 32.58 + 0 ID=Ga0185794_01_3511_4266;translation_table=11;start_type=ATG;product=large subunit ribosomal protein L2;product_source=KO:K02886;cath_funfam=2.30.30.30,2.40.50.140,4.10.950.10;cog=COG0090;ko=KO:K02886;pfam=PF00181,PF03947;smart=SM01382,SM01383;superfamily=50104,50249
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 4402 4995 48.79 - 0 ID=Ga0185794_01_4402_4995;translation_table=11;start_type=GTG;product=hypothetical protein;product_source=Hypo-rule applied;pfam=PF07691;superfamily=49785
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 5123 5308 3.46 - 0 ID=Ga0185794_01_5123_5308;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied;cath_funfam=3.30.565.10;superfamily=53335
Ga0185794_01 Prodigal v2.6.3 CDS 5378 5494 2.3 - 0 ID=Ga0185794_01_5378_5494;translation_table=11;start_type=ATG;product=2-polyprenyl-6-methoxyphenol hydroxylase-like FAD-dependent oxidoreductase;product_source=COG0654;cath_funfam=3.50.50.60;cog=COG0654;superfamily=51905
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 6206 7969 83.73 + 0 ID=Ga0185794_01_6206_7969;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied;smart=SM00933
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 8028 9782 88.16 + 0 ID=Ga0185794_01_8028_9782;translation_table=11;start_type=TTG;product=DNA helicase HerA-like ATPase;product_source=COG0433;cath_funfam=3.40.50.300;cog=COG0433;ko=KO:K06915;pfam=PF01935;superfamily=52540
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 9902 10324 15.37 + 0 ID=Ga0185794_01_9902_10324;translation_table=11;start_type=TTG;product=small subunit ribosomal protein S19;product_source=KO:K02965;cath_funfam=3.30.860.10;cog=COG0185;ko=KO:K02965;pfam=PF00203;superfamily=54570;tigrfam=TIGR01025
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 10406 10894 28.90 + 0 ID=Ga0185794_01_10406_10894;translation_table=11;start_type=TTG;product=nicotinamide-nucleotide adenylyltransferase;product_source=KO:K00952;cath_funfam=3.40.50.620;cog=COG1056;ko=KO:K00952;ec_number=EC:2.7.7.1;pfam=PF01467;superfamily=52374;tigrfam=TIGR01527
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 10910 11443 29.86 - 0 ID=Ga0185794_01_10910_11443;translation_table=11;start_type=ATG;product=O-acetyl-ADP-ribose deacetylase (regulator of RNase III);product_source=COG2110;cath_funfam=3.40.220.10;cog=COG2110;pfam=PF01661;smart=SM00506;superfamily=52949
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 11479 11637 1.91 - 0 ID=Ga0185794_01_11479_11637;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 11886 14015 139.12 + 0 ID=Ga0185794_01_11886_14015;translation_table=11;start_type=ATG;product=ATP-binding cassette subfamily C protein;product_source=KO:K06148;cath_funfam=1.20.1560.10,2.30.29.50,3.40.50.300;cog=COG1132;ko=KO:K06148;pfam=PF00005,PF00664,PF14470;smart=SM00382;superfamily=50729,52540
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 14021 14530 15.91 + 0 ID=Ga0185794_01_14021_14530;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied;pfam=PF08909
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 14729 15001 1.01 - 0 ID=Ga0185794_01_14729_15001;translation_table=11;start_type=ATG;product=hypothetical protein;product_source=Hypo-rule applied
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 14998 15921 65.79 - 0 ID=Ga0185794_01_14998_15921;translation_table=11;start_type=TTG;product=aspartate carbamoyltransferase catalytic subunit;product_source=KO:K00609;cath_funfam=3.40.50.1370;cog=COG0540;ko=KO:K00609;ec_number=EC:2.1.3.2;pfam=PF00185,PF02729;superfamily=53671;tigrfam=TIGR00670
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 16069 16869 43.95 - 0 ID=Ga0185794_01_16069_16869;translation_table=11;start_type=ATG;product=D-amino peptidase;product_source=KO:K16203;cath_funfam=3.30.1360.130,3.40.50.10780;cog=COG2362;ko=KO:K16203;ec_number=EC:3.4.11.-;pfam=PF04951;superfamily=63992
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 17103 18191 67.60 + 0 ID=Ga0185794_01_17103_18191;translation_table=11;start_type=TTG;product=tryptophanyl-tRNA synthetase;product_source=KO:K01867;cath_funfam=1.10.240.10,3.40.50.620;cog=COG0180;ko=KO:K01867;ec_number=EC:6.1.1.2;pfam=PF00579;superfamily=52374;tigrfam=TIGR00233
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 18195 19700 119.58 + 0 ID=Ga0185794_01_18195_19700;translation_table=11;start_type=TTG;product=phenylalanyl-tRNA synthetase alpha chain;product_source=KO:K01889;cath_funfam=3.30.930.10;cog=COG0016;ko=KO:K01889;ec_number=EC:6.1.1.20;pfam=PF01409;superfamily=46785,55681;tigrfam=TIGR00468
Ga0185794_01 GeneMark.hmm-2 v1.05 CDS 19706 21400 105.79 + 0 ID=Ga0185794_01_19706_21400;translation_table=11;start_type=ATG;product=phenylalanyl-tRNA synthetase beta chain;product_source=KO:K01890;cath_funfam=3.30.56.10,3.30.930.10,3.50.40.10;cog=COG0072;ko=KO:K01890;ec_number=EC:6.1.1.20;pfam=PF03483,PF03484;smart=SM00873;superfamily=55681,56037;tigrfam=TIGR00471
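For reference, a rough sketch of how lines like these could be parsed in Python. It assumes the real file is tab-delimited with nine columns, as GFF3 specifies (the tabs are not visible in the pasted text above), and that multi-valued attributes are comma-separated. `parse_gff3_line` is a hypothetical helper, not code from pynmdc.

```python
from urllib.parse import unquote

# Names for the first eight GFF3 columns; column nine holds the attributes.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one tab-delimited GFF3 feature line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for part in fields[8].split(";"):
        if not part:
            continue
        key, _, value = part.partition("=")
        # Each attribute becomes a list of values, with percent-escapes decoded.
        attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# One of the lines above, rebuilt with explicit tabs (attribute list shortened).
example = "\t".join([
    "Ga0185794_01", "GeneMark.hmm-2 v1.05", "CDS", "3511", "4266", "32.58", "+", "0",
    "ID=Ga0185794_01_3511_4266;product_source=KO:K02886;ko=KO:K02886;pfam=PF00181,PF03947",
])
rec = parse_gff3_line(example)
print(rec["seqid"], rec["start"], rec["end"], rec["attributes"]["pfam"])
```

Splitting on tabs rather than whitespace matters here because the source column ("GeneMark.hmm-2 v1.05") contains a space.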
@cmungall Here are a few of the 138661 lines of 1781_1000325_functional_annotation.gff (from /global/project/projectdirs/m3408/ficus/pipeline_products/1781_100325/annotation/ ). This is one of the current Stegen metaG annotation workflow outputs.
1781_100325_scf_1000_c1 GeneMark.hmm-2 v1.05 CDS 3 1700 192.32 + 0 ID=1781_100325_scf_1000_c1_3_1700;translation_table=11;partial=5',3';product=PAS domain S-box-containing protein;product_source=TIGR00229;cath_funfam=3.30.450.20;cog=COG2202;pfam=GA,PA,PAS_;smart=55781,55785;superfamily=SM00086,SM00091;tigrfam=TIGR00229
1781_100325_scf_1001_c1 GeneMark.hmm-2 v1.05 CDS 82 573 13.74 + 0 ID=1781_100325_scf_1001_c1_82_573;translation_table=11;product=uncharacterized membrane protein;product_source=COG2237;cog=COG2237
1781_100325_scf_1001_c1 GeneMark.hmm-2 v1.05 CDS 859 1671 32.29 - 0 ID=1781_100325_scf_1001_c1_859_1671;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied
1781_100325_scf_1002_c1 GeneMark.hmm-2 v1.05 CDS 1 99 0.46 + 0 ID=1781_100325_scf_1002_c1_1_99;translation_table=11;partial=5';product=large subunit ribosomal protein L18;product_source=KO:K02881;ko=KO:K02881
1781_100325_scf_1002_c1 GeneMark.hmm-2 v1.05 CDS 96 731 37.60 + 0 ID=1781_100325_scf_1002_c1_96_731;translation_table=11;product=small subunit ribosomal protein S5;product_source=KO:K02988;cath_funfam=3.30.160.20,3.30.230.10;cog=COG0098;ko=KO:K02988;pfam=Ribosomal_S,Ribosomal_S5_;smart=54211,54768;tigrfam=TIGR01020
1781_100325_scf_1002_c1 GeneMark.hmm-2 v1.05 CDS 765 1205 25.99 + 0 ID=1781_100325_scf_1002_c1_765_1205;translation_table=11;product=large subunit ribosomal protein L30;product_source=KO:K02907;cath_funfam=3.30.1390.20;cog=COG1841;ko=KO:K02907;pfam=Ribosomal_L3;smart=55129;tigrfam=TIGR01309
1781_100325_scf_1002_c1 GeneMark.hmm-2 v1.05 CDS 1207 1629 20.09 + 0 ID=1781_100325_scf_1002_c1_1207_1629;translation_table=11;product=large subunit ribosomal protein L15;product_source=KO:K02876;cath_funfam=3.100.10.10,4.10.990.10;cog=COG0200;ko=KO:K02876;pfam=Ribosomal_L27;smart=52080
1781_100325_scf_1002_c1 Prodigal v2.6.3 CDS 1626 1700 -4.7 + 0 ID=1781_100325_scf_1002_c1_1626_1700;translation_table=11;partial=3';product=hypothetical protein;product_source=Hypo-rule applied
1781_100325_scf_1003_c1 GeneMark.hmm-2 v1.05 CDS 1 543 44.68 - 0 ID=1781_100325_scf_1003_c1_1_543;translation_table=11;partial=3';product=drug/metabolite transporter (DMT)-like permease;product_source=COG0697;cog=COG0697;pfam=Eam;smart=103481
1781_100325_scf_1003_c1 GeneMark.hmm-2 v1.05 CDS 597 1697 97.45 + 0 ID=1781_100325_scf_1003_c1_597_1697;translation_table=11;partial=3';product=phosphoribosylformylglycinamidine synthase;product_source=KO:K01952;cath_funfam=3.30.1330.10,3.90.650.10;cog=COG0046;ko=KO:K01952;ec_number=EC:6.3.5.3;pfam=AIR,AIRS_;smart=55326,56042;tigrfam=TIGR01736
1781_100325_scf_1004_c1 GeneMark.hmm-2 v1.05 CDS 1 255 8.85 - 0 ID=1781_100325_scf_1004_c1_1_255;translation_table=11;partial=3';product=predicted RNA-binding protein with TRAM domain;product_source=COG3269;cath_funfam=2.40.50.140;cog=COG3269;pfam=TRA;smart=50249
1781_100325_scf_1004_c1 GeneMark.hmm-2 v1.05 CDS 313 1146 46.17 - 0 ID=1781_100325_scf_1004_c1_313_1146;translation_table=11;product=aspartate dehydrogenase;product_source=KO:K06989;cath_funfam=3.30.360.10,3.40.50.720;cog=COG1712;ko=KO:K06989;ec_number=EC:1.4.1.21;pfam=DUF10,DapB_,NAD_binding_;smart=51735,55347;tigrfam=TIGR03855
1781_100325_scf_1004_c1 GeneMark.hmm-2 v1.05 CDS 1251 1433 2.26 - 0 ID=1781_100325_scf_1004_c1_1251_1433;translation_table=11;product=small subunit ribosomal protein S30e;product_source=KO:K02983;cog=COG4919;ko=KO:K02983;pfam=Ribosomal_S3
1781_100325_scf_1005_c1 GeneMark.hmm-2 v1.05 CDS 2 829 71.94 - 0 ID=1781_100325_scf_1005_c1_2_829;translation_table=11;partial=3';product=DNA polymerase-4;product_source=KO:K02346;cath_funfam=1.10.150.20,3.30.70.270;cog=COG0389;ko=KO:K02346;ec_number=EC:2.7.7.7;pfam=IM;smart=56672;superfamily=SM00278
1781_100325_scf_1005_c1 GeneMark.hmm-2 v1.05 CDS 877 1473 59.07 - 0 ID=1781_100325_scf_1005_c1_877_1473;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied;cath_funfam=1.20.5.100;smart=51735
1781_100325_scf_1006_c1 GeneMark.hmm-2 v1.05 CDS 2 730 86.59 - 0 ID=1781_100325_scf_1006_c1_2_730;translation_table=11;partial=3';product=cysteinyl-tRNA synthetase;product_source=KO:K01883;cath_funfam=3.40.50.620;cog=COG0215;ko=KO:K01883;ec_number=EC:6.1.1.16;pfam=tRNA-synt_1;smart=52374
1781_100325_scf_1006_c1 GeneMark.hmm-2 v1.05 CDS 761 1696 110.06 - 0 ID=1781_100325_scf_1006_c1_761_1696;translation_table=11;partial=5';product=ATP-dependent Zn protease;product_source=COG0465;cath_funfam=1.10.8.60;cog=COG0465;pfam=Peptidase_M4;smart=140990
1781_100325_scf_1007_c1 GeneMark.hmm-2 v1.05 CDS 264 1550 115.96 - 0 ID=1781_100325_scf_1007_c1_264_1550;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied;pfam=DDE_Tnp_,DDE_Tnp_1_;smart=53098
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05 CDS 9 113 5.74 + 0 ID=1781_100325_scf_1008_c1_9_113;translation_table=11;product=elongation factor P;product_source=KO:K02356;cath_funfam=2.40.50.140;ko=KO:K02356;pfam=Elong-fact-P_;smart=50249;superfamily=SM00841
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05 CDS 118 546 41.51 + 0 ID=1781_100325_scf_1008_c1_118_546;translation_table=11;product=N utilization substance protein B;product_source=KO:K03625;cath_funfam=1.10.940.10;cog=COG0781;ko=KO:K03625;pfam=Nus;smart=48013;tigrfam=TIGR01951
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05 CDS 554 760 24.98 - 0 ID=1781_100325_scf_1008_c1_554_760;translation_table=11;product=sec-independent protein translocase protein TatA;product_source=KO:K03116;cog=COG1826;ko=KO:K03116;pfam=MttA_Hcf10;tigrfam=TIGR01411
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05 CDS 996 1562 81.24 + 0 ID=1781_100325_scf_1008_c1_996_1562;translation_table=11;product=pyrimidine operon attenuation protein/uracil phosphoribosyltransferase;product_source=KO:K02825;cath_funfam=3.40.50.2020;cog=COG2065;ko=KO:K02825;ec_number=EC:2.4.2.9;pfam=Pribosyltra;smart=53271
1781_100325_scf_1008_c1 GeneMark.hmm-2 v1.05 CDS 1559 1696 8.04 + 0 ID=1781_100325_scf_1008_c1_1559_1696;translation_table=11;partial=3';product=aspartate carbamoyltransferase catalytic subunit;product_source=KO:K00609;cath_funfam=3.40.50.1370;cog=COG0540;ko=KO:K00609;ec_number=EC:2.1.3.2;smart=53671
1781_100325_scf_1009_c1 GeneMark.hmm-2 v1.05 CDS 97 582 43.72 + 0 ID=1781_100325_scf_1009_c1_97_582;translation_table=11;product=HEAT repeat protein;product_source=COG1413;cath_funfam=1.25.10.20;cog=COG1413;pfam=HEAT_;smart=48371
1781_100325_scf_1009_c1 Prodigal v2.6.3 CDS 579 749 8.4 - 0 ID=1781_100325_scf_1009_c1_579_749;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied;smart=57802
1781_100325_scf_1009_c1 GeneMark.hmm-2 v1.05 CDS 746 1693 36.99 - 0 ID=1781_100325_scf_1009_c1_746_1693;translation_table=11;product=integrase/recombinase XerD;product_source=KO:K04763;cath_funfam=1.10.150.130,1.10.443.10;cog=COG4974;ko=KO:K04763;pfam=Phage_int_SAM_,Phage_integras;smart=56349
1781_100325_scf_100_c1 GeneMark.hmm-2 v1.05 CDS 2 397 16.15 + 0 ID=1781_100325_scf_100_c1_2_397;translation_table=11;partial=5';product=two-component system nitrogen regulation response regulator GlnG;product_source=KO:K07712;cath_funfam=3.40.50.2300;cog=COG3437;ko=KO:K07712;pfam=Response_re;smart=52172
1781_100325_scf_100_c1 GeneMark.hmm-2 v1.05 CDS 670 2514 85.48 + 0 ID=1781_100325_scf_100_c1_670_2514;translation_table=11;product=signal transduction histidine kinase;product_source=COG0642;cath_funfam=1.10.287.130,2.60.15.10,3.30.565.10;cog=COG0642;pfam=HATPase_,HisK,dCache_;smart=103190,55021,55874;superfamily=SM00387,SM00388
1781_100325_scf_100_c1 GeneMark.hmm-2 v1.05 CDS 2522 3418 30.63 + 0 ID=1781_100325_scf_100_c1_2522_3418;translation_table=11;product=hypothetical protein;product_source=Hypo-rule applied;smart=81342
1781_100325_scf_100_c1 GeneMark.hmm-2 v1.05 CDS 3598 3972 11.31 - 0 ID=1781_100325_scf_100_c1_3598_3972;translation_table=11;partial=5';product=DNA-binding beta-propeller fold protein YncE;product_source=COG3391;cath_funfam=2.120.10.30;cog=COG3391;pfam=DUF512,NH;smart=101898
1781_100325_scf_1010_c1 GeneMark.hmm-2 v1.05 CDS 3 815 59.75 - 0 ID=1781_100325_scf_1010_c1_3_815;translation_table=11;partial=3';product=spermidine/putrescine transport system permease protein;product_source=KO:K11070;cath_funfam=1.10.3720.10;cog=COG1177;ko=KO:K11070;pfam=BPD_transp_;smart=161098
1781_100325_scf_1010_c1 GeneMark.hmm-2 v1.05 CDS 812 1696 92.29 - 0 ID=1781_100325_scf_1010_c1_812_1696;translation_table=11;partial=5';product=spermidine/putrescine transport system permease protein;product_source=KO:K11071;cath_funfam=1.10.3720.10;cog=COG1176;ko=KO:K11071;pfam=BPD_transp_;smart=161098
@hubin-keio has provided examples here: https://github.com/microbiomedata/pynmdc/tree/main/src/nmdc/test_data
Can you (@cmungall) provide a complete JSON version of the original example (Ga0185794_41)? I have pulled 1000 lines of a gff file from an early run of the annotation workflow and it is available here: https://github.com/microbiomedata/pynmdc/tree/main/src/nmdc/test_data/MetaG_annotation
I am still working on the converter. The unfinished version is here: https://github.com/microbiomedata/pynmdc
I would like to see a standard JSON output example before I finalize the converter. Thanks.
@hubin-keio Quick observations:
test_data/MetaG_annotation/1781_100325_fa.json
should be separated by a comma. See https://github.com/microbiomedata/pynmdc/blob/main/src/nmdc/test_data/MetaG_annotation/1781_100325_fa.json#L37
For some examples of the JSON, see here:
I added Deepak's examples to the repo in the examples folder: https://github.com/microbiomedata/nmdc-metadata/tree/master/examples
(note we also validate against all examples in this folder as unit tests and within GitHub/Travis CI)
Thanks for the comments. @deepakunni3, is your parser working? I have committed the last planned update before GSP this morning.
The "was_generated_by": "N/A" field is still there in your examples. Maybe you want to remove it in your code?
Yes, the "N/A" was a placeholder to remind us that this information is missing and needs to be incorporated. Will remove from the script.
In the discussions in the Aim1_standards channel it was mentioned on 1/9: "yes, never use values like "N/A", always make it an explicit json null, or simply omit the key altogether." But I am fine with your parser solution as long as it is okay with Aims 1 and 3. Please post the location of your parser in the Aim 2 channel once it is done so that we can process the GFFs. Aim 3 needs the JSONs ready by this Friday (1/15).
I am not sure what the expectation here is between your pynmdc converter vs my GFF3 converter.
Perhaps we can talk more on the technical call today.
Regarding the "N/A", thanks for clarifying. That makes sense. I can replace that with null
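A small sketch of both options (omit the key, or emit an explicit JSON null), assuming the records are plain Python dicts before serialization. `clean_placeholders` is a hypothetical helper name, and "N/A" is the only placeholder string it looks for.

```python
import json

PLACEHOLDER = "N/A"  # the only placeholder value mentioned in this thread


def clean_placeholders(obj, use_null=False):
    """Recursively drop keys whose value is the placeholder, or set them to None."""
    if isinstance(obj, dict):
        cleaned = {}
        for key, value in obj.items():
            if value == PLACEHOLDER:
                if use_null:
                    cleaned[key] = None  # serialized by json.dumps as null
                continue  # otherwise the key is omitted
            cleaned[key] = clean_placeholders(value, use_null)
        return cleaned
    if isinstance(obj, list):
        return [clean_placeholders(item, use_null) for item in obj]
    return obj


record = {"id": "NMDC:example", "was_generated_by": "N/A"}
print(json.dumps(clean_placeholders(record)))
print(json.dumps(clean_placeholders(record, use_null=True)))
```

The first call omits the key entirely; the second keeps it with a JSON null.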
the was_generated_by field should link to the MetagenomeAnnotation activity
we will better document this in the schema
(this answers @scanon's Q on the tech sync call)
The schema has inline docs detailing the mapping but we should provide a higher level guide. I will sketch it out in this ticket and then this can be turned into docs on the site. I'm doing this quickly so if anything is confusing, it's likely I made a mistake. I will also give examples in yaml but the json cognate should be obvious.
Example
This GFF line represents the output of structural annotation (the prediction of a CDS on sequence Ga0185794_41). This is given a protein ID (skolemized from reference + coordinates), Ga0185794_41_48_1037. Col9 represents the outputs of functional annotation. @scanon @hubin-keio do I have this right?
The core feature would be represented as an instance of https://microbiomedata.github.io/nmdc-metadata/docs/GenomeFeature
so our initial object would look like:
Note all IDs are prefixed to conform to the NMDC identifiers standards doc
Note the 'encodes' field, to link to a GeneProduct (this will always be a Protein for the current pipeline, but in future we may have ncRNA annotations)
https://microbiomedata.github.io/nmdc-metadata/docs/GeneProduct
Currently the GeneProduct field is fairly bare, but in future additional fields could be added - e.g. AA seq. The GeneProduct is what functional annotations are attached to
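For anyone building these with plain Python dicts, here is a rough sketch of the core feature object. The field names (seqid, start, end, strand, type, encodes), the SO ID, and the NMDC: prefix follow the GenomeFeature snippet posted earlier in this thread; `build_genome_feature` itself is a hypothetical helper, and the exact prefix form should be checked against the identifiers standards doc.

```python
import json

CDS_TYPE = "SO:0000316"  # SO ID for CDS, as given earlier in this thread


def build_genome_feature(seqid, start, end, strand, prefix="NMDC"):
    """Build a GenomeFeature-like dict for one predicted CDS."""
    # The protein ID is skolemized from the reference sequence plus coordinates.
    protein_id = f"{seqid}_{start}_{end}"
    return {
        "seqid": f"{prefix}:{seqid}",
        "start": start,
        "end": end,
        "strand": strand,
        "type": CDS_TYPE,
        "encodes": f"{prefix}:{protein_id}",
    }


# Values taken from one of the sample GFF lines posted above.
feature = build_genome_feature("Ga0185794_01", 3511, 4266, "+")
print(json.dumps(feature, indent=2))
```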
https://microbiomedata.github.io/nmdc-metadata/docs/FunctionalAnnotation
Each ';'-separated section in col9 of the GFF would correspond to a separate annotation. The annotation links the gene product to the controlled term.
Minimally this looks like:
For people familiar with the GO annotation system, each entry here represents a line in GAF or GPAD format files.
There would be one entry for each annotation. Above I am only showing the KO annotation
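A hedged sketch of generating one entry per annotation from col9, assuming the attributes have already been parsed into the dict-of-lists form shown earlier in the thread. The field names `subject` and `has_function` are assumptions to be checked against the FunctionalAnnotation docs, as is the `ANNOTATION_PREFIXES` table. superfamily is left out because its prefix is still an open question above.

```python
# Assumed mapping from col9 attribute keys to canonical ID prefixes.
# Check each entry against id_prefixes in the schema before relying on it.
ANNOTATION_PREFIXES = {"ko": "KEGG.ORTHOLOGY", "pfam": "PFAM", "tigrfam": "TIGRFAM"}


def build_annotations(protein_id, attributes):
    """Return one annotation dict per functional term found in col9."""
    annotations = []
    for key, prefix in ANNOTATION_PREFIXES.items():
        for value in attributes.get(key, []):
            # Drop any existing short prefix such as 'KO:' before re-prefixing.
            local_id = value.split(":", 1)[-1]
            annotations.append({
                "subject": protein_id,  # assumed field name
                "has_function": f"{prefix}:{local_id}",  # assumed field name
            })
    return annotations


# Attribute values copied from the Ga0185794_41_48_1037 example in this thread.
attrs = {"ko": ["KO:K12960"], "pfam": ["PF01979"], "cog": ["COG0402"]}
for annotation in build_annotations("NMDC:Ga0185794_41_48_1037", attrs):
    print(annotation)
```

Keys that are not in the table (cog here) are simply skipped, so unsupported systems never produce partial entries.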
Note the string KO:K12960 is a key to a controlled term object. An example would be:
This is a minimal representation of the KEGG KO object. It can be linked to other controlled term objects. For example, it can include mappings to other systems (EC), parent/child hierarchies, links to pathways, etc., and these can all be traced to compounds (chemical entities). However, we will only support simple KO search for GSP, so I will not detail that here. Please refer to #176 for using pathway knowledge to implement more advanced search.
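To illustrate the "key to a controlled term object" idea, here is a minimal sketch using a dict keyed by term ID. The `id` and `name` fields and the `has_function` key are assumptions, the name is taken from the product field of the example record, and the ID is written with the KEGG.ORTHOLOGY prefix recommended earlier in this thread.

```python
# Hypothetical index of controlled term objects, keyed by their ID.
term_index = {
    "KEGG.ORTHOLOGY:K12960": {
        "id": "KEGG.ORTHOLOGY:K12960",
        "name": "5-methylthioadenosine/S-adenosylhomocysteine deaminase",
    },
}


def resolve_function(annotation, index):
    """Look up the controlled term an annotation points to, or None if unknown."""
    return index.get(annotation["has_function"])


annotation = {
    "subject": "NMDC:Ga0185794_41_48_1037",
    "has_function": "KEGG.ORTHOLOGY:K12960",
}
term = resolve_function(annotation, term_index)
print(term["name"] if term else "unknown term")
```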
Annotations to other systems (e.g. Pfam) are handled analogously. Please see the id_prefixes field in the schema to see the canonical ID prefix for each system.
TODO: document how the feature connects to the metaG/T output
to be discussed