ncbo / BioPortal-to-KGX

Assemble a BioPortal Knowledge Graph
BSD 3-Clause "New" or "Revised" License

Missing IRIs and metadata from many transforms #66

Open caufieldjh opened 2 years ago

caufieldjh commented 2 years ago

Many transforms appear to be missing descriptions, IRIs, and possibly other fields populated in the previous set of transforms. Will need to verify the JSON -> TSV step is populating fields as expected, particularly name and description.
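One way to run that check is to count blank values per column in a transformed nodes file. The following is a minimal sketch, not part of the BioPortal-to-KGX code: the function name `count_empty_fields` and the inline sample rows are illustrative assumptions, and a column that is absent from the header is counted as empty for every row.

```python
import csv
import io
from collections import Counter


def count_empty_fields(tsv_file, columns=("name", "description", "iri")):
    """Count rows in a KGX-style nodes TSV where each column is blank or absent.

    Hypothetical helper for illustration; not part of KGX or this repo.
    """
    # QUOTE_NONE so quote characters inside descriptions are read literally.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    empty = Counter({col: 0 for col in columns})
    total = 0
    for row in reader:
        total += 1
        for col in columns:
            # row.get() returns None when the column is missing entirely.
            if not (row.get(col) or "").strip():
                empty[col] += 1
    return total, dict(empty)


# Illustrative sample only; real input would be an open *_nodes.tsv file.
sample = io.StringIO(
    "id\tcategory\tname\tdescription\tprovided_by\n"
    "ICD10PCS:0WJ34Z\tbiolink:Procedure\t\t\tBioPortal\n"
    "BFO:0000019\tbiolink:OntologyClass\tquality\t\tBioPortal\n"
)
total, empty = count_empty_fields(sample)
print(total, empty)
```

Running this over each nodes file from the old and new transforms would show which fields stopped being populated.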

caufieldjh commented 2 years ago

Example with ICD10PCS.

Previously:

id  category    name    provided_by aggregator_knowledge_source iri object  predicate   primary_knowledge_source    relation    same_as subject
ICD10PCS:0WJ34Z biolink:Procedure|biolink:OntologyClass     BioPortal       http://purl.bioontology.org/ontology/ICD10PCS/0WJ34Z
ICD10PCS:079430Z    biolink:Procedure|biolink:OntologyClass     BioPortal       http://purl.bioontology.org/ontology/ICD10PCS/079430Z
ICD10PCS:0FPD4KZ    biolink:Procedure|biolink:OntologyClass     BioPortal       http://purl.bioontology.org/ontology/ICD10PCS/0FPD4KZ
ICD10PCS:2W3HX3Z    biolink:Procedure|biolink:OntologyClass     BioPortal       http://purl.bioontology.org/ontology/ICD10PCS/2W3HX3Z
ICD10PCS:2W56X1Z    biolink:Procedure|biolink:OntologyClass     BioPortal       http://purl.bioontology.org/ontology/ICD10PCS/2W56X1Z
ICD10PCS:01QC3ZZ    biolink:Procedure|biolink:OntologyClass     BioPortal       http://purl.bioontology.org/ontology/ICD10PCS/01QC3ZZ
ICD10PCS:2W0MX7Z    biolink:Procedure|biolink:OntologyClass     BioPortal       http://purl.bioontology.org/ontology/ICD10PCS/2W0MX7Z
ICD10PCS:0SJL3Z biolink:Procedure|biolink:OntologyClass     BioPortal       http://purl.bioontology.org/ontology/ICD10PCS/0SJL3Z

Currently:

$ head transformed/ontologies/ICD10PCS/ICD10PCS_21_nodes.tsv 
id      category        name    description     provided_by
ICD10PCS:0WJ34Z biolink:Procedure                       BioPortal
ICD10PCS:079430Z        biolink:Procedure                       BioPortal
ICD10PCS:0FPD4KZ        biolink:Procedure                       BioPortal
ICD10PCS:2W3HX3Z        biolink:Procedure                       BioPortal
ICD10PCS:2W56X1Z        biolink:Procedure                       BioPortal
ICD10PCS:01QC3ZZ        biolink:Procedure                       BioPortal
ICD10PCS:2W0MX7Z        biolink:Procedure                       BioPortal
ICD10PCS:0SJL3Z biolink:Procedure                       BioPortal
ICD10PCS:2W6CX0Z        biolink:Procedure                       BioPortal

caufieldjh commented 2 years ago

This may also be a good juncture to see if the values added to edgefiles in primary_knowledge_source can be used in the nodelists too.

caufieldjh commented 2 years ago

Another example, with BFO.

Previous transform:

id  category    name    description provided_by aggregator_knowledge_source iri object  predicate   primary_knowledge_source    relation    same_as subject
BFO:0000019 biolink:OntologyClass   quality     BioPortal       http://purl.obolibrary.org/obo/BFO_0000019
BFO:0000015 biolink:OntologyClass   process p is a process = Def. p is an occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t. (axiom label in BFO2 Reference: [083-003])    BioPortal       http://purl.obolibrary.org/obo/BFO_0000015
BFO:0000016 biolink:OntologyClass   disposition     BioPortal       http://purl.obolibrary.org/obo/BFO_0000016
BFO:0000017 biolink:OntologyClass   realizable entity       BioPortal       http://purl.obolibrary.org/obo/BFO_0000017
BFO:0000018 biolink:OntologyClass   zero-dimensional spatial region     BioPortal       http://purl.obolibrary.org/obo/BFO_0000018
BFO:0000011 biolink:OntologyClass   spatiotemporal region       BioPortal       http://purl.obolibrary.org/obo/BFO_0000011
IAO:0000116 biolink:OntologyClass   editor note     BioPortal       http://purl.obolibrary.org/obo/IAO_0000116
IAO:0000117 biolink:OntologyClass   term editor     BioPortal       http://purl.obolibrary.org/obo/IAO_0000117
BFO:0000134 biolink:OntologyClass           BioPortal       http://purl.obolibrary.org/obo/BFO_0000134
BFO:0000179 biolink:OntologyClass   BFO OWL specification label Relates an entity in the ontology to the name of the variable that is used to represent it in the code that generates the BFO OWL file from the lispy specification.    BioPortal       http://purl.obolibrary.org/obo/BFO_0000179
IAO:0000115 biolink:OntologyClass   definition      BioPortal       http://purl.obolibrary.org/obo/IAO_0000115
IAO:0000112 biolink:OntologyClass   example of usage        BioPortal       http://purl.obolibrary.org/obo/IAO_0000112
IAO:0000111 biolink:OntologyClass   editor preferred term       BioPortal       http://purl.obolibrary.org/obo/IAO_0000111
IAO:0000232 biolink:OntologyClass   curator note        BioPortal       http://purl.obolibrary.org/obo/IAO_0000232
BFO:0000008 biolink:OntologyClass   temporal region     BioPortal       http://purl.obolibrary.org/obo/BFO_0000008

Current transform:

id  category    name    description provided_by
BFO:0000019 biolink:OntologyClass   quality     Basic Formal Ontology
BFO:0000015 biolink:OntologyClass   process p is a process = Def. p is an occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t. (axiom label in BFO2 Reference: [083-003])    Basic Formal Ontology
BFO:0000016 biolink:OntologyClass   disposition     Basic Formal Ontology
BFO:0000017 biolink:OntologyClass   realizable entity       Basic Formal Ontology
BFO:0000018 biolink:OntologyClass   zero-dimensional spatial region     Basic Formal Ontology
BFO:0000011 biolink:OntologyClass   spatiotemporal region       Basic Formal Ontology
IAO:0000116 biolink:OntologyClass   editor note     Basic Formal Ontology
IAO:0000117 biolink:OntologyClass   term editor     Basic Formal Ontology
BFO:0000134 biolink:OntologyClass           Basic Formal Ontology
BFO:0000179 biolink:OntologyClass   BFO OWL specification label Relates an entity in the ontology to the name of the variable that is used to represent it in the code that generates the BFO OWL file from the lispy specification.    Basic Formal Ontology
IAO:0000115 biolink:OntologyClass   definition      Basic Formal Ontology
IAO:0000112 biolink:OntologyClass   example of usage        Basic Formal Ontology
IAO:0000111 biolink:OntologyClass   editor preferred term       Basic Formal Ontology
IAO:0000232 biolink:OntologyClass   curator note        Basic Formal Ontology
BFO:0000008 biolink:OntologyClass   temporal region     Basic Formal Ontology

The name field is still populated, which is good. However, provided_by is now the name of the ontology instead of the aggregator knowledge source (probably also fine, but it should include the version too), the extra headings are different (an improvement, and perhaps something KGX is doing?), and iri isn't there at all. I would really prefer to have IRIs present so nodes can be mapped back to their source BioPortal ontologies.
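Until the iri column is restored, one possible stopgap is to rebuild IRIs from node CURIEs. This is only a sketch under stated assumptions: the prefix map below is hypothetical, inferred solely from the IRIs shown in the previous transforms above, and `curie_to_iri` is an illustrative helper rather than anything in KGX or this repo. Other ontologies may use different IRI patterns, so unknown prefixes return None instead of a guess.

```python
# Hypothetical prefix map, inferred only from the IRIs shown in this issue.
PREFIX_TO_BASE = {
    "ICD10PCS": "http://purl.bioontology.org/ontology/ICD10PCS/",
    "BFO": "http://purl.obolibrary.org/obo/BFO_",
    "IAO": "http://purl.obolibrary.org/obo/IAO_",
}


def curie_to_iri(curie, prefix_map=PREFIX_TO_BASE):
    """Expand a CURIE such as 'BFO:0000019' to an IRI, or return None.

    Illustrative helper; returns None for malformed CURIEs or unknown prefixes.
    """
    prefix, sep, local_id = curie.partition(":")
    base = prefix_map.get(prefix)
    if not sep or base is None:
        return None
    return base + local_id


for node_id in ("BFO:0000019", "ICD10PCS:0WJ34Z", "CHEBI:25698"):
    print(node_id, curie_to_iri(node_id))
```

This only recovers IRIs for prefixes with a known, regular pattern, so it does not replace having the transform emit the original IRI.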

caufieldjh commented 2 years ago

This may be due to a difference in bmt or in Biolink Model itself.

caufieldjh commented 2 years ago

Here's one confirmed difference: if I run a transform like the following

    kgx.cli.transform(
        inputs=[repaired_outpath],
        input_format='obojson',
        output=outpath,
        output_format='tsv',
        stream=True,
        knowledge_sources=[
            ("aggregator_knowledge_source", "BioPortal"),
            ("primary_knowledge_source", primary_knowledge_source),
        ],
    )

then aggregator_knowledge_source is not added to the node or edge file. primary_knowledge_source is added to the edgefile, but the corresponding values are included under provided_by.

caufieldjh commented 2 years ago

This isn't really a blocker - the transforms should merge perfectly well without IRIs present - so if it's related to kgx or bmt then perhaps it can be solved as part of the kg-bioportal merge.

caufieldjh commented 1 year ago

Metadata is missing in new transforms; provided_by is back to providing only the source file name. Example from ODNAE:

id  category    name    description provided_by
CHEBI:25698 biolink:ChemicalSubstance   ether   A compound ROR (where R is not H).  ODNAE_3_relaxed.json
GO:0010646  biolink:BiologicalProcess   regulation of cell communication    Any process that modulates the frequency, rate or extent of cell communication. Cell communication is the process that mediates interactions between a cell and its surroundings. Encompasses interactions such as signaling or attachment between one cell and another cell, between a cell and an extracellular matrix, or between a cell and any other aspect of its environment.    ODNAE_3_relaxed.json
GO:0010647  biolink:BiologicalProcess   positive regulation of cell communication   Any process that increases the frequency, rate or extent of cell communication. Cell communication is the process that mediates interactions between a cell and its surroundings. Encompasses interactions such as signaling or attachment between one cell and another cell, between a cell and an extracellular matrix, or between a cell and any other aspect of its environment.    ODNAE_3_relaxed.json
ODNAE:0000100   biolink:NamedThing  zidovudine (Retrovir)-associated neuropathy AE      ODNAE_3_relaxed.json
DRON:00021698   biolink:Drug    Disulfiram Oral Tablet      ODNAE_3_relaxed.json

Will make this its own issue because I think I have a solution.