Add NER method to suggest ENVO triad from description

microbiomedata / sample-annotator

NMDC Sample Annotator

https://microbiomedata.github.io/sample-annotator/static/intro.html


Add NER method to suggest ENVO triad from description #3

Open cmungall opened 3 years ago

cmungall commented 3 years ago

https://github.com/cmungall/sample-annotator/tree/main/sample_annotator/text_mining

To start with, parse sample['description'] to populate sample['env_{broad_scale,local_scale,medium}'] wherever those fields are not already populated

I think this should be done by calling runner, but that will need a PyPI release: https://github.com/monarch-initiative/runner/issues/9

Or is it easier to just wrap OGER directly for now?

Also, for now we could just check in the nodes.tsv directly. See how we include mixs.json within the package

For now, be conservative and only use labels or exact synonyms
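
A minimal sketch of that conservative first pass, assuming plain whole-phrase string matching in place of OGER or runner and a tiny in-memory table in place of nodes.tsv. The names `TERMS`, `find_terms` and `suggest_env_terms` are hypothetical, and the term entries (including the synonym) are illustrative and should be checked against ENVO. Since this comment does not say how a match is assigned to one of the three fields, the sketch simply offers every match as a candidate for each field that is still empty.

```python
import re

TRIAD_FIELDS = ("env_broad_scale", "env_local_scale", "env_medium")

# Illustrative stand-in for nodes.tsv: (CURIE, label, exact synonyms).
# Entries are examples only; verify them against ENVO before relying on them.
TERMS = [
    ("ENVO:00001998", "soil", []),
    ("ENVO:00002149", "sea water", ["seawater"]),
    ("ENVO:00002007", "sediment", []),
]


def find_terms(text, terms=TERMS):
    """Return (curie, label) pairs whose label or exact synonym occurs in text.

    Matching is case-insensitive and on whole words/phrases only, so nothing
    fuzzier than a label or an exact synonym can produce a hit.
    """
    hits = []
    for curie, label, synonyms in terms:
        for name in [label, *synonyms]:
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append((curie, label))
                break  # one hit per term is enough
    return hits


def suggest_env_terms(sample, terms=TERMS):
    """Suggest candidates for each triad field that is not already populated."""
    hits = find_terms(sample.get("description") or "", terms)
    if not hits:
        return {}
    return {field: hits for field in TRIAD_FIELDS if not sample.get(field)}


sample = {
    "description": "Seawater collected above sandy sediment",
    "env_medium": "ENVO:00002149",  # already populated, so left untouched
}
suggestions = suggest_env_terms(sample)
```

Here `suggestions` has entries for `env_broad_scale` and `env_local_scale` only, each listing the sea water and sediment matches; `env_medium` is skipped because the sample already carries a value.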

cmungall commented 3 years ago

As a first pass, just hardcode ENVO for all 3 fields regardless of package

Then for next pass, we will have a curated configuration file like this:

- field: env_broad_scale
  packages:
    - soil
  termsets:
    - ontology: envo
      branches:
        - ENVO:01000254  # environment system
      exclude_descendants_of:
        - ENVO:01001788  # marine ecosystem
- field: env_local_scale
  packages:
    - host-associated
  termsets:
    - ontology: UBERON
...

that will customize which ontologies are used where
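
A rough sketch of how such a configuration might be applied, assuming it has already been loaded into Python dicts and that the ontology is available as a simple child-to-parent map. The `PARENTS` table, the `EX:` identifiers and the function names are made up for illustration. The sketch also assumes that `branches` and `exclude_descendants_of` both include the named term itself, which the configuration above does not specify.

```python
# First entry of the configuration, as it might look once loaded.
CONFIG = [
    {
        "field": "env_broad_scale",
        "packages": ["soil"],
        "termsets": [
            {
                "ontology": "envo",
                "branches": ["ENVO:01000254"],
                "exclude_descendants_of": ["ENVO:01001788"],
            }
        ],
    },
]

# Toy is-a edges (child -> parent). The EX: terms are invented placeholders;
# the placement of ENVO:01001788 under ENVO:01000254 is assumed for the demo.
PARENTS = {
    "ENVO:01001788": "ENVO:01000254",
    "EX:terrestrial": "ENVO:01000254",
    "EX:forest": "EX:terrestrial",
    "EX:reef": "ENVO:01001788",
}


def lineage(term, parents=PARENTS):
    """Return the term together with all of its ancestors."""
    found = {term}
    while term in parents:
        term = parents[term]
        found.add(term)
    return found


def allowed_terms(field, package, config=CONFIG, parents=PARENTS):
    """Collect the terms that the configuration permits for a field and package."""
    candidates = set(parents) | set(parents.values())
    allowed = set()
    for entry in config:
        if entry["field"] != field or package not in entry.get("packages", []):
            continue
        for termset in entry["termsets"]:
            branches = set(termset.get("branches", []))
            excluded = set(termset.get("exclude_descendants_of", []))
            for term in candidates:
                line = lineage(term, parents)
                # Keep terms under a listed branch unless an excluded
                # term appears anywhere in their lineage.
                if line & branches and not line & excluded:
                    allowed.add(term)
    return allowed


soil_terms = allowed_terms("env_broad_scale", "soil")
```

With the toy hierarchy, `soil_terms` keeps the branch root and the two terrestrial placeholders and drops everything under the excluded term. A field and package pair with no matching entry yields an empty set.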

hrshdhgd commented 3 years ago

Just an FYI, OGER does not have a PyPI release either.

hrshdhgd commented 3 years ago

@cmungall, how do you envision the input file for NER: a TSV file within the project (local, i.e. ./text_mining/data/input) or one located remotely (URL)?

I'm guessing the input tsv (or db) will be generated by @turbomam through his parsing work from the large XML?

cmungall commented 3 years ago

I answered @hrshdhgd's questions in our 1-on-1. It's clear now that he doesn't have to worry about formats: the goal is to implement the functionality within the Python framework, where all you care about is the data model