petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Create activity dictionary from EssoilDB 1.0 #20

Closed petermr closed 4 years ago

petermr commented 4 years ago

Ambarish has created a list of activity terms from E1.0. These are (I think) in https://github.com/petermr/CEVOpen/blob/master/activitiesNew20190924.xml . The list contains many useful terms, but some are false positives and some are messy. Also, the Wikidata links frequently point to scientific articles rather than to useful primary definitions.

Action: PMR and possibly @mannyrules - Manually

The dictionary will be committed as dictionary/activity.xml

ambarishK commented 4 years ago

This is the activity dictionary created from EssoilDB 1.0 (the directories have been restructured):

https://github.com/petermr/CEVOpen/blob/master/dictionary/activity/raw/activitiesNew20190924.xml

petermr commented 4 years ago

From Manny

I’m at line number 355 of 524 lines of Activity types. (Some duplication because the data also includes specific Action Targets for some Activities— i.e., a separate line for any activity that has more than one target).

I realized as I went along that I needed to reformat some data in order to provide more distinct Activity names for Ambarish to look up and compare. Some of what Essoil treated as an Activity, I am forced to treat as an “Activity Class” with more specific, true or sub- “Activities” (which I’ve been calling ActionTypes) beneath it.

Long story short: It’s not as straightforward as Kingdom, phylum, class, order, family, genus, species. (Lucky botanists.)

I’m hoping that Ambarish+Wikidata can clean up a majority of this, but lacking a “one true way” to organize this stuff, someone is going to have to make a judgement call. Maybe you could “pay” some scholarly specialists with “credit” for their help.

I’ll be back at it tomorrow and give you a progress report. 

Manny

petermr commented 4 years ago

Manny, don't sweat it. The dictionaries do NOT include a classification. That is done by Wikidata. All we need is:

Optional

ambarishK commented 4 years ago

Hello sir! Please point me to the file (upon which cleaning is performed). This file has only 195 lines.

petermr commented 4 years ago

You mean CEVOpen https://github.com/petermr/CEVOpen/dictionary https://github.com/petermr/CEVOpen/tree/master/dictionary/activity https://github.com/petermr/CEVOpen/tree/master/dictionary/activity/raw https://github.com/petermr/CEVOpen/tree/master/dictionary/activity/raw/activitiesNew20190924.xml ? Just identify the Wikidata IDs that are false positives (e.g. those whose description is "scientific article"), then remove the Wikidata ID and the description from those entries, but keep the entries themselves.
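The cleaning step described above could be scripted roughly as follows. This is only a sketch: the thread does not show the layout of activitiesNew20190924.xml, so the `entry` element name, the `wikidataID` and `description` attribute names, the sample terms, and the placeholder IDs are all assumptions made for illustration, and real false positives may need more marker strings than "scientific article".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Element/attribute names and the Q-IDs below are
# placeholders assumed for illustration, not taken from the real dictionary.
SAMPLE = """<dictionary title="activity">
  <entry term="antibacterial" wikidataID="Q0000001"
         description="scientific article published in 2005"/>
  <entry term="antifungal" wikidataID="Q0000002"
         description="pharmacological activity against fungi"/>
  <entry term="insecticidal"/>
</dictionary>"""


def strip_false_positives(root, markers=("scientific article",)):
    """Drop the Wikidata ID and description from entries whose description
    contains one of the marker strings. The entry itself is kept.
    Returns the number of entries that were cleaned."""
    cleaned = 0
    for entry in root.iter("entry"):
        description = entry.get("description", "").lower()
        if any(marker in description for marker in markers):
            entry.attrib.pop("wikidataID", None)
            entry.attrib.pop("description", None)
            cleaned += 1
    return cleaned


root = ET.fromstring(SAMPLE)
n_cleaned = strip_false_positives(root)
entries = list(root.iter("entry"))
print(n_cleaned, "entries cleaned;", len(entries), "entries kept")
```

For the real file, `ET.parse(path)` followed by `tree.write(path, encoding="utf-8")` would replace the in-memory sample; the flagged entries are still worth checking by eye before committing.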

On Wed, Oct 2, 2019 at 10:57 AM Ambarish Kumar notifications@github.com wrote:

Hello sir! Please point me to the file (upon which cleaning is performed). This file https://github.com/petermr/CEVOpen/blob/master/dictionary/activity/raw/activitiesNew20190924.xml has only lines

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

ambarishK commented 4 years ago

OK sir.