petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities
Schema & Scraping helpers #9

Closed EmanuelFaria closed 4 years ago

EmanuelFaria commented 4 years ago

### Stuff to help us identify useful Text and Terms

EmanuelFaria commented 4 years ago

Name stems for drugs and chemicals: https://druginfo.nlm.nih.gov/drugportal/jsp/drugportal/DrugNameGenericStems.jsp

EmanuelFaria commented 4 years ago

Categories for drug activities (no ID numbers, though): https://druginfo.nlm.nih.gov/drugportal/drug/categories

petermr commented 4 years ago

On Thu, Aug 29, 2019 at 5:27 PM Emanuel Faria notifications@github.com wrote:

Name stems for drugs and chemicals

https://druginfo.nlm.nih.gov/drugportal/jsp/drugportal/DrugNameGenericStems.jsp

Very useful for generic drugs. Thanks. I need to mend the regex (regular expression) parser before we can use it.
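One way a stem list could feed a regex, as a minimal sketch. The five stems and class labels below are an illustrative subset typed in by hand (an assumption, not scraped from the Drug Portal page), and `find_stemmed_names` is a hypothetical helper, not part of the existing parser.

```python
import re

# Illustrative subset of generic-name suffix stems (assumed for this sketch;
# the full list would come from the Drug Portal stems page).
SUFFIX_STEMS = {
    "olol": "beta blocker",
    "pril": "ACE inhibitor",
    "vastatin": "HMG-CoA reductase inhibitor",
    "mab": "monoclonal antibody",
    "azepam": "antianxiety agent (diazepam type)",
}


def compile_stem_pattern(stems):
    """Build one regex that matches any word ending in one of the stems."""
    alternation = "|".join(re.escape(s) for s in sorted(stems, key=len, reverse=True))
    # At least one letter must precede the stem, so the bare stem is not a hit.
    return re.compile(rf"\b[a-z]+?(?P<stem>{alternation})\b", re.IGNORECASE)


STEM_PATTERN = compile_stem_pattern(SUFFIX_STEMS)


def find_stemmed_names(text):
    """Return (word, stem, drug class) for every word that ends in a known stem."""
    hits = []
    for match in STEM_PATTERN.finditer(text):
        stem = match.group("stem").lower()
        hits.append((match.group(0).lower(), stem, SUFFIX_STEMS[stem]))
    return hits
```

Short stems will produce false positives (for example, "April" ends in "pril"), so the hits are candidates to check against a dictionary rather than confirmed drug names.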

-- "I always retain copyright in my papers, and nothing in any contract I sign with any publisher will override that fact. You should do the same".

Peter Murray-Rust Reader Emeritus in Molecular Informatics Unilever Centre, Dept. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

petermr commented 4 years ago

Useful. We can hack activities out of this. Messy, though: the sort of thing to do while watching cricket.

EmanuelFaria commented 4 years ago

@petermr I found some more schema links. Let me know if you want me to keep posting these. I really don't know how to help you identify things we can connect to or pull from easily.

I googled "taxonomy of pharmalogical Activities" and ended up here: http://apps.who.int/medicinedocs/en/d/Js4895e/5.html

Maybe contacting this journal could help? https://www.tandfonline.com/doi/full/10.1080/13880209.2017.1323225

https://en.wikipedia.org/wiki/Semantic_Web

https://www.mediawiki.org/wiki/Wikibase/EntityData

https://schema.org/MedicalEnumeration https://schema.org/DrugClass https://schema.org/DietarySupplement

https://schema.org/docs/tree.jsonld https://schema.org/version/3.9/schema-all.html https://schema.org/docs/releases.html https://github.com/schemaorg/schemaorg/issues/2306

petermr commented 4 years ago

On Mon, Sep 2, 2019 at 9:16 PM Emanuel Faria notifications@github.com wrote:

@petermr I found some more schema links. Let me know if you want me to keep posting these. I really don't know how to help you identify things we can connect to or pull from easily.

Slow down on this! There are a million terms in UMLS/MeSH and we need about 1%. We'll talk. It is more important to see whether there is a consistent structure to the papers. That's what will be valuable. If we get sections on activity, constitution and plant, that's what we need. (I have used complex taxonomies in the past. They're sometimes useful, but a simple list of words in Wikipedia is the most valuable.) One problem with multiple taxonomies is that they don't map onto each other.
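A rough sketch of how one might check a paper for that structure, assuming the section headings have already been pulled out as plain strings. The keyword lists and the function names are hypothetical and would need tuning against real papers.

```python
# Hypothetical keyword lists for the three section types of interest;
# these are guesses for illustration, not a vetted vocabulary.
SECTION_KEYWORDS = {
    "activity": ("activity", "activities", "antimicrobial", "antioxidant", "bioassay"),
    "constitution": ("composition", "constituent", "chemical", "gc-ms"),
    "plant": ("plant material", "botanical", "collection"),
}


def classify_heading(heading):
    """Return the section types whose keywords appear in one heading."""
    lowered = heading.lower()
    return sorted(
        label
        for label, keywords in SECTION_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    )


def structure_profile(headings):
    """Report, per section type, whether any heading in the paper matches it."""
    found = set()
    for heading in headings:
        found.update(classify_heading(heading))
    return {label: label in found for label in SECTION_KEYWORDS}
```

Running `structure_profile` over many papers and counting how often all three values are true would give a first indication of how consistent the structure is.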

I googled "taxonomy of pharmalogical Activities" and ended up here: http://apps.who.int/medicinedocs/en/d/Js4895e/5.html

Maybe contacting this journal could help? https://www.tandfonline.com/doi/full/10.1080/13880209.2017.1323225

No

https://en.wikipedia.org/wiki/Semantic_Web

https://www.mediawiki.org/wiki/Wikibase/EntityData

schema.org is valuable and works closely with Wikidata.

https://schema.org/MedicalEnumeration https://schema.org/DrugClass https://schema.org/DietarySupplement

https://schema.org/docs/tree.jsonld https://schema.org/version/3.9/schema-all.html https://schema.org/docs/releases.html schemaorg/schemaorg#2306 https://github.com/schemaorg/schemaorg/issues/2306

The key thing is that hierarchies can be expanded or contracted according to granularity. Thus we may wish to search for "infective diseases" and have that automatically expanded to, say, 300 diseases or, contrariwise, find terms in text and want to know the general sort. I'll explain. But first we'll do it with plants, and I need your help.
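To make the expand/contract idea concrete, here is a toy sketch. The three-level hierarchy is invented for illustration (a real one would come from a source such as MeSH or Wikidata), and `expand` and `generalize` are hypothetical names.

```python
# Toy hierarchy, invented for illustration only: parent term -> child terms.
HIERARCHY = {
    "infective diseases": ["bacterial infections", "viral infections"],
    "bacterial infections": ["tuberculosis", "cholera"],
    "viral infections": ["influenza", "measles"],
}

# Reverse lookup: child term -> parent term.
PARENT = {
    child: parent
    for parent, children in HIERARCHY.items()
    for child in children
}


def expand(term):
    """Expand a broad term into all of its most specific descendants."""
    children = HIERARCHY.get(term)
    if not children:
        return [term]  # already a leaf (or unknown): nothing to expand
    leaves = []
    for child in children:
        leaves.extend(expand(child))
    return leaves


def generalize(term, levels=1):
    """Walk up the hierarchy by the given number of levels, stopping at the root."""
    current = term
    for _ in range(levels):
        if current not in PARENT:
            break
        current = PARENT[current]
    return current
```

A search for "infective diseases" would then be run as a search for every term in `expand("infective diseases")`, while a term found in text can be reported at a coarser level with `generalize`.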

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

EmanuelFaria commented 4 years ago

Whew. That’s great. You lead, I’ll follow.

petermr commented 4 years ago

We have 3200 diseases in ContentMine already. Some are not well organized for searching. I'm keen that we develop some simple machine learning (word2vec or TensorFlow or Keras) so you can recognize terms from their environment, e.g. "12 patients suffering from acne vulgaris, 10 from eczema and 9 from acute vogonitis".

You've never heard of vogonitis? Nor have I. I made it up, but you can tell it's a disease. This is called a Hearst Pattern and the software will pick it up. We can do the same for chemicals and plants.
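As a rule-based illustration of the idea (not the word2vec/TensorFlow approach mentioned above), a single hand-written pattern can already pull the candidate disease names out of that sentence. The cue phrase "suffering from" and the helper name `extract_candidates` are assumptions made for this sketch.

```python
import re

# Cue phrase that introduces a list of conditions (assumed for this sketch).
LEAD = re.compile(r"suffering from\s+(?P<rest>[^.]+)", re.IGNORECASE)
# List items are separated by commas or by "and".
ITEM_SPLIT = re.compile(r",\s*|\s+and\s+")
# Later items repeat the count, e.g. "10 from eczema".
COUNT_FROM = re.compile(r"^\d+\s+from\s+", re.IGNORECASE)


def extract_candidates(sentence):
    """Return the terms listed after 'suffering from', known or not."""
    match = LEAD.search(sentence)
    if not match:
        return []
    candidates = []
    for item in ITEM_SPLIT.split(match.group("rest")):
        term = COUNT_FROM.sub("", item.strip()).strip()
        if term:
            candidates.append(term)
    return candidates
```

Applied to the example sentence, this yields "acne vulgaris", "eczema" and "acute vogonitis": the made-up term is picked up purely from its position in the pattern. Disease names that themselves contain "and" or a comma would be split wrongly, which is one reason to treat the output as candidates only.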

EmanuelFaria commented 4 years ago

Vogonitis description: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS6ZhrKHEQZhfNFlx43aIGGjvZKIUs8smMKWEHCXpCA-rN0Y8-cjAlrBo5U

petermr commented 4 years ago

:-)
