petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Explore KNIME functionality on articles and CProjects #38

Open petermr opened 4 years ago

petermr commented 4 years ago

@deadlyvices has been exploring this and reporting in email.

ACTION: copy any relevant past emails here...

petermr commented 4 years ago

Great, very useful for getting a chemical frequency table, and also for interpreting it.

What I'm doing is to try to find the chemical name column in each table so we have a vector of names. Will take an hour I suspect. P.

On Thu, Oct 31, 2019 at 2:47 PM Clyde Davies notifications@github.com wrote:

Now converting the names to structures!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petermr/CEVOpen/issues/38?email_source=notifications&email_token=AAFTCSZL3KZ7AZ4LNFGOEXLQRLVZRA5CNFSM4JAOYUA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECYBPSQ#issuecomment-548411338, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFTCSYSMCECJDXRINWI3YDQRLVZRANCNFSM4JAOYUAQ .

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

deadlyvices commented 4 years ago

What are you basing the vector on? The document/section/table?

deadlyvices commented 4 years ago

I'm currently using OPSIN against the names I've identified. Will be interesting to see what the hit rate is like

petermr commented 4 years ago

The column/s in the table.

On Thu, Oct 31, 2019 at 2:54 PM Clyde Davies notifications@github.com wrote:

What are you basing the vector on? The document/section/table?


petermr commented 4 years ago

That will largely depend on whether the names are in ChEBI I suspect. Ambarish is looking them up against Wikidata (which includes ChEBI) and then PubChem. I expect you will get at least 97% hit rate (just a guess).

On Thu, Oct 31, 2019 at 2:58 PM Clyde Davies notifications@github.com wrote:

I'm currently using OPSIN against the names I've identified. Will be interesting to see what the hit rate is like


deadlyvices commented 4 years ago

That'll be interesting to see. I know KNIME has a semantic web toolset, which I've just installed. Might be interesting to see if we can get it querying Wikidata.

deadlyvices commented 4 years ago

I think, though, that before that I should have a stab at creating this new tag set. I have been programming for many years, so general experience should give me a leg up with Eclipse and Java.

deadlyvices commented 4 years ago

8181 names couldn't be resolved, 2082 names could. So we're looking at just over a 20% hit rate now.
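The hit rate quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the two counts come from the comment, while the function name `hit_rate` is made up for illustration and is not part of KNIME or OPSIN.

```python
# Counts reported in the comment above.
unresolved = 8181
resolved = 2082


def hit_rate(resolved: int, unresolved: int) -> float:
    """Return the fraction of names that were converted to a structure."""
    total = resolved + unresolved
    if total == 0:
        raise ValueError("no names were processed")
    return resolved / total


rate = hit_rate(resolved, unresolved)
print(f"{resolved} of {resolved + unresolved} names resolved ({rate:.1%})")
```

So 2082 of 10263 names resolved, which is the "just over 20%" mentioned.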

petermr commented 4 years ago

Where are these names from? OSCAR?

On Fri, Nov 1, 2019 at 12:40 PM Clyde Davies notifications@github.com wrote:

8181 names couldn't be resolved, 2082 names could. So we're looking at just over a 20% hit rate now.


deadlyvices commented 4 years ago

From the KNIME help documentation:

Term To Structure: Converts the terms of the specified term column to molecule strings formatted by a specified format using the OSCAR framework (see https://bitbucket.org/wwmm/oscar4/overview for details). Based on OSCAR version 4.2.2 using OPSIN 1.6, molecules are translated into e.g. SMILES strings. If a term cannot be translated a missing cell is returned.

After that, a bit of jiggery-pokery to convert the SMILES to structures.

Sent from Mail for Windows 10

From: petermr
Sent: 01 November 2019 13:12
To: petermr/CEVOpen
Cc: Clyde Davies; Mention
Subject: Re: [petermr/CEVOpen] Explore KNIME functionality on articles and CProjects (#38)

Where are these names from? OSCAR?


deadlyvices commented 4 years ago

I enhanced the workflow and let it rip on oil1000. My work machine is probably just about up to the task. One of the enhancements was to add a topic analysis component. I filtered the article abstracts for mentions of Rosmarinus officinalis and ran the topic analysis against the filtered documents. This is an example of what comes out: image

deadlyvices commented 4 years ago

I also tried linking terms together in a network view. Pretty crude but it demonstrates what we can do: image

EmanuelFaria commented 4 years ago

That is amazing!!

... but can you set it to be my new music visualizer??

I’m thinking techno https://youtu.be/b8qBUza1pO0

Sent with GitHawk

EmanuelFaria commented 4 years ago

Seriously though, this would be so useful to me to see the correlation between activities such as “irritant” linked to essential oils, or better yet their common constituents. Knowing what constituents are the main activity factors is a game changer


deadlyvices commented 4 years ago

It's going to get even more colourful! One of the obstacles up to now has been that I've had to force fit our classifications into a standard set of tags, which aren't good. I had a quiet afternoon so I got my head around the custom tag sets in KNIME and wrote a little Java plugin to create our own. I'll get these wired in over the next day or so.

petermr commented 4 years ago

thanks,

On Mon, Nov 4, 2019 at 6:12 PM Clyde Davies notifications@github.com wrote:

It's going to get even more colourful! One of the obstacles up to now has been that I've had to force fit our classifications into a standard set of tags, which aren't good. I had a quiet afternoon so I got my head around the custom tag sets in KNIME and wrote a little Java plugin to create our own. I'll get these wired in over the next day or so.


EmanuelFaria commented 4 years ago

It's going to get even more colourful! One of the obstacles up to now has been that I've had to force fit our classifications into a standard set of tags, which aren't good. I had a quiet afternoon so I got my head around the custom tag sets in KNIME and wrote a little Java plugin to create our own. I'll get these wired in over the next day or so.

@deadlyvices hot damn that is incredible


deadlyvices commented 4 years ago

OK, progress to date:

If you want to see this in action I suggest an online session

deadlyvices commented 4 years ago

This is my mucking about with the network view. I've filtered out relationships that have a document co-occurrence <20 and sentence co-occurrence <3: image

So you can see it gives some idea about how the major concepts inter-relate.
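The thresholding described above can be sketched in plain Python. This is not the KNIME workflow itself. The `Edge` type, its field names, and the sample rows are all hypothetical, and it assumes a relationship is kept only when it meets both thresholds (the comment could also be read as dropping only pairs that fail both).

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One relationship between two terms (hypothetical structure)."""
    term_a: str
    term_b: str
    doc_cooccurrence: int       # documents that mention both terms
    sentence_cooccurrence: int  # sentences that mention both terms


# Thresholds taken from the comment above.
MIN_DOCS = 20
MIN_SENTENCES = 3


def filter_edges(edges, min_docs=MIN_DOCS, min_sentences=MIN_SENTENCES):
    """Keep edges that reach both thresholds (assumed interpretation)."""
    return [
        edge for edge in edges
        if edge.doc_cooccurrence >= min_docs
        and edge.sentence_cooccurrence >= min_sentences
    ]


# Made-up rows, purely to exercise the filter.
edges = [
    Edge("term A", "term B", 42, 7),
    Edge("term A", "term C", 25, 1),   # too few shared sentences
    Edge("term B", "term C", 5, 4),    # too few shared documents
]
kept = filter_edges(edges)
print(kept)
```

Raising either threshold thins the graph further, which is one way to keep the network view readable as the corpus grows.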

deadlyvices commented 4 years ago

I'm just letting it rip on oil1000 now. 25,584,800 terms identified so far! "I am just going outside and may be some time" ...

petermr commented 4 years ago

This is great. (These types of graph get unreadable pretty quickly with scale.) I am making progress on extracting the table columns, which will give several hundred plant-compound vectors. Currently analysing the table structures. When is a good time to talk?

petermr commented 4 years ago

@deadlyvices wrote: "oil1000 now. 25,584,800 terms identified so far! " That's an awful lot. It's 25K terms PER ARTICLE. There aren't that number of words in most articles. ??

deadlyvices commented 4 years ago

Sorry - 25M term pairs. That's what I meant! So that generates the co-occurrence counts.
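To see why pair counts grow so much faster than term counts, here is a minimal sketch of document-level co-occurrence counting. It is an illustration only, not how the KNIME node works. The function name `count_pairs` and the sample term lists are invented for the example.

```python
from collections import Counter
from itertools import combinations


def count_pairs(documents):
    """Count, for each unordered term pair, the documents containing both."""
    counts = Counter()
    for terms in documents:
        unique_terms = sorted(set(terms))  # each pair counts once per document
        counts.update(combinations(unique_terms, 2))
    return counts


# Invented term lists standing in for three tagged articles.
documents = [
    ["limonene", "linalool", "antibacterial"],
    ["limonene", "linalool"],
    ["linalool", "antibacterial", "irritant"],
]
pair_counts = count_pairs(documents)
print(pair_counts[("limonene", "linalool")])

# A document with n distinct terms contributes n * (n - 1) / 2 pairs.
n = 200
print(n * (n - 1) // 2)
```

Because the number of pairs is quadratic in the number of distinct terms per document, a thousand articles can plausibly yield tens of millions of pair records even though no single article contains that many words.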

deadlyvices commented 4 years ago

I am currently running on a quad-processor workstation with 32 GB of RAM, and I've maxed out both processor and memory. Might be time to rethink this approach.

deadlyvices commented 4 years ago

Well it's finished now. Took a couple of hours.

I appreciate the dictionaries themselves are a work in progress, but there appear to be a few versions around with numerical codes (dates?) suffixed. Do we really need to do this if we're using git? The tip revision ought to be what we use.

petermr commented 4 years ago

On Tue, Nov 5, 2019 at 1:14 PM Clyde Davies notifications@github.com wrote:

Well it's finished now. Took a couple of hours.

I appreciate the dictionaries themselves are a work in progress, but there appear to be a few versions around with numerical codes (dates?) suffixed. Do we really need to do this if we're using git? The tip revision ought to be what we use.

The dictionaries will be used outside the GIT platform. If people download a dictionary they may copy it elsewhere. Hardcoded version numbers are the best way of preserving the identity. Yes we also need checksums but these can be fragile.
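A small sketch of the trade-off described above, using only the Python standard library. The file name, the `_<digits>` suffix pattern, and the one-line file content are all hypothetical and are not taken from the actual dictionaries.

```python
import hashlib
import re


def version_from_name(filename: str):
    """Return the trailing numeric code in a file name, or None if absent."""
    match = re.search(r"_(\d+)\.[A-Za-z0-9]+$", filename)
    return match.group(1) if match else None


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 checksum of the given bytes as hex."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical dictionary file name with a date-like suffix.
name = "example_dictionary_20191105.xml"

# The same logical content saved with Unix and Windows line endings.
unix_copy = b"<entry term='example'/>\n"
windows_copy = b"<entry term='example'/>\r\n"

print(version_from_name(name))
print(sha256_hex(unix_copy) == sha256_hex(windows_copy))
```

The version code in the name survives copying and re-saving, whereas the checksum changes as soon as a tool rewrites line endings or whitespace. That is one sense in which checksums "can be fragile".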


deadlyvices commented 4 years ago

Probably a good idea for us to meet and talk about what to do next. And to make sure you can now run this thing on your Mac.

petermr commented 4 years ago

On Thu, Nov 7, 2019 at 9:54 AM Clyde Davies notifications@github.com wrote:

Probably a good idea for us to meet

Physically or remotely?

and talk about what to do next.

Yes

And to make sure you can now run this thing on your Mac.

that will be good.

I expect to have a significant number of composition tables soon and that will allow us to do real science!


deadlyvices commented 4 years ago

Remotely.
I've just tried processing the composition tables. Are they of a fixed format, or variable structure?

petermr commented 4 years ago

Variable at present. Will try to reduce them to 2-column tables
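One possible way to reduce a variable-layout table to two columns is to pick the columns by matching header words. This is a rough sketch under stated assumptions, not the template schema mentioned later in the thread. The hint words, the function names, and the sample table are all invented for illustration.

```python
# Assumed header keywords; real tables would need a richer list.
NAME_HINTS = ("compound", "constituent", "component")
VALUE_HINTS = ("%", "percent", "area", "content")


def find_column(header, hints):
    """Return the index of the first header cell containing any hint."""
    for index, title in enumerate(header):
        lowered = title.lower()
        if any(hint in lowered for hint in hints):
            return index
    return None


def to_two_columns(table):
    """Reduce a table (list of rows, header first) to (name, value) pairs.

    Returns None when either column cannot be identified.
    """
    header, *rows = table
    name_col = find_column(header, NAME_HINTS)
    value_col = find_column(header, VALUE_HINTS)
    if name_col is None or value_col is None:
        return None
    return [(row[name_col], row[value_col]) for row in rows]


# Made-up composition table with extra columns.
table = [
    ["No.", "RI", "Compound", "Area (%)"],
    ["1", "1000", "compound A", "12.5"],
    ["2", "1100", "compound B", "3.1"],
]
print(to_two_columns(table))
```

Tables whose headers match none of the hints come back as `None`, so they can be set aside for manual inspection rather than silently mis-read.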

On Thu, Nov 7, 2019 at 1:05 PM Clyde Davies notifications@github.com wrote:

Remotely. I've just tried processing the composition tables. Are they of a fixed format, or variable structure?


deadlyvices commented 4 years ago

How's this effort going? I haven't had much time to explore the handling of variable column names lately. I know it's going to be tricky.

petermr commented 4 years ago

Have a powerful template schema and making progress. Been with Henry Rzepa today

On Wed, 20 Nov 2019, 10:48 Clyde Davies, notifications@github.com wrote:

How's this effort going? I haven't had much time to explore the handling of variable column names lately. I know it's going to be tricky.


deadlyvices commented 4 years ago

Cool! Give Henry my regards: I think we met years ago back in that Cambridge hackathon you organised.

On Wed, Nov 20, 2019 at 2:42 PM petermr notifications@github.com wrote:

Have a powerful template schema and making progress. Been with Henry Rzepa today


-- Clyde