Open petermr opened 4 years ago
Great, very useful for getting a chemical frequency table, and also for interpreting it.
What I'm doing is to try to find the chemical name column in each table so we have a vector of names. Will take an hour I suspect. P.
On Thu, Oct 31, 2019 at 2:47 PM Clyde Davies notifications@github.com wrote:
Now converting the names to structures!
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
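One way the search for the chemical name column could be sketched, assuming each table has already been read into a list of rows of strings. The `looks_like_name` heuristic and the function names are hypothetical illustrations, not the method actually used in this project:

```python
import re

# Hypothetical heuristic: cells made up only of digits, punctuation and
# percent/range symbols are treated as numeric, everything else with at
# least one letter is treated as a candidate name.
_NUMERIC = re.compile(r"^[\s\d.,%±<>-]*$")

def looks_like_name(cell):
    text = cell.strip()
    return bool(text) and not _NUMERIC.match(text) and any(ch.isalpha() for ch in text)

def guess_name_column(rows):
    """Return the index of the column with the highest share of name-like cells."""
    n_cols = max(len(row) for row in rows)
    scores = []
    for col in range(n_cols):
        cells = [row[col] for row in rows if col < len(row)]
        scores.append(sum(looks_like_name(c) for c in cells) / len(cells))
    return max(range(n_cols), key=scores.__getitem__)

def name_vector(rows):
    """Collect the name-like cells of the guessed column as a vector of names."""
    col = guess_name_column(rows)
    return [row[col].strip() for row in rows
            if col < len(row) and looks_like_name(row[col])]

table = [
    ["1", "α-pinene", "12.3"],
    ["2", "limonene", "4.5"],
    ["3", "1,8-cineole", "30.1"],
]
print(name_vector(table))
```

A content-based guess like this will misfire on tables with several text columns (plant part, location and so on), so it is only a starting point.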
What are you basing the vector on? The document/section/table?
I'm currently using OPSIN against the names I've identified. Will be interesting to see what the hit rate is like
The column/s in the table.
On Thu, Oct 31, 2019 at 2:54 PM Clyde Davies notifications@github.com wrote:
What are you basing the vector on? The document/section/table?
That will largely depend on whether the names are in ChEBI I suspect. Ambarish is looking them up against Wikidata (which includes ChEBI) and then PubChem. I expect you will get at least 97% hit rate (just a guess).
On Thu, Oct 31, 2019 at 2:58 PM Clyde Davies notifications@github.com wrote:
I'm currently using OPSIN against the names I've identified. Will be interesting to see what the hit rate is like
That'll be interesting to see. I know KNIME has a semantic web toolset, which I've just installed. Might be interesting to see if we can get it querying Wikidata.
I think, though, that before that I should have a stab at creating this new tag set. I have been programming for many years, so general experience should give me a leg up with Eclipse and Java.
8181 names couldn't be resolved, 2082 names could. So we're looking at just over a 20% hit rate now.
Where are these names from? OSCAR?
On Fri, Nov 1, 2019 at 12:40 PM Clyde Davies notifications@github.com wrote:
8181 names couldn't be resolved, 2082 names could. So we're looking at just over a 20% hit rate now.
From the KNIME help documentation:
Term To Structure: Converts the terms of the specified term column to molecule strings formatted by a specified format using the OSCAR framework (see https://bitbucket.org/wwmm/oscar4/overview for details). Based on OSCAR version 4.2.2 using OPSIN 1.6, molecules are translated into e.g. SMILES strings. If a term cannot be translated, a missing cell is returned.
After that, a bit of jiggery-pokery to convert the SMILES to structures.
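A minimal sketch of the "missing cell" behaviour and of the hit-rate arithmetic quoted earlier in the thread (2082 resolved, 8181 not). The `TOY_LOOKUP` dictionary is a hypothetical stand-in for a real name-to-structure converter and is not OPSIN or the KNIME node itself:

```python
def resolve_names(names, resolver):
    """Map each name to a structure string, or None when it cannot be translated."""
    return {name: resolver(name) for name in names}

def hit_rate(resolved, unresolved):
    """Fraction of names that were resolved."""
    return resolved / (resolved + unresolved)

# Toy stand-in for a real converter (an assumption for illustration only).
TOY_LOOKUP = {"ethanol": "CCO", "benzene": "c1ccccc1"}

results = resolve_names(["ethanol", "benzene", "oil of rosemary"], TOY_LOOKUP.get)
hits = sum(smiles is not None for smiles in results.values())
print(hits, len(results) - hits)              # 2 1
print(round(100 * hit_rate(2082, 8181), 1))   # 20.3 (counts from this thread)
```

Keeping the unresolved names as `None` rather than dropping them makes it easy to pass them on to a second lookup such as Wikidata or PubChem.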
From: petermr Sent: 01 November 2019 13:12 To: petermr/CEVOpen Cc: Clyde Davies; Mention Subject: Re: [petermr/CEVOpen] Explore KNIME functionality on articles andCProjects (#38)
Where are these names from? OSCAR?
I enhanced the workflow and let it rip on oil1000. My works machine is probably just about up to the task of doing this. One of the enhancements was to add a topic analysis component. I filtered the article abstracts for mentions of Rosmarinus officinalis and ran it against the filtered documents. This is an example of what comes out:
I also tried linking terms together in a network view. Pretty crude but it demonstrates what we can do:
That is amazing!!
... but can you set it to be my new music visualizer??
I’m thinking techno https://youtu.be/b8qBUza1pO0
Seriously though, this would be so useful to me to see the correlation between activities such as “irritant” linked to essential oils, or better yet their common constituents. Knowing what constituents are the main activity factors is a game changer
It's going to get even more colourful! One of the obstacles up to now has been that I've had to force fit our classifications into a standard set of tags, which aren't good. I had a quiet afternoon so I got my head around the custom tag sets in KNIME and wrote a little Java plugin to create our own. I'll get these wired in over the next day or so.
thanks,
On Mon, Nov 4, 2019 at 6:12 PM Clyde Davies notifications@github.com wrote:
It's going to get even more colourful! One of the obstacles up to now has been that I've had to force fit our classifications into a standard set of tags, which aren't good. I had a quiet afternoon so I got my head around the custom tag sets in KNIME and wrote a little Java plugin to create our own. I'll get these wired in over the next day or so.
@deadlyvices hot damn that is incredible
OK, progress to date:
If you want to see this in action I suggest an online session
This is my mucking about with the network view. I've filtered out relationships that have a document co-occurrence <20 and sentence co-occurrence <3, so you can see it gives some idea of how the major concepts inter-relate.
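The co-occurrence filter described above could be sketched roughly as follows. This is not the KNIME workflow itself: the input layout (documents as lists of sentences, sentences as sets of terms) is an assumption, and so is the reading that a pair is kept only when it meets both thresholds:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """documents: list of documents, each a list of sentences, each a set of terms."""
    doc_counts, sent_counts = Counter(), Counter()
    for sentences in documents:
        # Every pair of distinct terms appearing anywhere in the same document.
        doc_terms = set().union(*sentences)
        doc_counts.update(combinations(sorted(doc_terms), 2))
        # Every pair of distinct terms appearing in the same sentence.
        for terms in sentences:
            sent_counts.update(combinations(sorted(terms), 2))
    return doc_counts, sent_counts

def filter_edges(doc_counts, sent_counts, min_doc=20, min_sent=3):
    """Keep pairs with document count >= min_doc and sentence count >= min_sent."""
    return {
        pair: (doc_counts[pair], sent_counts[pair])
        for pair in doc_counts
        if doc_counts[pair] >= min_doc and sent_counts[pair] >= min_sent
    }

docs = [
    [{"a", "b"}, {"a", "c"}],
    [{"a", "b"}],
]
doc_counts, sent_counts = cooccurrence_counts(docs)
print(filter_edges(doc_counts, sent_counts, min_doc=2, min_sent=2))
```

The surviving pairs and their counts can then be used as weighted edges in a network view.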
I'm just letting it rip on oil1000 now. 25,584,800 terms identified so far! "I am just going outside and may be some time" ...
This is great. (These types of graph get unreadable pretty quickly with scale.) I am making progress on extracting the table columns, which will give several hundred plant-compoundVectors. Currently analysing the table structures. When is a good time to talk?
@deadlyvices wrote: "oil1000 now. 25,584,800 terms identified so far! " That's an awful lot. It's 25K terms PER ARTICLE. There aren't that number of words in most articles. ??
Sorry - 25M term pairs. That's what I meant! So that generates the co-occurrence counts.
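A rough back-of-envelope check on why pair counts grow so quickly. It assumes oil1000 holds about 1000 articles and that every distinct term in an article is paired with every other; neither assumption is confirmed in this thread:

```python
import math

total_pairs = 25_584_800   # figure quoted in this thread
articles = 1000            # assumption: oil1000 holds about 1000 articles

pairs_per_article = total_pairs / articles

def pairs(n):
    """Number of unordered pairs among n distinct terms."""
    return n * (n - 1) // 2

# Smallest n whose pair count reaches the per-article average.
n = math.ceil((1 + math.sqrt(1 + 8 * pairs_per_article)) / 2)
print(pairs_per_article, n, pairs(n))   # 25584.8 227 25651
```

Under those assumptions a couple of hundred distinct terms per article is already enough to produce roughly 25K pairs per article, because the pair count grows with the square of the term count.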
I am currently running on a quad-processor workstation with 32 GB of RAM, and I've maxed out both processor and memory. Might be time to rethink this approach.
Well it's finished now. Took a couple of hours.
I appreciate the dictionaries themselves are a work in progress, but there appear to be a few versions around with numerical codes (dates?) suffixed. Do we really need to do this if we're using git? The tip revision ought to be what we use.
On Tue, Nov 5, 2019 at 1:14 PM Clyde Davies notifications@github.com wrote:
Well it's finished now. Took a couple of hours.
I appreciate the dictionaries themselves are a work in progress, but there appear to be a few versions around with numerical codes (dates?) suffixed. Do we really need to do this if we're using git? The tip revision ought to be what we use.
The dictionaries will be used outside the Git platform. If people download a dictionary they may copy it elsewhere. Hardcoded version numbers are the best way of preserving the identity. Yes we also need checksums but these can be fragile.
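A small sketch of how a version suffix and a checksum could be read together for a downloaded dictionary. The `<name>_<YYYYMMDD>.<ext>` naming pattern and the example filename are hypothetical; the thread only says that numerical codes are suffixed:

```python
import hashlib
import re

# Hypothetical naming convention: <name>_<YYYYMMDD>.<ext>, e.g. "plants_20191105.xml".
VERSIONED = re.compile(r"^(?P<name>.+)_(?P<version>\d{8})\.(?P<ext>\w+)$")

def parse_versioned_name(filename):
    """Return (name, version) for a versioned filename, or None if no suffix is found."""
    match = VERSIONED.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version")

def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(parse_versioned_name("plants_20191105.xml"))
```

A checksum changes with any byte-level difference, including a change of line endings or re-serialised XML, which is one way it can be fragile; the version in the name survives such changes.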
Probably a good idea for us to meet and talk about what to do next. And to make sure you can now run this thing on your Mac.
On Thu, Nov 7, 2019 at 9:54 AM Clyde Davies notifications@github.com wrote:
Probably a good idea for us to meet
Physically or remotely?
and talk about what to do next.
Yes
And to make sure you can now run this thing on your Mac.
that will be good.
I expect to have a significant number of composition tables soon and that will allow us to do real science!
Remotely.
I've just tried processing the composition tables. Are they of a fixed format, or variable structure?
Variable at present. Will try to reduce them to 2-column tables
On Thu, Nov 7, 2019 at 1:05 PM Clyde Davies notifications@github.com wrote:
Remotely. I've just tried processing the composition tables. Are they of a fixed format, or variable structure?
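Reducing variable composition tables to two columns could look something like this sketch, which matches column headers against synonym lists. The header synonyms below are hypothetical examples, not taken from the actual tables or the template schema:

```python
# Hypothetical header synonyms; real tables will need a longer, curated list.
NAME_HEADERS = {"compound", "compounds", "component", "components",
                "constituent", "constituents"}
AMOUNT_HEADERS = {"%", "area %", "content (%)", "percentage", "composition (%)"}

def find_column(header, candidates):
    """Index of the first header cell matching one of the candidate labels."""
    for index, label in enumerate(header):
        if label.strip().lower() in candidates:
            return index
    return None

def to_two_columns(header, rows):
    """Return (name, amount) pairs, or None when the layout is not recognised."""
    name_col = find_column(header, NAME_HEADERS)
    amount_col = find_column(header, AMOUNT_HEADERS)
    if name_col is None or amount_col is None:
        return None  # leave unrecognised layouts for manual inspection
    return [(row[name_col].strip(), row[amount_col].strip())
            for row in rows if len(row) > max(name_col, amount_col)]

header = ["No.", "Compound", "RI", "Area %"]
rows = [["1", "α-pinene", "939", "12.3"], ["2", "limonene", "1029", "4.5"]]
print(to_two_columns(header, rows))
```

Returning `None` for unknown layouts keeps the tricky variable-header cases visible instead of silently guessing.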
How's this effort going? I haven't had much time to explore the handling of variable column names lately. I know it's going to be tricky.
Have a powerful template schema and making progress. Been with Henry Rzepa today
On Wed, 20 Nov 2019, 10:48 Clyde Davies, notifications@github.com wrote:
How's this effort going? I haven't had much time to explore the handling of variable column names lately. I know it's going to be tricky.
Cool! Give Henry my regards: I think we met years ago back in that Cambridge hackathon you organised.
On Wed, Nov 20, 2019 at 2:42 PM petermr notifications@github.com wrote:
Have a powerful template schema and making progress. Been with Henry Rzepa today
-- Clyde
@deadlyvices has been exploring this and reporting in email.
ACTION: copy any relevant past emails here...