petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Lookup unknown compounds in wikidata #62

Closed petermr closed 4 years ago

petermr commented 4 years ago

compounds in oil186 not in dictionary

https://github.com/petermr/CEVOpen/blob/master/searches/oil186/__tables/notFound.txt is a list of "new" compounds in oil186 papers.

Create a 2-column table of these names and the Wikidata identifiers where found.

Do not try to look up the explicit chemical names, such as:

tricyclo [4.4.0.0(2,7)] dec-3-ene-3-methanol, 1-methyl-8-(1-methylethyl)-

ambarishK commented 4 years ago

Sir, please go through the file for wikidata id - notFoundCompWIKIDATAPubChem.tsv.

Column description.

ambarishK commented 4 years ago

Sir, please review the sheet for WIKIDATA lookup - notFoundCompWIKIDATAPubChem.tsv

Please define a template for the PubChem lookup of compounds: which columns should be included in the sheet?

We have two available services:

- Identifier exchange services provide CIDs, InChIs, InChIKeys, SMILES or synonyms.
- Download services take the retrieved CIDs and provide the IUPAC name, synonyms, InChIKey etc.

The PubChem lookup will be a two-step process: first retrieve CIDs from the identifier exchange services, then use the download services to look up additional information about each compound from its CID.
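The two-step lookup can also be sketched programmatically. The example below is a minimal sketch that swaps in PubChem's PUG REST interface for the identifier exchange and download web forms mentioned in this comment; the helper names `cid_url`, `property_url`, `fetch_json` and `lookup` are hypothetical, and the choice of properties is only an assumption about what the sheet might need.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_url(name: str) -> str:
    """Step 1: URL that resolves a compound name to PubChem CIDs."""
    return f"{PUG}/compound/name/{quote(name, safe='')}/cids/JSON"


def property_url(cid: int, props=("IUPACName", "InChIKey")) -> str:
    """Step 2: URL that returns selected properties for a known CID."""
    return f"{PUG}/compound/cid/{cid}/property/{','.join(props)}/JSON"


def fetch_json(url: str) -> dict:
    """Perform the HTTP GET (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def lookup(name: str) -> dict:
    """Run both steps for one name; an unknown name raises an HTTPError."""
    cids = fetch_json(cid_url(name))["IdentifierList"]["CID"]
    properties = fetch_json(property_url(cids[0]))["PropertyTable"]["Properties"]
    return properties[0]
```

Calling `lookup("thymol")` would issue two requests, one per step. For a sheet of several hundred names it would be polite to pause between calls.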

ambarishK commented 4 years ago

Sir, revised sheet for WIKIDATA identifier addition - notFoundCompWIKIDATAPubChem.tsv.

ambarishK commented 4 years ago

Sir, revised sheet for WIKIDATA and PubChem lookup- notFoundCompWIKIDATAPubChem.tsv

I have cleaned compound names to support PubChem lookup.

**Replace Greek letters with their spelled-out names.** e.g. α -> alpha, β -> beta, γ -> gamma, δ -> delta etc.

Compound Example - 

α-cedrene -> alpha-cedrene  (PubChem CID - 6431015) 
γ-eudesmol -> gamma-eudesmol (PubChem CID - 6432005)
β-gurjunene -> beta-gurjunene (PubChem CID - 6450812) 

**Isomeric notations made into capital letters.** e.g. (e,e) -> (E,E); (2e, 6z) -> (2E,6Z) etc.

Compound Example -

(z)-α-santalol -> (Z)-alpha-santalol ( PubChem CID - 11085337)
(e)-2-isopropyl-5-methylphenyl 2-methylbut-2-enoate -> (E)-2-Isopropyl-5-methylphenyl 2-methylbut-2-enoate (PubChem CID - 91698167)
(e)-β-ocimene -> (E)-beta-ocimene (Pubchem CID - 5281553) 

**Proper hyphen notation.**  (–) -> (-)

Compound Example - 

(−)-spathulenol -> (-)-spathulenol

**Trimming extra white spaces**

Extracted count of records - 465.
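The four cleaning rules listed in this comment could be combined into one function. This is a minimal sketch, not the script actually used for the sheet: the name `normalize_name`, the Greek-letter table (only the four letters named above) and the exact set of dash characters are assumptions.

```python
import re

# Only the four letters named in the comment; extend as needed.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}
# Unicode hyphen variants, en dash and minus sign, written as escapes.
DASHES = "\u2010\u2011\u2012\u2013\u2212"
# A parenthesised E/Z descriptor such as (e), (e,e) or (2e, 6z).
STEREO = re.compile(r"\(\s*\d*[ez](?:\s*,\s*\d*[ez])*\s*\)", re.IGNORECASE)


def normalize_name(name: str) -> str:
    """Apply the Greek-letter, hyphen, E/Z and whitespace rules in turn."""
    for letter, word in GREEK.items():
        name = name.replace(letter, word)
    name = re.sub(f"[{DASHES}]", "-", name)
    # Upper-case the descriptor and drop any spaces inside the parentheses.
    name = STEREO.sub(lambda m: re.sub(r"\s+", "", m.group(0)).upper(), name)
    # Trim and collapse extra white space.
    return " ".join(name.split())


print(normalize_name("(z)-α-santalol"))  # (Z)-alpha-santalol
```

The rules are applied in a fixed order, so a name such as `(e)-β-ocimene` becomes `(E)-beta-ocimene` in one pass.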

petermr commented 4 years ago

Well done.

This will help to normalize the names.

We can then make a new dictionary that can be used for searching the content of the tables in the articles.

This will probably be the next phase.

On Mon, Dec 2, 2019 at 8:52 AM Ambarish Kumar notifications@github.com wrote:

Sir, revised sheet for WIKIDATA and PubChem lookup- notFoundCompWIKIDATAPubChem.tsv https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/notFoundCompWIKIDATAPubChem.tsv


— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/petermr/CEVOpen/issues/62?email_source=notifications&email_token=AAFTCSYMIDDU7TPWSFQ4TDDQWTEE3A5CNFSM4JSHK3RKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFSXBWQ#issuecomment-560296154, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFTCSZQXYWMLTAHMGKG7HLQWTEE3ANCNFSM4JSHK3RA .

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

ambarishK commented 4 years ago

Yes sir. The next course of action is to normalize the extracted compound names.

Please suggest how to automate the identification of compound synonyms.

petermr commented 4 years ago

At present I think the simplest approach is to include synonyms as entries in their own right, e.g.

That allows us to search for either term, but connect them later through Wikidata.

If we don't have Wikidata we should use PubChem and add the Wikidata entry as soon as possible.

Upper/lowercase is a problem. I think the term should be completely lowercase, but the name should include uppercase where appropriate:


ambarishK commented 4 years ago

OK sir.

ambarishK commented 4 years ago

Sir, Please give an example for dictionary entry containing pubchem CID (instead of wikidata entry). Is it like <entry term="thymol" pubchem_cid="6989" />?

petermr commented 4 years ago

Yes that would be fine


petermr commented 4 years ago

Yes, follow this strategy. Keep the "term" lowercase and the "name" uppercase.
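A dictionary entry following this strategy could be generated as below. This is a minimal sketch: `term`, `name` and `pubchem_cid` come from the thread, while the helper `make_entry` and the attribute name `wikidataID` are hypothetical and should be replaced by whatever the dictionary format actually uses.

```python
import xml.etree.ElementTree as ET


def make_entry(name, wikidata_id=None, pubchem_cid=None):
    """Build one <entry>: lowercase 'term' for matching, cased 'name' for display."""
    entry = ET.Element("entry", {"term": name.lower(), "name": name})
    if wikidata_id:
        entry.set("wikidataID", wikidata_id)  # hypothetical attribute name
    elif pubchem_cid:
        # Fallback when no Wikidata entry exists yet, as agreed in the thread.
        entry.set("pubchem_cid", str(pubchem_cid))
    return entry


entry = make_entry("Thymol", pubchem_cid=6989)
print(ET.tostring(entry, encoding="unicode"))
```

Preferring the Wikidata identifier and falling back to the PubChem CID mirrors the earlier suggestion to use PubChem only until a Wikidata entry is available.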


deadlyvices commented 4 years ago

Can I suggest that it might be useful to do a similarity analysis of all the compound names we now have? I've used KNIME to calculate Levenshtein distances in the past. It would help us identify mistakes, alternative spellings etc.


petermr commented 4 years ago

That's a good idea, but it probably comes later. It's probable that either or both of the PubChem and Wikidata searches already use Levenshtein or something similar. And of course some distinct names are very similar: the Levenshtein distance between propene and propane is very small. It's most useful for typos, and so far these are rare.
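To make the propene/propane point concrete, here is a small pure-Python sketch of the Levenshtein distance (the standard dynamic-programming form, not the KNIME node mentioned earlier in the thread). The function name is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("propene", "propane"))  # 1
```

A distance of 1 separates two genuinely different compounds here, which is why a distance threshold alone cannot tell a typo from a distinct name.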


ambarishK commented 4 years ago

Sir, please go through the updated sheet for WIKIDATA and PubChem lookup -notFoundCompWIKIDATAPubChem.tsv

Added column for cleaned compound names - Cleaned_cmpnames

Compound | Cleaned_cmpnames
butanoic acid | Butanoic acid
β-linalool | beta-Linalool
p-cymene | p-Cymene
n-hexanol | n-Hexanol
hexan-1-ol | Hexan-1-ol

lubianat commented 4 years ago

Hello all,

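I've been absent from the project for a while, sorry. I looked at the table and I can quickly make a script that:

- Looks for PubChem_CIDs in Wikidata via SPARQL and retrieves the associated QIDs
- If a QID does not exist on Wikidata, tags the entry in the WIKIDATA_id column as NOT FOUND.

This is only for the already curated PubChem IDs. Would that be useful?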
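A script of this kind, which looks up PubChem CIDs in Wikidata via SPARQL and tags missing ones as NOT FOUND, might look like the sketch below. It is not the code from the later pull request. It assumes the Wikidata property P662 (PubChem CID) and the public query endpoint; the function names and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(cids):
    """SPARQL that maps each PubChem CID (property P662) to its Wikidata item."""
    values = " ".join(f'"{int(cid)}"' for cid in cids)
    return (
        "SELECT ?cid ?item WHERE { "
        f"VALUES ?cid {{ {values} }} "
        "?item wdt:P662 ?cid . }"
    )


def fetch_qids(cids):
    """Return {cid: QID or 'NOT FOUND'} (needs network access)."""
    url = ENDPOINT + "?" + urlencode({"query": build_query(cids), "format": "json"})
    request = Request(url, headers={"User-Agent": "cid-to-qid-sketch/0.1"})
    with urlopen(request, timeout=60) as response:
        rows = json.load(response)["results"]["bindings"]
    # Item values are entity URIs; keep only the trailing QID.
    found = {r["cid"]["value"]: r["item"]["value"].rsplit("/", 1)[-1] for r in rows}
    return {str(cid): found.get(str(cid), "NOT FOUND") for cid in cids}
```

Batching the CIDs into one `VALUES` clause keeps the number of requests low; very long lists may need to be split into chunks.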

ambarishK commented 4 years ago

Hi, I have committed the sheet compound_multiset.tsv. Please go through it and make any necessary changes.

I would like to discuss how to frame an automated SPARQL query.

ambarishK commented 4 years ago

Sir, Please go through the copy of compound_multiset.tsv

28 entries are not retrieved from PubChem.

For example -


1,3-di-p-coumaroylglycerol  | 2 |  not found  |  not found  | not found  |  not found |  

2-acetyl-1,3-di-caffeoylglycerol  | 2 |  not found |  not found |  not found |  not found |  

2-acetyl-1,3-di-p-coumaroylglycerol  | 2 |  not found |  not found |  not found |  not found |  

2-acetyl-p-3-coumaroyl-1-feruloylglycerol  | 2 |  not found  | not found |  not found |  not found |  

2-acetylo-3-caffeoyl-1-feruloylglycerol  | 2 |  not found  | not found  |  not found  |  not found |  

p-coumaric acid benzyl ester  | 2 |  (e)-p-coumaric acid  |  Q99374  |  not found  |  not found

92 entries are not retrieved from WIKIDATA.

For example -


p-cymen-7-ol  | 2 |  not found  |  not found  |  325  |  4-isopropylbenzyl alcohol |  

kaurene  | 2 |  not found  |  not found  |  520687  |  kaurene

khusinol  | 2 |  not found |  not found |  91746535  | khusinol

In the batch retrieval of compound CIDs from the PubChem identifier exchange service, around 100 CIDs were not retrieved (mostly the last 100 entries). Those were retrieved manually.

For example -

p-cymenene
trans-gamma-bisabolene
trans-geraniol
trans-linalool oxide (furanoid)
trans-muurola-3,5-diene
trans-pinocamphone
trans-sabinol

Many of the WIKIDATA lookups matched on a compound synonym rather than on the compound name itself.

For example -


calarene |  2  |  beta-Gurjunene  |  Q27154913  |  28481  |  beta-gurjunene

bisabolol  | 2 |  levomenol  | Q179896  |  1549992  |  bisabolol

Some of the WIKIDATA lookups returned specific isomers.

For example -


beta-chamigrene  | 2 |  (-)-beta-chamigrene  | Q27108622  |  442353  |  (-)-beta-chamigrene

beta-citronellol  | 2 |  (+/-)-.beta.-citronellol  |  Q27122080  |  8842  | citronellol

(-)-caryophyllene oxide  | 2 |  .beta.-caryophyllene oxide  | Q27136294  | 1742210  | caryophyllene oxide

selina-3,7(11)-diene  | 3 |  .alpha.-selinene  | Q7448480  | 10726905  | 7-epi-alpha-selinene
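
Rows like the ones quoted in this comment can be checked mechanically for missing identifiers. The sketch below assumes a column order inferred from the examples (compound, count, Wikidata label, Wikidata ID, PubChem CID, PubChem name); the sheet itself is a TSV and its real header is not shown in the thread, so `COLUMNS`, `parse_row` and `missing` are all hypothetical.

```python
# Assumed column order, inferred from the example rows in the comment.
COLUMNS = ("compound", "count", "wikidata_label", "wikidata_id",
           "pubchem_cid", "pubchem_name")


def parse_row(line):
    """Split one pipe-delimited row into a dict keyed by the assumed columns."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    return dict(zip(COLUMNS, cells))


def missing(row):
    """List the databases for which the row has no identifier."""
    checks = (("Wikidata", "wikidata_id"), ("PubChem", "pubchem_cid"))
    return [db for db, key in checks if row[key].lower() == "not found"]


rows = [parse_row(line) for line in (
    "p-cymen-7-ol  | 2 |  not found  |  not found  |  325  |  4-isopropylbenzyl alcohol |",
    "calarene |  2  |  beta-Gurjunene  |  Q27154913  |  28481  |  beta-gurjunene",
    "1,3-di-p-coumaroylglycerol  | 2 |  not found  |  not found  | not found  |  not found |",
)]
for row in rows:
    print(row["compound"], missing(row))
```

Counting the rows for which `missing` is non-empty would reproduce figures such as the 28 and 92 reported above, provided the column assumption holds.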
petermr commented 4 years ago

On Mon, Dec 9, 2019 at 7:37 PM Tiago Lubiana notifications@github.com wrote:

Hello all,

I've been absent from the project for a while, sorry.

Don't worry! great to see you back.

I looked at the table and I can quickly make a script that:

- Looks for PubChem_CIDs in Wikidata via SPARQL and retrieves the associated QIDs
- If a QID does not exist on Wikidata, tags the entry in the WIKIDATA_id column as NOT FOUND.

This only for the already curated PubChem IDS. Would that be useful?

Certainly! The main, challenging problem we face is synonymy. This can be simply

The synonyms serve several purposes.

At present we shouldn't worry too much about the model. The frequency of usage is a useful guide as to whether a broad term actually refers to something more specific. Thus "camphor" ( https://en.wikipedia.org/wiki/Camphor ) does not give an optical rotation, but https://pubchem.ncbi.nlm.nih.gov/compound/2537#section=Other-Experimental-Properties does. I think we can assume that "camphor" maps to "R-camphor" with very high probability.

We have currently extracted about 620 names (each occurring at least twice, to minimise author typos) from the tables in 1000 papers (oil1000). Ambarish has normalised some of the syntax (e.g. Greek letters). To use this dictionary for searching and lookup we use the Wikidata ID as the reference. I will use the Wikidata and PubChem names as additional search terms and then create a search dictionary.

Then we can see which additional terms need resolving against Wikidata.

We also need to start doing this on activities (Manny is working on this) and probably also organisms (both targets and plants).


petermr commented 4 years ago

On Wed, Dec 11, 2019 at 11:50 AM Ambarish Kumar notifications@github.com wrote:

Sir, Please go through the copy of compound_multiset.tsv https://github.com/petermr/CEVOpen/blob/master/searches/oil1000/__tables/compound_multisetCopy.tsv

28 entries are not retrieved from PubChem.

or Wikidata

We will need to generate InChI using OPSIN. There needs to be an InChI field.

For example -

1,3-di-p-coumaroylglycerol | 2 | not found | not found | not found | not found |

Use opsin.ch.cam.ac.uk to translate this to InChI. StdInChIKey: KBVRKOFXCCIDFX-YDWXAUTNSA-N

Then search using this and find: https://pubchem.ncbi.nlm.nih.gov/compound/14034127
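
The name to InChIKey to PubChem route described here could be automated roughly as follows. This is a sketch under assumptions: the OPSIN URL pattern and the `stdinchikey` field in its JSON reply should be checked against the OPSIN web service documentation, the PubChem side uses PUG REST, and all function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def opsin_url(name):
    """Assumed OPSIN web-service URL for a systematic name, JSON output."""
    return f"https://opsin.ch.cam.ac.uk/opsin/{quote(name, safe='')}.json"


def pubchem_inchikey_url(inchikey):
    """PUG REST URL that resolves an InChIKey to PubChem CIDs."""
    return ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/"
            f"inchikey/{inchikey}/cids/JSON")


def name_to_cids(name):
    """Chain the two services (needs network access)."""
    with urlopen(opsin_url(name), timeout=30) as response:
        inchikey = json.load(response)["stdinchikey"]  # assumed field name
    with urlopen(pubchem_inchikey_url(inchikey), timeout=30) as response:
        return json.load(response)["IdentifierList"]["CID"]
```

This only helps for names OPSIN can parse, that is systematic names; trivial names such as khusinol would still need a synonym lookup.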

p-coumaric acid benzyl ester | 2 | (e)-p-coumaric acid | Q99374 | not found | not found

This is not a correct match: the row above matches p-coumaric acid, not its benzyl ester. The InChIKey is RGZZCZQQPNJCPO-DHZHZOJOSA-N => PubChem https://pubchem.ncbi.nlm.nih.gov/compound/E_-Benzyl-3-_4-hydroxyphenyl_acrylate PubChem CID: 10083644

92 entries are not retrieved from WIKIDATA.

For example -

p-cymen-7-ol | 2 | not found | not found | 325 | 4-isopropylbenzyl alcohol |

Agreed - needs adding

kaurene | 2 | not found | not found | 520687 | kaurene

Agreed - we need to add it.

khusinol | 2 | not found | not found | 91746535 | khusinol

In batch retrieval of compound CIDs from PubChem identifier exchange services https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi around 100 compound CIDs were left unretrieved (100 last most of entries). Those were retrieved manually.

So these are in neither PubChem nor Wikidata

For example -

p-cymenene

I find in PubChem: 1195-32-0; Dehydro-P-Cymene; 1-Methyl-4-(Prop-1-En-2-Yl)Benzene; 2-P-Tolylpropene; 4-Methylisopropenylbenzene; 4-Isopropenyltoluene; 2-(P-Methylphenyl)Propene; P-Cymenene; ... Compound CID: 62385 (https://pubchem.ncbi.nlm.nih.gov/compound/62385), MF: C10H12, MW: 132.2 g/mol

trans-gamma-bisabolene

I find in PubChem: trans-.Gamma.-Bisabolene; (E)-.Gamma.-Bisabolene. Substance SID: 10544203 (https://pubchem.ncbi.nlm.nih.gov/substance/10544203), Compound CID: 6428434 (https://pubchem.ncbi.nlm.nih.gov/compound/6428434). Data source: NIST Chemistry WebBook.

I am sure the others resolve


lubianat commented 4 years ago

Hello,

I have made a pull request (https://github.com/petermr/CEVOpen/pull/69) with the Wikidata matches for PubChemIDs in the compound_multisetCopy.tsv table.

I added 20 new QIDs.

Two notes:

1 - Some PubChem CIDs are duplicated in the table, e.g. cis-calamenene (6429077) and calamenene (also 6429077); trans-calamenene is 6429022. I have not changed anything regarding the duplications. They are a consequence of synonymy, as said above.

2 - I was going to auto add the missing compounds to Wikidata, but I did not want to be too hasty. There is a nice Wikiproject focused on chemical IDs (https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/ChemID), I am planning to go through their docs as soon as possible.

The code is also added to a folder. As this is my first contribution in a while, I do not know, is there something I should have done differently?
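
The duplicated CIDs mentioned in the first note can be listed by grouping names on their CID. This is a minimal sketch, not the code in the pull request; the function name and the (name, CID) input shape are assumptions, and the data are the three calamenene rows quoted above.

```python
from collections import defaultdict


def names_by_cid(rows):
    """Group compound names by PubChem CID and keep only CIDs used more than once."""
    groups = defaultdict(list)
    for name, cid in rows:
        groups[cid].append(name)
    return {cid: names for cid, names in groups.items() if len(names) > 1}


rows = [
    ("cis-calamenene", 6429077),
    ("calamenene", 6429077),
    ("trans-calamenene", 6429022),
]
print(names_by_cid(rows))  # {6429077: ['cis-calamenene', 'calamenene']}
```

Each group is a candidate synonym set, so the output could feed the synonym entries discussed earlier in the thread rather than being treated as an error.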

petermr commented 4 years ago

Suggest you copy egon.willighagen@gmail.com - Egon will give good advice
