petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Normalization of Chemical Compound Names #66

Closed EmanuelFaria closed 4 years ago

EmanuelFaria commented 4 years ago

The compound names individually typed by article authors into the thousands of tables we pull will contain variations and typos that need to be normalized. (Some examples to be considered and normalized immediately are listed below.)

@petermr please define rules for such normalizations, and the method of automation of future replacements and corrections.

Examples to be decided or corrected now, which may also indicate other manual-entry normalization targets. Looking at compound_multiset.tsv, I found several things that need a decision regarding normalization:

  1. See column 3 (WIKIDATA_query) for lines 4 (Limonene) and 5 (linolool): do we normalize to +/- or ±? (Out of curiosity, do we prefer what we expect users to type into a search field, or what takes fewer characters? Or is ± a "special character" that may not display or search properly on some machines or systems?)

  2. Line 7, column 3 = γ-terpinene (should be gamma, I think). Question: if this is not a typo but a potentially recurring problem caused by the use of a "special character", do we add it as a sort of false synonym?

  3. Also look at line 33, to replace γ with gamma in (+)-γ-cadinene.

  4. Line 22: (-)-α-phellandrene. I believe we're spelling out Greek characters, such as alpha. There are many more to find and replace.

  5. At the end of each pull, automatically find and replace "-(space)" [lines 402, 552, 353] and "(space)-" [none found in this set] with "-" (no space).
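To make the points above concrete, here is a minimal Python sketch of what an automated clean-up pass could look like. It is not an agreed rule set: the function name `normalize_compound_name`, the `GREEK_TO_NAME` table, and the choice of "+/-" over "±" as the canonical form are all assumptions for illustration, pending a decision on the questions above.

```python
import re

# Hypothetical mapping; extend it as more Greek letters turn up in the tables.
GREEK_TO_NAME = {
    "α": "alpha",
    "β": "beta",
    "γ": "gamma",
    "δ": "delta",
}


def normalize_compound_name(name: str) -> str:
    """Apply the candidate normalizations discussed in this issue to one name."""
    name = name.strip()

    # Points 2-4: spell out Greek characters.
    for greek, spelled in GREEK_TO_NAME.items():
        name = name.replace(greek, spelled)

    # Point 1: pick one racemate marker. Assumption: "+/-" is canonical;
    # swap the arguments if "±" is preferred instead.
    name = name.replace("±", "+/-")

    # Point 5: remove whitespace on either side of a hyphen.
    name = re.sub(r"\s*-\s*", "-", name)

    return name


if __name__ == "__main__":
    for raw in ["γ-terpinene", "(+)-γ-cadinene", "(-)-α-phellandrene", "cis- ocimene"]:
        print(raw, "->", normalize_compound_name(raw))
```

If a function like this were run at the end of each pull, the original spelling could be kept in a separate column so that the "false synonym" question in point 2 stays open.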

petermr commented 4 years ago

You should put more details into Issues. You have assigned me a task but I have no idea what to do. I also expect that this issue could be included in some of the open issues.

EmanuelFaria commented 4 years ago

> You should put more details into Issues. You have assigned me a task but I have no idea what to do. I also expect that this issue could be included in some of the open issues.

@petermr Done. Please re-open issue if corrected to your satisfaction.