petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Dictionary Validation — Beyond Synonyms: Hyphens and Suffixes and Layman Terms, oh my! #87

Open EmanuelFaria opened 3 years ago

EmanuelFaria commented 3 years ago

@petermr et al.... In thinking about validating dictionaries, a bunch of questions popped into my head (right before bed, as usual):

1. How will AMI handle terms that are found just as often with hyphens as without? Currently, our dictionaries include hyphenated and non-hyphenated terms as separate entries/rows. But now that we're talking about collecting synonyms in a new "field" in the same record, how do we treat these preferences/aberrations?

I imagine we could decide to "go with hyphens" as the default and hard-code AMI to handle replacements automatically by treating each occurrence having a hyphen:

This could work, but... a) we'd have to be pretty confident we pasted in the default hyphen everywhere there could/should be one, and b) we'd also need to ensure that all "hyphen-having chemical compounds" are treated accordingly... er, respectfully... like the lady or gentleman molecules worthy of respect they no doubt are.

As a side-note, I believe EUPMC's browser search treats quoted+hyphenated "multi-word terms" (kind of like that one right there) the same as their spaced equivalents. That is, within quotes, it treats words separated by a hyphen the same as those separated by a space, but it treats terms joined with no space or hyphen as different terms altogether. So the question is: W.W.A.D.? (What Would AMI Do?)
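To make the hyphen question concrete, here is a minimal Python sketch of the matching rule described above: hyphens and spaces count as the same separator, while the joined form stays a separate term. This is not how AMI works today; `normalize_term` and `same_term` are hypothetical names made up for illustration.

```python
import re

# Hypothetical helper, NOT part of AMI: treat runs of whitespace and
# hyphens (ASCII hyphen plus U+2010 and U+2011) as a single separator.
_SEPARATORS = re.compile(r"[\s\-\u2010\u2011]+")


def normalize_term(term: str) -> str:
    """Collapse separators to one space and lowercase the term."""
    return _SEPARATORS.sub(" ", term.strip()).lower()


def same_term(a: str, b: str) -> bool:
    """True when two terms differ only in hyphen/space/case usage."""
    return normalize_term(a) == normalize_term(b)


if __name__ == "__main__":
    print(same_term("alpha-pinene", "alpha pinene"))  # hyphen vs. space
    print(same_term("alpha-pinene", "alphapinene"))   # joined form
```

One caveat with a blanket rule like this: in chemical names the hyphen can carry meaning, so those entries might need to be excluded or handled separately.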

2. Suffixes as synonyms? Should we account for all possible word endings? What about plural versions? Will the addition of an "s" or "es" at the end of a term affect the results? (@petermr, if you want to code AMI to handle affixes automatically, I have a clean list of them ready to go for you! And, oh boy, it would feel great to know that the time I spent collecting, cleaning, and organizing them wasn't "wasted" on learning something new again). 🤓
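For the plural case, one option is to leave the dictionary entries alone and let the matcher accept an optional suffix. A rough Python sketch, assuming only "s" and "es" matter; `PLURAL_SUFFIXES`, `term_pattern` and `count_matches` are illustrative names, not AMI functions, and a real suffix list would come from the curated affix list.

```python
import re

# Assumed suffix list for illustration only. Longer suffixes are listed
# first so that "es" is tried before "s".
PLURAL_SUFFIXES = ("es", "s")


def term_pattern(term: str) -> re.Pattern:
    """Compile a whole-word pattern for the term plus an optional suffix."""
    suffixes = "|".join(re.escape(s) for s in PLURAL_SUFFIXES)
    return re.compile(rf"\b{re.escape(term)}(?:{suffixes})?\b", re.IGNORECASE)


def count_matches(term: str, text: str) -> int:
    """Count occurrences of the term, with or without a plural suffix."""
    return len(term_pattern(term).findall(text))


if __name__ == "__main__":
    text = "Terpenes and a terpene; also a terpenoid."
    print(count_matches("terpene", text))
```

Suffix matching like this only covers regular plurals: "leaf" would not match "leaves", so irregular forms would still need their own entries or synonyms.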

3. Layman's Terms/Names... synonyms or not? Using plant names as an example, will we be treating plant common names (Cedar Leaf Oil) as synonyms for their botanical names (Thuja occidentalis)? If not, we'll also need to consider that some common names (fruits, for example) will be different among countries or regions ... and then there's the whole "many fruits vs. single fruits" issue ... and hyphenated fruits too, I suppose... 🤔🤯
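If the answer turns out to be "yes, treat them as synonyms", a record could look something like the Python sketch below. The `term` and `synonyms` field names and the `build_index` and `resolve` helpers are assumptions for illustration, not the current dictionary format.

```python
# Hypothetical record layout: one entry per botanical name, with common
# names collected in a "synonyms" field. Field names are assumptions.
DICTIONARY = [
    {"term": "Thuja occidentalis", "synonyms": ["cedar leaf oil"]},
]


def build_index(entries):
    """Map every name (botanical or common, lowercased) to its main term."""
    index = {}
    for entry in entries:
        for name in [entry["term"], *entry["synonyms"]]:
            index[name.lower()] = entry["term"]
    return index


def resolve(name, index):
    """Return the main term for a name, or None if it is unknown."""
    return index.get(name.lower())


if __name__ == "__main__":
    index = build_index(DICTIONARY)
    print(resolve("Cedar Leaf Oil", index))
```

Regional variants would just be more strings in the same `synonyms` list, although a common name that points to different plants in different regions would overwrite earlier entries in a flat index like this one.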