Open petermr opened 4 years ago
As far as I can see, E1.0 consists of exactly 2 tables (the names have lexical variants):

- `info_c*`: compounds
- `info_p*`: plants

I have created two sample tables with 100 rows for viewing on GitHub, where the `#` separators are replaced by tabs. GitHub does a good job of tabulating as long as there are not too many rows.
We should take these as exemplars for our discussion.
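
For anyone who wants to regenerate such a sample, here is a minimal sketch of the conversion. It assumes one record per line and that `#` never occurs inside a field value, neither of which I have checked against the full data. The function name `hash_to_tsv` and the demo strings are made up for illustration.

```python
import csv
import io
from itertools import islice


def hash_to_tsv(src, dst, limit=100, sep="#"):
    """Copy up to `limit` records from `src` to `dst`, replacing `sep` with tabs.

    Assumption: one record per line, and `sep` never appears inside a value.
    Returns the number of records written.
    """
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    count = 0
    for line in islice(src, limit):
        writer.writerow(line.rstrip("\r\n").split(sep))
        count += 1
    return count


# Tiny in-memory demo with invented values (not taken from EssOilDB).
demo_in = io.StringIO("a#b#c\nd#e#f\ng#h#i\n")
demo_out = io.StringIO()
written = hash_to_tsv(demo_in, demo_out, limit=2)
```

With real files, `src` and `dst` would be handles opened with `newline=""`, for example the full `info_c.csv` as input and a 100-row `.tsv` as output.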
Distinguish between data that can only be found in the original paper and data derived from it. Thus `compound` is probably copied from the article, while `activity` can be looked up by knowing the compound name. Derived data is useful for searching but should not be included in the original tables, as it creates a maintenance problem.
See https://github.com/gilienv/EssOilDB/blob/master/tables/info_c100.tsv for an example. The table `info_c.csv` has 142703 records and is highly denormalised. Each record is a single string, microparsable on `#` separators. When these are removed we get these columns (I have added the names, which we should agree on).

| Column | Description | Proposed action |
|---|---|---|
| `key` | A semantic key, microparsable. It is the only information linking the two tables. Its form is defined in the `info_p` table. | Keep and check normalization (appears to be case-insensitive). |
| `compound` | A compound name. This seems to be the primary means of identifying the compound. It is usually a trivial name, requiring lookup, though OPSIN knows quite a few. If it cannot be looked up then the compound is unknown to the system. | Keep and resolve compound by lookup. |
| `cas` | Chemical Abstracts identifier. No record of whether this is original (hence confirmatory) or derived. | Keep and resolve compound by lookup. May require manual search. |
| `percent` | The measured amount of compound as percent of total (?mass, ?moles, ?AUC). | Keep. Normalize syntax, including ranges. |
| `plantpart` | The plant part. May be 1..* (not sure of separators). Enumerated ("leaf", etc.). Incorporated in `key`. | Keep and normalize into table. |
| `analytical` | Freeform method of analysis. | Keep and normalize into table where possible. |
| `formula` | Constituent chemical formula. Unknown whether original or derived. | Keep if original, else transfer to search. |
| `chemclass` | Class of compound. Derived. | Transfer to search. |
| `activity` | Activity of compound. Almost certainly derived. | Transfer to search. |
| `Planttype` | Enumeration: plant\|weed. | Discard as context dependent. |
| `Normal??` | Unknown. | Probably discard. |
| `systemname` | Systematic chemical name. Origin unknown. If original, valuable as a check. If derived, of no value. | Reserve judgment. |
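
To make the proposal concrete, here is a rough sketch of microparsing one record and routing each column by its proposed action. It assumes the fields appear in the file in the order listed above and that every record has exactly 12 fields, and I have not verified either. The range syntax handled by `parse_percent` (`low-high`, optional `%`) is a guess, and the sample record is invented for illustration, not copied from `info_c.csv`.

```python
import re

# Column names as proposed in this issue; file order is ASSUMED to match.
COLUMNS = [
    "key", "compound", "cas", "percent", "plantpart", "analytical",
    "formula", "chemclass", "activity", "Planttype", "Normal??", "systemname",
]

# Proposed action per column; "undecided" covers "keep if original" and
# "reserve judgment".
ACTION = {
    "key": "keep", "compound": "keep", "cas": "keep", "percent": "keep",
    "plantpart": "keep", "analytical": "keep",
    "formula": "undecided", "systemname": "undecided",
    "chemclass": "search", "activity": "search",
    "Planttype": "discard", "Normal??": "discard",
}


def parse_record(record, sep="#"):
    """Split one '#'-separated record into a dict keyed by column name."""
    fields = record.rstrip("\r\n").split(sep)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, (f.strip() for f in fields)))


def partition(row):
    """Group a parsed row's columns by proposed action."""
    groups = {"keep": {}, "search": {}, "discard": {}, "undecided": {}}
    for name, value in row.items():
        groups[ACTION[name]][name] = value
    return groups


def normalize_key(key):
    """Case-fold the key, since matching appears to be case-insensitive."""
    return key.strip().casefold()


# ASSUMED syntax: a number, or "low-high", with an optional trailing "%".
_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-\s*(\d+(?:\.\d+)?))?\s*%?")


def parse_percent(text):
    """Return (low, high) as floats, or None if the text does not match."""
    match = _PERCENT.fullmatch(text.strip())
    if match is None:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    return (low, high)


# Invented sample record (illustrative values only).
SAMPLE = ("demo_key#limonene#138-86-3#12.5#leaf#GC-MS#C10H16"
          "#monoterpene#antimicrobial#plant##")
row = parse_record(SAMPLE)
groups = partition(row)
```

Values that `parse_percent` cannot read come back as `None` rather than raising, so they can be collected and reviewed by hand while we agree on the syntax.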
There should be a table of chemical identities, names, formulae and identifiers, but no properties. Likely columns:

- identifier, e.g. `C1234` (require uniqueness)
- original chemical name
- syntactically cleaned name
- compound looked up in PubChem
- compound looked up in ChEBI
- compound looked up in CAS, etc. (I will add more)
- original CAS no.
- original formula
- original systematic name
- InChI generated by OPSIN
- errors from lookup tools
There will be more columns during this resolution period. Later some will be dropped (or a smaller compound table generated).
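
As a starting point for discussion, the compound table could be sketched like this. All field and class names (`CompoundEntry`, `CompoundTable`, `compound_id`, and so on) are placeholders I have made up from the column descriptions above, and nothing here performs any actual lookup. The example values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompoundEntry:
    """One row of the proposed compound table (hypothetical field names)."""
    compound_id: str                      # e.g. "C1234", must be unique
    original_name: str                    # name as given in the source
    cleaned_name: Optional[str] = None    # syntactically cleaned name
    pubchem: Optional[str] = None         # result of PubChem lookup
    chebi: Optional[str] = None           # result of ChEBI lookup
    cas_lookup: Optional[str] = None      # result of CAS lookup
    original_cas: Optional[str] = None
    original_formula: Optional[str] = None
    original_systematic_name: Optional[str] = None
    opsin_inchi: Optional[str] = None     # InChI generated by OPSIN
    lookup_errors: List[str] = field(default_factory=list)


class CompoundTable:
    """In-memory table that enforces uniqueness of compound_id."""

    def __init__(self) -> None:
        self._entries: Dict[str, CompoundEntry] = {}

    def add(self, entry: CompoundEntry) -> None:
        if entry.compound_id in self._entries:
            raise ValueError(f"duplicate compound id: {entry.compound_id}")
        self._entries[entry.compound_id] = entry

    def get(self, compound_id: str) -> CompoundEntry:
        return self._entries[compound_id]

    def __len__(self) -> int:
        return len(self._entries)


# Invented example entry; a failed lookup is recorded, not silently dropped.
table = CompoundTable()
table.add(CompoundEntry(compound_id="C1234", original_name="limonene"))
table.get("C1234").lookup_errors.append("ChEBI: no match (example message)")
```

Keeping lookup results and errors as separate optional fields means columns can be dropped later without touching the original values.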
We need to understand what the E1.0 data is and how it is being transferred to E2.0.
I will try to analyze E1.0 and ask questions where I am unclear. Anyone else can ask as well.