gilienv / EssOilDB

Restructuring of Essential Oil Database
Apache License 2.0

Understanding and documenting tables in EssoilDB1.0 #83

Open petermr opened 4 years ago

petermr commented 4 years ago

We need to understand what the E1.0 data is and how it is being transferred to E2.0.

I will try to analyze E1.0 and ask questions where I am unclear. Anyone else can ask as well.

petermr commented 4 years ago

As far as I can see E1.0 consists of exactly 2 tables (the names have lexical variants)

sample tables

I have created two sample tables with 100 rows each for viewing on GitHub, where the # separators are replaced by tabs. GitHub does a good job of tabulating as long as there are not too many rows.

We should take these as exemplars for our discussion.

original and derived data

We should distinguish between data that can only be found in the original paper and data derived from it. Thus compound is probably copied from the article, while activity can be looked up once the compound name is known. Derived data is useful for searching but should not be included in the original tables, because it creates a maintenance problem.

petermr commented 4 years ago

compound table (infoc*, etc.)

See https://github.com/gilienv/EssOilDB/blob/master/tables/info_c100.tsv for an example. The table info_c.csv has 142703 records and is highly denormalised. Each record appears to be a single field, microparsable with # separators. When these are removed we get the following columns (I have added the names, which we should agree on).

    key
    compound
    cas
    percent
    plantpart
    analytical
    formula
    chemclass
    activity
    Planttype
    Normal??
    systemname

key

A semantic key, microparsable. It is the only information linking the two tables. Its form is defined in infop table.

action

keep and check normalization (appears to be case-insensitive).

compound

A compound name. This seems to be the primary means of identifying the compound. It is usually a trivial name, requiring lookup, though OPSIN knows quite a few. If it cannot be looked up then the compound is unknown to the system.

action

keep and resolve compound by lookup.

cas

Chemical Abstracts Service (CAS) identifier. There is no record of whether this is original, hence confirmatory, or derived.

action

keep and resolve compound by lookup. May require manual search.
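Before any lookup, the cas values can be screened cheaply, because a CAS Registry Number ends in a check digit. The sketch below validates format and check digit only. It says nothing about whether the number belongs to the named compound, or whether it is original or derived. The function name `is_valid_cas` is hypothetical.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


print(is_valid_cas("7732-18-5"))  # water
print(is_valid_cas("138-86-4"))   # deliberately wrong check digit
```

Values that fail this test are candidates for the manual search mentioned above.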

percent

The measured amount of the compound as a percentage of the total. The basis is unclear (mass? moles? AUC?).

action

keep. normalize syntax, including ranges.
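As a starting point for normalizing the syntax, here is a sketch that maps a raw percent string to a (low, high) pair. The accepted forms (a single number, a range written with "-" or "to", an optional trailing "%") are assumptions, since the variants actually present in E1.0 have not been surveyed yet. Anything else returns None so that it can be reviewed by hand. The name `parse_percent` is hypothetical.

```python
import re
from typing import Optional, Tuple

NUMBER = r"\d+(?:\.\d+)?"
RANGE_PATTERN = re.compile(rf"^({NUMBER})\s*(?:-|to)\s*({NUMBER})$")
SINGLE_PATTERN = re.compile(rf"^({NUMBER})$")


def parse_percent(raw: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) for a percent string, or None if it is not recognised."""
    text = raw.strip().lower().rstrip("%").strip()
    match = RANGE_PATTERN.match(text)
    if match:
        return (float(match.group(1)), float(match.group(2)))
    match = SINGLE_PATTERN.match(text)
    if match:
        value = float(match.group(1))
        # A single value is stored as a degenerate range.
        return (value, value)
    return None


for raw in ["12.5", "0.5-1.2%", "3 to 4", "tr"]:
    print(repr(raw), parse_percent(raw))
```

Storing single values as a degenerate range keeps one column type for both cases. Entries such as "tr" (trace) would need their own convention, which is left open here.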

plantpart

The plant part. May be 1..* (not sure of separators). Enumerated ("leaf", etc.). Incorporated in "key".

action

keep and normalize into table.

analytical

Freeform method of analysis.

action

keep and normalize into table where possible.

formula

constituent chemical formula. Unknown whether original or derived.

action

keep if original, else transfer to search.

chemclass

class of compound. Derived.

action

transfer to search.

activity

activity of compound. Almost certainly derived.

action

transfer to search.

Planttype

enumeration: plant|weed.

action

discard as context dependent.

Normal??

Unknown.

action

Probably discard

systemname

Systematic chemical name. Origin unknown. If original, valuable as a check. If derived, of no value.

action

reserve judgment
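The per-column actions above can be collected into one mapping, so that the transfer to E2.0 is driven by a single agreed list. This is only a restatement of the proposals in this comment. The category labels and the helper `columns_with` are names invented for the sketch.

```python
ORIGINAL, SEARCH, DISCARD, UNDECIDED = "original", "search", "discard", "undecided"

# Proposed action per E1.0 column, as listed in this comment.
COLUMN_ACTIONS = {
    "key": ORIGINAL,
    "compound": ORIGINAL,
    "cas": ORIGINAL,
    "percent": ORIGINAL,
    "plantpart": ORIGINAL,
    "analytical": ORIGINAL,
    "formula": UNDECIDED,     # keep if original, else transfer to search
    "chemclass": SEARCH,
    "activity": SEARCH,
    "Planttype": DISCARD,
    "Normal??": DISCARD,      # "probably discard": meaning still unknown
    "systemname": UNDECIDED,  # reserve judgment
}


def columns_with(action: str) -> list:
    """List the columns assigned to the given action, in table order."""
    return [column for column, a in COLUMN_ACTIONS.items() if a == action]


print(columns_with(SEARCH))
print(columns_with(UNDECIDED))
```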

petermr commented 4 years ago

Future compound tables

There should be:

compound table

Table of chemical identities, names, formulae and identifiers but no properties. Likely columns:

chemId

e.g. C1234; must be unique

name

original chemical name

name_clean

syntactically cleaned name

name_lookup_pubchem

compound looked up in pubchem

name_lookup_chebi

compound looked up in ChEBI

name_lookup_cas

compound looked up in CAS, etc. (I will add more)

cas

original CAS no

formula

original formula

systematic_name

original systematic name

systematic_opsin

InChI generated by OPSIN

errors

errors from lookup tools.

There will be more columns during this resolution period. Later some will be dropped (or a smaller compound table generated).
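For discussion, the likely columns could be written down as a schema. The sketch below uses an in-memory SQLite database. The choice of SQLite, the TEXT types, and the table name `compound` are assumptions and not a decision about what E2.0 will use. The inserted row is made up.

```python
import sqlite3

SCHEMA = """
CREATE TABLE compound (
    chemId              TEXT PRIMARY KEY,  -- e.g. C1234; uniqueness enforced
    name                TEXT,              -- original chemical name
    name_clean          TEXT,              -- syntactically cleaned name
    name_lookup_pubchem TEXT,              -- result of PubChem lookup
    name_lookup_chebi   TEXT,              -- result of ChEBI lookup
    name_lookup_cas     TEXT,              -- result of CAS lookup
    cas                 TEXT,              -- original CAS number
    formula             TEXT,              -- original formula
    systematic_name     TEXT,              -- original systematic name
    systematic_opsin    TEXT,              -- InChI generated by OPSIN
    errors              TEXT               -- errors from lookup tools
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Made-up row, for illustration only.
insert = "INSERT INTO compound (chemId, name) VALUES (?, ?)"
conn.execute(insert, ("C1234", "limonene"))

# A second row with the same chemId is rejected by the primary key.
try:
    conn.execute(insert, ("C1234", "another name"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

Leaving every column except chemId nullable matches the resolution period described above, when many lookups will still be empty.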