gilienv / EssOilDB

Restructuring of Essential Oil Database
Apache License 2.0

Overview of `info_compound` #86

Open petermr opened 4 years ago

petermr commented 4 years ago

I do not fully understand info_c - maybe only Gita can help. My copy (in the repo) has a number of rows with repeated keys, such as:

JEAabsp1999Lea  (e)-2-hexenal   6728-26-3   0.1 5   leaf    GC, GC/MS   C6H10O  aldehyde    NA  Plant   Chemotype A (e)-hex-2-enal,
JEAabsp1999Lea  (e)-2-hexenal   6728-26-3   <0.03   leaf    GC, GC/MS   C6H10O  aldehyde    NA  Plant   Chemotype B (e)-hex-2-enal,
JEAabsp1999Lea  (e)-beta-ocimene    3779-61-1   <0.03   leaf    GC, GC/MS   C10H16  monoterpene insect attractant   Plant   Chemotype A (3e)-3,7-dimethylocta-1,3,6-triene,
JEAabsp1999Lea  (e)-beta-ocimene    3779-61-1   0.04    leaf    GC, GC/MS   C10H16  monoterpene insect attractant   Plant   Chemotype B (3e)-3,7-dimethylocta-1,3,6-triene,
JEAabsp1999Lea  (e)-nerolidol   40716-66-3  <0.03   leaf    GC, GC/MS   C15H26O sesquiterpenol  skin penetration enhancer   Plant   Chemotype B (3s,6e)-3,7,11-trimethyldodeca-1,6,10-trien-3-ol,
JEAabsp1999Lea  (e)-nerolidol   40716-66-3  0.06    leaf    GC, GC/MS   C15H26O sesquiterpenol  skin penetration enhancer   Plant   Chemotype A (3s,6e)-3,7,11-trimethyldodeca-1,6,10-trien-3-ol,

Here the rows differ ONLY by the concentration, i.e. the same compound has more than one value in the profile.

I don't know how the data was collected, but this suggests that two profiles for the same key (jrnl-plant-loc-year-part) have been entered independently. There may be another variable that is not recorded in the table (maybe time/date?). Wherever it comes from, it makes it impossible to describe the profile accurately.

I have no idea how common this is or whether it is cost-effective to try to "correct" it.

gilienv commented 4 years ago

Thanks Peter - these are two different profiles. In each pair above, one row is from Chemotype A and the other is from Chemotype B of the same plant.

This distinction is recorded in one of the columns - see the second-to-last one.

The confusion arises because you see the same profile code (Col 1). The profile code is currently generated by software - a small algorithm. Often the same code denotes two or more profiles, but the data is NOT redundant, since there are differences in other fields such as chemotype, condition, or something else.
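To illustrate the point, here is a minimal Python sketch using the six rows quoted above. The field names (`code`, `compound`, `concentration`, `chemotype`) and the helper `count_collisions` are made up for this example and are not the actual column names or code of the database; the concentration values are copied as they appear in the paste. The sketch counts how many keys map to more than one row, first with the profile code and compound alone, then with the chemotype added.

```python
from collections import Counter

# Sample rows from the issue, reduced to four illustrative fields.
# Field names are assumptions, not the real info_compound headers.
rows = [
    {"code": "JEAabsp1999Lea", "compound": "(e)-2-hexenal", "concentration": "0.1 5", "chemotype": "Chemotype A"},
    {"code": "JEAabsp1999Lea", "compound": "(e)-2-hexenal", "concentration": "<0.03", "chemotype": "Chemotype B"},
    {"code": "JEAabsp1999Lea", "compound": "(e)-beta-ocimene", "concentration": "<0.03", "chemotype": "Chemotype A"},
    {"code": "JEAabsp1999Lea", "compound": "(e)-beta-ocimene", "concentration": "0.04", "chemotype": "Chemotype B"},
    {"code": "JEAabsp1999Lea", "compound": "(e)-nerolidol", "concentration": "<0.03", "chemotype": "Chemotype B"},
    {"code": "JEAabsp1999Lea", "compound": "(e)-nerolidol", "concentration": "0.06", "chemotype": "Chemotype A"},
]


def count_collisions(rows, key_fields):
    """Return the keys (tuples of field values) shared by more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Profile code + compound alone: every compound appears twice.
by_code = count_collisions(rows, ("code", "compound"))

# Adding the chemotype separates the two profiles.
by_code_and_chemotype = count_collisions(rows, ("code", "compound", "chemotype"))

print(len(by_code), len(by_code_and_chemotype))
```

With the chemotype included in the key, no row in this sample collides with another, which is why the data is not redundant even though the profile code repeats.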

V1.0 of EssoilDB never searched for profiles, hence it did not care about this.

In V2.0 we are trying to be very careful about assigning IDs to each profile.

  1. There are not many cases like this
  2. This can be sorted - no need to remove/overlook/delete data

Hope this helps

petermr commented 4 years ago

Thanks. My oversight, sorry. Good to see the individual IDs being developed for profiles. I think there will be issues when many profiles are reported in a single table, and I'll flag these when I see them.

petermr commented 4 years ago

For @ambarishK: have a very good look at what @Shruthi-M has done for plants - it's a good model. You need to have a similar table for compounds:

Currently I suggest:

I think these are the only columns which identify the chemical substance.

Suggest 3 more columns:

If columns 2, 3, and 4 agree then the record is marked as accepted.

If they don't, we will have to discuss and agree on the resolution.
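A minimal sketch of that agreement rule, in Python. Which identifiers sit in columns 2, 3, and 4 is not spelled out here, so the function below simply takes three string values; the name `resolve_record`, the status labels, and the whitespace/case normalisation are assumptions for illustration only. A missing value is treated as disagreement, which is also an assumption.

```python
def resolve_record(col2, col3, col4):
    """Return "accepted" if all three values are present and agree, else "discuss".

    Values are compared after stripping whitespace and lower-casing
    (an assumed normalisation, not an agreed rule).
    """
    values = [col2, col3, col4]
    # A missing or empty value cannot confirm agreement.
    if any(v is None or not v.strip() for v in values):
        return "discuss"
    normalised = {v.strip().lower() for v in values}
    return "accepted" if len(normalised) == 1 else "discuss"


# Hypothetical usage with a CAS number taken from the sample rows above.
print(resolve_record("6728-26-3", "6728-26-3", " 6728-26-3 "))
print(resolve_record("6728-26-3", "6728-26-3", "3779-61-1"))
```

Records that come back as "discuss" would be the ones to go through by hand.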

We will find synonyms, but they will be harder than plants.

We can discuss these tomorrow.