gilienv / EssOilDB

Restructuring of Essential Oil Database

Apache License 2.0

Disambiguation of plant names using GBIF #82

Open Shruthi-M opened 4 years ago

Shruthi-M commented 4 years ago

I submitted the entire set of plant names (before clean-up) to the GBIF species lookup tool (https://www.gbif.org/en/tools/species-lookup), which allows the user to perform multiple searches at once. I have uploaded the results to the repository as gbif_result.csv. The default column headings are as follows:

occurrenceId
verbatimScientificName (user-submitted name)
scientificName (name existing in the database)
key (unique number assigned to the particular species on GBIF)
matchType (3 levels of result - EXACT, FUZZY, HIGHERRANK)
- EXACT means the name exactly matches with the entry in the database
- FUZZY indicates entries that may be mis-spelt
- HIGHERRANK implies that the specific epithet of the entry is not being recognized (in other words, only genus is recognized)
confidence (expressed in terms of percentage)
status (can be ACCEPTED, SYNONYM or DOUBTFUL)
- DOUBTFUL Treated as accepted, but doubtful whether this is correct.
- SYNONYM A general synonym, the exact type is unknown.
rank (the highest rank recognized)
kingdom
phylum
class
order
family
genus
species
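
A first look at a result file with these columns could be a tally of matchType against status. The sketch below is only an illustration: it assumes the file is comma-separated with exactly the headings listed above, and it reads three sample rows (copied from rows quoted in this thread) from a string instead of from gbif_result.csv. The function name `tally` is hypothetical.

```python
import csv
import io
from collections import Counter

# Three illustrative rows in the column order listed above.
# The values are copied from rows quoted in this thread.
SAMPLE = "\n".join([
    "occurrenceId,verbatimScientificName,scientificName,key,matchType,"
    "confidence,status,rank,kingdom,phylum,class,order,family,genus,species",
    ",Abies alba,Abies alba Mill.,2685484,EXACT,99,ACCEPTED,SPECIES,Plantae,"
    "Tracheophyta,Pinopsida,Pinales,Pinaceae,Abies,Abies alba",
    ",Achillea albicaulis,Achillea albicaulis C.A.Mey.,3120384,EXACT,99,SYNONYM,"
    "SPECIES,Plantae,Tracheophyta,Magnoliopsida,Asterales,Asteraceae,Achillea,"
    "Achillea tenuifolia",
    ",Achillea depressa,Achillea L.,3119995,HIGHERRANK,96,ACCEPTED,GENUS,Plantae,"
    "Tracheophyta,Magnoliopsida,Asterales,Asteraceae,Achillea,",
])


def tally(handle):
    """Count rows for each (matchType, status) combination."""
    reader = csv.DictReader(handle)
    return Counter((row["matchType"], row["status"]) for row in reader)


counts = tally(io.StringIO(SAMPLE))
for (match_type, status), n in sorted(counts.items()):
    print(match_type, status, n)
```

For the real file, the same function could be given `open("gbif_result.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object (the encoding is a guess).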

petermr commented 4 years ago

Looks great. Well done for organising columns. Will need something like this for chemistry. Will look in detail when on my laptop

petermr commented 4 years ago

Shruthi, this is excellent. We should preserve this table. Then we will normalize. We should classify the results into the major groups. Initial comments:

occurrenceId verbatimScientificName scientificName key matchType confidence status rank kingdom phylum class order family genus species

occurrenceId // these were all blank, so we can drop this
verbatimScientificName // this is our initial raw data and must be preserved. Let's use GBIF terminology where possible, so keep this column name
scientificName // the preferred name. Can include synonyms. We should not use this if there is a species
key // this is the most important column and gives us all the normalized information we need
matchType // yes, we should keep this because it helps understand non-normalized species
confidence // what's the lowest? I think we can drop this later
status // useful for non-normalized names
rank // useful for non-normalized names
kingdom phylum class order family
genus // probably keep. GBIF seems to map unknown species to genus.
species // the key normalization

Abies alba Abies alba Mill. 2685484 EXACT 99 ACCEPTED SPECIES Plantae Tracheophyta Pinopsida Pinales Pinaceae Abies Abies alba
Acacia nuperrima Acacia nuperrima Baker f. 2980107 EXACT 100 ACCEPTED SPECIES Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Acacia Acacia nuperrima
Acacia nuperrima Acacia nuperrima Baker f. 2980107 EXACT 100 ACCEPTED SPECIES Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Acacia Acacia nuperrima

^^ duplicates. Why? we can get rid of these immediately

Achillea albicaulis Achillea albicaulis C.A.Mey. 3120384 EXACT 99 SYNONYM SPECIES Plantae Tracheophyta Magnoliopsida Asterales Asteraceae Achillea Achillea tenuifolia

This is a single synonym but we should use the species name "Achillea tenuifolia" for future matching, not the scientificName "Achillea albicaulis". As always the key is the critical column.

,"Ocimum sanctum","Ocimum sanctum L.","2927101","EXACT","99","SYNONYM","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"
,"Ocimum tenuiflorum","Ocimum tenuiflorum L.","2927100","EXACT","99","ACCEPTED","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"

These are synonyms but have different keys, so our normalized table should contain only the ACCEPTED entry. Normalization should be on "species".

Let's summarize and make a list of ACCEPTED species

SYNONYMs can be removed if there is an ACCEPTED species. A SYNONYM without an ACCEPTED equivalent should be normalized on the species.

Everything else should be separated out, as we will have to discuss it.
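
The rules in this comment might be sketched roughly as follows. This is an illustration, not the agreed procedure: the function name `normalize` is hypothetical, rows are assumed to be dictionaries keyed by the GBIF column names, and treating anything that is not an EXACT species-level match as "needs discussion" is my own assumption.

```python
# Rows copied from the examples in this comment, reduced to the columns used here.
ROWS = [
    {"verbatimScientificName": "Ocimum sanctum", "key": "2927101",
     "matchType": "EXACT", "status": "SYNONYM", "rank": "SPECIES",
     "species": "Ocimum tenuiflorum"},
    {"verbatimScientificName": "Ocimum tenuiflorum", "key": "2927100",
     "matchType": "EXACT", "status": "ACCEPTED", "rank": "SPECIES",
     "species": "Ocimum tenuiflorum"},
    {"verbatimScientificName": "Achillea albicaulis", "key": "3120384",
     "matchType": "EXACT", "status": "SYNONYM", "rank": "SPECIES",
     "species": "Achillea tenuifolia"},
    {"verbatimScientificName": "Achillea depressa", "key": "3119995",
     "matchType": "HIGHERRANK", "status": "ACCEPTED", "rank": "GENUS",
     "species": ""},
]


def normalize(rows):
    """Group verbatim names under the GBIF 'species' value.

    ACCEPTED and SYNONYM rows both end up keyed on 'species', so a synonym
    collapses onto its accepted name whether or not that accepted name also
    appears in the input. Everything else is set aside for discussion.
    """
    grouped = {}
    review = []
    for row in rows:
        resolvable = (
            row["matchType"] == "EXACT"            # assumption: FUZZY goes to review
            and row["rank"] == "SPECIES"
            and row["status"] in ("ACCEPTED", "SYNONYM")
            and row["species"]
        )
        if resolvable:
            grouped.setdefault(row["species"], set()).add(row["verbatimScientificName"])
        else:
            review.append(row)
    accepted = {species: sorted(names) for species, names in grouped.items()}
    return accepted, review


accepted, review = normalize(ROWS)
for species, names in accepted.items():
    print(species, "<=", names)
print("to discuss:", [r["verbatimScientificName"] for r in review])
```

Note that the GBIF key of the accepted name is not always in the table (the Achillea tenuifolia row carries the key of the synonym), so a later lookup step would still be needed.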

Well done.

petermr commented 4 years ago

We are going to need unique identifiers for all these accepted species, e.g. PL123. We need a column for the EssoilDB plant key.

petermr commented 4 years ago

@Shruthi-M - this is so important for all of us! Everyone should read this thread. I'll annotate @Shruthi-M's table and add actions. We should extract the discussion here onto *.md pages as well. There are fundamental issues which apply to compounds as well, @ambarishK. I hadn't seen this clearly until I started on the poster. At present we are only dealing with converting the existing EssoilDB 1.0 (E1.0) to E2.0 (i.e. not worrying about ingesting new data into either).

= Origin of data =

It's critical to review exactly where the data comes from. After talking with @gilienv yesterday I believe that:

all the independent data are in "infopdata" and "infocdata"
there is probably not much original documentation
there may be extra information in legacy *.xls files,

== ACTION == We have to agree and then document what is in EssoilDB 1.0

Shruthi-M commented 4 years ago

Sir, I am working on your previous guidelines. I will try to separate the synonyms and the accepted names using the clues you have mentioned. A final list of accepted species will be prepared soon (in the coming week).

ISSUES BEING FACED:

Callistemon sp. [pid: 299] - The literature reports 7 varieties of this species and our database has data about only one variety - “Blackdown tableland”.
Kunzea ambigua [pid: 879] - The literature from which this is taken was analyzed. It was found that the data corresponding to this entry relates to “prostate form, B”, and the article reports three more varieties which are not included in the database.
Eryngium sp. nov. [pid: 560] - The literature reports 2 varieties - “1” and “2” under this. According to the EssoilDB 1.0, these 2 varieties are not separated.
Astartea sp. nov. [pid: 179] and Mikania sp. nov. [pid: 1074] - These could not be resolved further.
There are 11 binomial names that are shown to be DOUBTFUL.
Binomials without the authors' names are not accepted by GBIF and other open-source databases. More than half of the binomials have more than one author. I have referred to the journal and chosen the right author wherever I could find a discrepancy. Do the binomials have to be separated from their authors or retained along with them (as this plays a crucial role in a bibliography database)? This was raised earlier and not resolved completely. It would be really kind of you if you could give me more clarity about this.
A final list of accepted species (i.e. not synonyms) has to be prepared.
This list of accepted names has to be rechecked against the respective journal articles to ensure that each name has the right author assigned. NOTE: The assignment of authors was done by the GBIF web program. Hence, this step is necessary.

petermr commented 4 years ago

Thanks so much Shruthi. One feature of data is that there is always a "long tail" (https://en.wikipedia.org/wiki/Long_tail): a few items that can't be easily processed. The most important thing at present is to resolve the largest chunks of names as efficiently as possible. I'll try to highlight a strategy today based on the very useful output from GBIF you created. If there is a species that occurs only once and we can't resolve it, compared with one that occurs 10 times and we can, we prioritise the latter.

petermr commented 4 years ago

I shall add comments on your very useful output (which can be viewed directly in table form on Github): https://github.com/gilienv/EssOilDB/blob/master/tables/plant/gbif_result.tsv

Note that there is no EssoilDB ID for each row so I shall refer to this table in unsorted fashion. This is why we need an ID!

It has 1839 rows (1 header and 1838 data). ACTION does this number agree with other info_plant tables?
GBIF has returned an identifier (column 4) for each row. As far as I can see every row has an identifier, so there is nothing that GBIF cannot interpret in some way. ACTION are there any names elsewhere in V1.0 that GBIF cannot interpret?
There are exactly duplicated rows (e.g. 7 and 8 and more elsewhere). "Remove duplicates" in Excel gives: 129 duplicates: ACTION after adding UniqueIds, remove all duplicates from table.
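
The "add IDs first, then remove duplicates" action could look something like the sketch below. It is only an illustration: the function name `deduplicate` is hypothetical, row numbers stand in for the future EssoilDB IDs, and each row is assumed to be a tuple of cell values.

```python
def deduplicate(rows):
    """Assign a running ID to every row, then drop exact duplicates.

    Returns (kept, dropped): kept is a list of (row_id, row); dropped is a
    list of (duplicate_row_id, row_id_of_first_occurrence) so that nothing
    is lost silently.
    """
    first_seen = {}
    kept = []
    dropped = []
    for row_id, row in enumerate(rows, start=1):
        key = tuple(row)
        if key in first_seen:
            dropped.append((row_id, first_seen[key]))
        else:
            first_seen[key] = row_id
            kept.append((row_id, row))
    return kept, dropped


# Reduced example rows (verbatim name and GBIF key only), taken from this thread.
rows = [
    ("Abies alba", "2685484"),
    ("Acacia nuperrima", "2980107"),
    ("Acacia nuperrima", "2980107"),
]
kept, dropped = deduplicate(rows)
print(len(kept), "kept;", len(dropped), "dropped:", dropped)
```

Keeping the list of dropped IDs means the count can be compared with the 129 duplicates that Excel reported.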

petermr commented 4 years ago

After deduplication here are my comments.

row 2
```
Abies alba  Abies alba Mill.    2685484     EXACT   99  ACCEPTED    SPECIES     Plantae     Tracheophyta    Pinopsida   Pinales     Pinaceae    Abies   Abies alba
```
Input is Abies alba. GBIF found key=2685484 as an EXACT match with 99% confidence as a SPECIES, [taxonomy omitted] and the normative species name as Abies alba. It added the authority as Abies alba Mill., but this is probably more detail than we need. So our result is
```
Abies alba => Abies alba (GBIF 2685484) [SPECIES CONFIRMED]
```
Most of the results are (happily) of this form.

row 6

Acacia caven    Acacia caven (Molina) Molina    2979244     EXACT   99  SYNONYM     SPECIES     Plantae     Tracheophyta    Magnoliopsida   Fabales     Fabaceae    Vachellia   Vachellia caven

Input is Acacia caven, EXACTly identified, but this is a synonym for the preferred name Vachellia caven.

Vachellia caven is not mentioned in V1.0, so we look up Vachellia caven in GBIF to give https://www.gbif.org/species/3795588. We then use Vachellia caven (GBIF 3795588) as the accepted name with Acacia caven (GBIF 2979244) as a synonym. ACTION agree this strategy.
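
If this second lookup has to be done for many synonyms, it could probably be scripted rather than done in the browser. The sketch below is an assumption-laden illustration: I believe GBIF exposes a name-matching endpoint at `https://api.gbif.org/v1/species/match` that takes a `name` parameter, but the endpoint and the fields of its JSON response should be checked against the GBIF API documentation before relying on them. The function names are hypothetical, and only the URL construction is exercised here; `fetch_match` needs network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the GBIF name-matching service; verify before use.
GBIF_MATCH = "https://api.gbif.org/v1/species/match"


def match_url(name):
    """Build the lookup URL for one plant name."""
    return GBIF_MATCH + "?" + urllib.parse.urlencode({"name": name})


def fetch_match(name):
    """Fetch and parse the JSON answer for one name (requires network access)."""
    with urllib.request.urlopen(match_url(name), timeout=30) as response:
        return json.load(response)


print(match_url("Vachellia caven"))
```

The parsed answer would then be inspected for the key and status of the accepted name, which is how Vachellia caven (GBIF 3795588) was found by hand above.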

row 16/17

Achillea beibersteinii     Achillea beibersteinii Afan.    7400456     EXACT   98  DOUBTFUL    SPECIES     Plantae     Tracheophyta    Magnoliopsida   Asterales   Asteraceae  Achillea    Achillea beibersteinii
Achillea biebersteinii  Achillea biebersteinii C.Afan.  3120276     EXACT   98  SYNONYM     SPECIES     Plantae     Tracheophyta    Magnoliopsida   Asterales   Asteraceae  Achillea    Achillea arabica

I am guessing what has happened here is that beibersteinii is a misprint in the general plant literature (hence category DOUBTFUL), but it has got into the official books. So this should be referred to our curator. I would expect that we'd normalize it to Achillea biebersteinii which is a SYNONYM for Achillea arabica (which should be our agreed normative species).

row 25
```
Achillea depressa   Achillea L.     3119995     HIGHERRANK  96  ACCEPTED    GENUS   Plantae     Tracheophyta    Magnoliopsida   Asterales   Asteraceae  Achillea    
```
Here GBIF cannot find an exact match, so it reverts to the genus. This is a loss of information, so maybe we should search elsewhere. "Plants of the World" (Kew) gives (http://plantsoftheworldonline.org/taxon/urn:lsid:ipni.org:names:173942-1):

```
Achillea depressa Janka
This is a synonym of Achillea pseudopectinata Janka
```

So we can probably add the relatively few examples by hand. There are 63 GENUS and 9 KINGDOM rows, of which about 20 are of the form Foobar spp. and so not reconcilable. The other ~40 can be searched by hand and added by the curator.

row 55

Aframomum hanburyl  Aframomum hanburyi K.Schum.     2758831     FUZZY   96  SYNONYM     SPECIES     Plantae     Tracheophyta    Liliopsida  Zingiberales    Zingiberaceae   Aframomum   Aframomum angustifolium

FUZZY means that there is probably a misprint (here hanburyl for hanburyi). In this case the matched name is itself a SYNONYM, so there is a further step to Aframomum angustifolium.

recommendation

create new columns:

original    GBIFAcceptedName  GBIFIdentifier  GBIFSynonyms  curationDetails

The original is preserved
The best accepted name is always given
The identifier for that name is always given
If there are one or more accepted synonyms in V1.0, list them
Log curation details (date, curator, action). Action can be: TYPO, SYNONYM, GENUS
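
As a rough picture of what one row of such a table might hold, here is a minimal sketch. The class and field names are hypothetical renderings of the proposed columns, and the curator name and date in the example are placeholders, not real curation records.

```python
from dataclasses import dataclass, field
from datetime import date

# The three curation actions proposed above.
ACTIONS = {"TYPO", "SYNONYM", "GENUS"}


@dataclass
class CurationDetail:
    when: date
    curator: str
    action: str

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown curation action: {self.action!r}")


@dataclass
class PlantNameRecord:
    original: str                  # name exactly as found in V1.0
    gbif_accepted_name: str        # best accepted name
    gbif_identifier: int           # GBIF key of the accepted name
    gbif_synonyms: list = field(default_factory=list)
    curation_details: list = field(default_factory=list)


# Row 6 from the comments above; curator and date are placeholders.
record = PlantNameRecord(
    original="Acacia caven",
    gbif_accepted_name="Vachellia caven",
    gbif_identifier=3795588,
    gbif_synonyms=["Acacia caven (GBIF 2979244)"],
    curation_details=[CurationDetail(date(2019, 7, 28), "curator-1", "SYNONYM")],
)
print(record.original, "=>", record.gbif_accepted_name, record.gbif_identifier)
```

Rejecting unknown actions at construction time keeps the curation log restricted to the agreed vocabulary; new actions would be added to `ACTIONS` deliberately.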

petermr commented 4 years ago

list of problem species

Shruthi has created a report on a number of problems with names. She has actually gone back to the original papers. I suggest she copy the data here.

I have also found some problems which seem to be different, and add some suggestions.

species with unusual synonyms or mapping onto more than one species.

Requires hand editing

Achillea depressa
Achillea stricta
Achillea tanacetifolia
Aloysia triphylla
Anthemis altissima
Artemisia coerulescens
Artemisia fragrans
Artemisia gallica
Artemisia herba-alba
Athrotaxis taxifolia
Cedrus liobani
Chenopodium ambrosioides
Cinnamomum fragrans
Cinnamomum zeylanicum
Coleus Aromaticus
Dracocephalum speciosum
Echinophora chysantha
Eclipta indica
Eryngium caeruleum
Eucalyptus viridiflora
Eugenia nitida
Eugenia ovalifolia
Eugenia rotundifolia
Lavandula hybrida
Lindera strychnifolia
Lippia gracillis
Mentha gracilis
Micromeria dalmatica
Nepeta fissa
Ocimum adscendens
Oenanthe divaricata
Origanum basilicum
Origanum micranthum
Pinus laricio
Pluchea purpurascens
Polymnia sonchifolia
Satureja viminea
Senecio farfarifolius
Stachys lanata
Tanacetum elburensis
Thymus capitatus
Thymus caucasicus
Thymus ciliates
Thymus hirtus

hybrids

Probably best represented at genus level

Citrus reticulata x Citrus sinensis
Citrus latifolia Tanaka x Citrus aurantifolia Swingle
Citrus paradisi x Citrus. reticulata
Citrus unshiu x Citrus nobilis
Eucalyptus citriodora x E.torelliana
Lavandula luisieri x Lavandula stoechas

and these are probably hybrids (assume the non-Unicode char is the 'times' symbol).

Mentha •À_ piperita
Mentha•À_longifolia•À_L.
Peperomia•À_pellucida•À_L.

genus

These entries are only interpretable at genus level.

Astartea sp. nov.
Calamintha var.darensis
Callistemon sp.
Eryngium sp nov.
Eryngium spp.
Eugenia sp.
Hypericum 'Hidcote'
Kunzea sp.
Mentha spp.
Mikania sp.nov.
Origanum spp.
Persea
Xanthostemon spp.
Renealmia spp.

typos

Species require lowercase specific name.

Stachys Corsica
Tordylium Ketenoglui

unknown species

Lomatopodium khorassanicum
Serotinocarpum insignis
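
Some of these groups can be spotted from the shape of the name alone, which might save hand sorting. The sketch below is a heuristic only, with a hypothetical function name and category labels of my own; it cannot detect unknown species or unusual synonyms, and it does not handle the garbled 'times' characters.

```python
def classify(name):
    """Guess a problem category from the form of a plant name."""
    tokens = name.split()
    # A standalone "x" between names marks a hybrid formula.
    if any(token.lower() == "x" for token in tokens[1:]):
        return "HYBRID"
    # A bare genus name.
    if len(tokens) < 2:
        return "GENUS"
    epithet = tokens[1]
    lowered = epithet.lower()
    # "sp.", "spp.", "sp nov.", a variety without a species, or a quoted cultivar.
    if (lowered in ("sp", "spp")
            or lowered.startswith(("sp.", "spp.", "var."))
            or epithet[0] in "'\""):
        return "GENUS"
    # The specific epithet should be lowercase.
    if epithet[0].isupper():
        return "TYPO"
    return "SPECIES"


for example in ["Citrus unshiu x Citrus nobilis", "Mikania sp.nov.",
                "Hypericum 'Hidcote'", "Persea", "Stachys Corsica",
                "Lomatopodium khorassanicum"]:
    print(example, "->", classify(example))
```

Names that come out as SPECIES still need the GBIF check; the heuristic only filters out the entries that cannot be resolved to a species as written.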

petermr commented 4 years ago

Shruthi, Are you able to create a table of frequencies of plants? Then we could start the disambiguation with the most frequent problems. You would have to find the unique ids for each profile, extract the plant by joining the tables and then sort.

It would be useful statistics as well. P.
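
A minimal sketch of that join-and-sort, assuming the profiles can be exported as (profile id, pid) pairs and the plants as a pid-to-name mapping. The pids and names come from this thread, but the profile ids and the resulting counts are invented purely for illustration, and the function name is hypothetical.

```python
from collections import Counter

# pid -> plant name (pids as reported in this thread).
plants = {299: "Callistemon sp.", 879: "Kunzea ambigua", 560: "Eryngium sp. nov."}

# (profile id, pid) pairs: hypothetical data, not real EssoilDB profiles.
profiles = [(1, 879), (2, 879), (3, 299), (4, 879), (5, 560), (6, 299)]


def plant_frequencies(profiles, plants):
    """Join profiles to plants on pid and sort by number of profiles."""
    counts = Counter(pid for _profile_id, pid in profiles)
    return [(plants[pid], pid, n) for pid, n in counts.most_common()]


for name, pid, n in plant_frequencies(profiles, plants):
    print(n, pid, name)
```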

petermr commented 4 years ago

Shruthi, when you have disambiguated (most of) the plant species, can you look up their IDs in Wikidata? I wrote a simple tool in Feb for the workshop, but it was a bit slow - it had to look up names one by one. There may be better tools now - I can ask...
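
One way to avoid one-by-one lookups might be a single SPARQL query over a batch of GBIF keys. This is a sketch under assumptions, not the tool mentioned above: my understanding is that Wikidata stores the GBIF taxon ID as property P846 and that the public query service lives at `https://query.wikidata.org/sparql`, but both should be verified. The function names are hypothetical, and only the query text and URL are built here; nothing is sent.

```python
import urllib.parse

# Assumed address of the Wikidata query service; verify before use.
WDQS = "https://query.wikidata.org/sparql"


def build_query(gbif_keys):
    """Build one SPARQL query that asks for the items matching many GBIF keys."""
    values = " ".join('"{}"'.format(int(key)) for key in gbif_keys)
    return (
        "SELECT ?item ?gbif WHERE { "
        "VALUES ?gbif { " + values + " } "
        "?item wdt:P846 ?gbif . }"   # P846: GBIF taxon ID (assumption)
    )


def query_url(gbif_keys):
    """URL that would return the matches as JSON."""
    params = {"query": build_query(gbif_keys), "format": "json"}
    return WDQS + "?" + urllib.parse.urlencode(params)


print(build_query([2685484, 2927100]))
```

Keys that return no item would correspond to the names that do not yet exist in Wikidata.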

Shruthi-M commented 4 years ago

> Shruthi, Are you able to create a table of frequencies of plants? Then we could start the disambiguation with the most frequent problems. You would have to find the unique ids for each profile, extract the plant by joining the tables and then sort.
>
> It would be useful statistics as well. P.

Sir, presently I am adding the authors to the variations columns. This is very time-consuming as there are a lot of entries. I had a small discussion with Gitanjali ma'am today and we decided to add common names, synonyms, GBIF key and the scientific name (with author), all under one separate column titled "SYNONYM". I am currently working on this. As I have only 10 days of my training left and I have to start writing my final report, I will not be able to give more inputs apart from working on the new column.

Thank you for your guidance.

petermr commented 4 years ago

> Sir Presently, I am adding the authors to the variations columns. This is very time-consuming as there are a lot of entries.

I can understand there is a lot to do.

> I had a small discussion with Gitanjali ma'am today and we decided to add common names, synonyms, GBIF key and the scientific name (with author) - all under one separate column titled "SYNONYM".

What is the purpose of SYNONYM? Is it for searching? In which case it can be automatically generated from the GBIF identifier when needed.

> I am currently working on this. As I have only 10 days of my training left and I have to start writing my final report, I will not be able to give more inputs apart from working on the new column.

Understood. I will mail Gita.

> Thank you for your guidance.

It is a pleasure to work with you.

Shruthi-M commented 4 years ago

Greetings! I have uploaded a file named essoildb.plantdata (2) on the repository. This has the following columns: [Please note: This is not the plant table. This is being used only for modifications.]

pid - as per EssoilDB 1.0
pname - as existing in EssoilDB 1.0
scientificName (gbif) - results obtained from GBIF
Normalized name
Details - about the author, subspecies, variety, etc.
pfid
phid
Error
kingdom
phylum
class
order
family
genus
species
Synonym - this column just gives the name of the synonymous species along with the GBIF key of the name - existing in our database. I will be adding the synonyms, common names and scientific names of all the plants to this column. Each of these will be separated by a comma.

The entries that are modified/ need modification are in red.

The hybrids are yet to be resolved

petermr commented 4 years ago

Thanks. Good to see this is a separate table. Will look later today.

Shruthi-M commented 4 years ago

I have uploaded the file containing the wiki-id as wiki_id.xlsx onto the repository. "NA" implies that the name does not exist in wikidata.

petermr commented 4 years ago

Many thanks!

petermr commented 4 years ago

@Shruthi-M This is wonderful! You have done a good job. Could you please add:

how you created the data in each column - I imagine some has been added from GBIF or other authority - please name the authority explicitly and the service you used (if you did it automatically).
please explain the coloured entries.
how did you perform the Wikidata lookup? Did you use a service? (I thought you said you only had 300, but you've got about 85%, I think?)

[Although you may write this in the report the people using the plant data may not have access, so make sure the doc is in the directory].

I have renamed the major table to ingestion and created a TSV version.

petermr commented 4 years ago

UNIQUE IDENTIFIERS for plants. Now is the time to freeze the number of entries being imported from V1.0. There are 1838 plant entries and you have generated a unique ID for each record. This ID must always be associated with the same record. If records are deleted we NEVER reuse that identifier. I think the identifiers should have one or more leading letters. This has several advantages:

it protects against pre-truncation by mistake
it protects against using them as data (e.g. addition or subtraction)
it makes it clear they are identifiers
it may make them easier to find in Google, etc.
it identifies them to the world as EssoilDB

So I suggest:

EPdddd for plants
ECdddd for compounds
ELdddd for locations etc.

The question is whether we create identifiers of fixed length, e.g. EP0001234. Since Wikidata and others don't, I suggest we DON'T worry about length.
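The identifier scheme above can be sketched in a few lines. This is only an illustration of the proposal (prefix plus digits, no fixed length, never reuse a retired number); the helper names `make_id`, `is_valid_id` and `next_id` are hypothetical and not part of the repository.

```python
import re

# Prefixes proposed in the thread: EP (plants), EC (compounds), EL (locations).
PREFIXES = {"plant": "EP", "compound": "EC", "location": "EL"}
ID_PATTERN = re.compile(r"E[PCL][0-9]+")


def make_id(kind, number):
    """Build an identifier such as EP1234 (no zero padding, no fixed length)."""
    if number < 1:
        raise ValueError("identifier numbers start at 1")
    return f"{PREFIXES[kind]}{number}"


def is_valid_id(text):
    """True if the whole string is a prefix followed by digits only."""
    return ID_PATTERN.fullmatch(text) is not None


def next_id(kind, issued):
    """Return an unused identifier.

    `issued` must contain every identifier ever handed out, including
    those of deleted records, so that a retired number is never reused.
    """
    prefix = PREFIXES[kind]
    numbers = [int(i[len(prefix):]) for i in issued if i.startswith(prefix)]
    return make_id(kind, max(numbers, default=0) + 1)
```

Keeping deleted identifiers in the `issued` set is what enforces the "never reuse" rule; a bare number such as `1838` fails `is_valid_id`, which guards against treating identifiers as data.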

EmanuelFaria commented 4 years ago

Manny>> Before re-importing into the database, I’d like to get a shot at eliminating any invisible characters and other anomalies, please.

PMR>> Absolutely!!

The characters should ONLY be Unicode 32-126. We will test for that. All other characters must be mapped onto these.

Thus any beta-character => beta-
all quotes => " or '
all dashes => -
all typography and style is discarded
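The mapping rules above could be applied with something like the following sketch. The replacement table is illustrative and far from complete, the function name `to_printable_ascii` is hypothetical, and two choices are assumptions rather than anything decided in this thread: the Greek letter beta becomes the word "beta", and accents are dropped by Unicode decomposition. Characters that have no mapping are reported instead of being silently removed from view.

```python
import unicodedata

# Illustrative replacements only; extend as new characters turn up.
REPLACEMENTS = {
    "\u03b2": "beta",                # Greek small letter beta
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2010": "-", "\u2011": "-",    # hyphen, non-breaking hyphen
    "\u2012": "-", "\u2013": "-",    # figure dash, en dash
    "\u2014": "-", "\u2212": "-",    # em dash, minus sign
    "\u00a0": " ",                   # no-break space
}


def to_printable_ascii(text):
    """Return (cleaned_text, unmapped_characters).

    cleaned_text contains only code points 32-126.
    unmapped_characters lists anything that had no mapping, so the
    replacement table can be extended by hand.
    """
    cleaned = []
    unmapped = []
    for char in text:
        replacement = REPLACEMENTS.get(char, char)
        # NFKD splits accented letters into base letter + combining mark.
        for piece in unicodedata.normalize("NFKD", replacement):
            if 32 <= ord(piece) <= 126:
                cleaned.append(piece)
            elif unicodedata.combining(piece):
                continue  # accent or other styling mark: discarded
            else:
                unmapped.append(piece)
    return "".join(cleaned), unmapped
```

For example, `to_printable_ascii("\u03b2-pinene")` gives `("beta-pinene", [])`, while a zero-width space ends up in the second list so that someone can decide how to handle it.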

(BTW when replying to Github issues, try to eliminate all copy of previous posts, signatures, routing etc.)

petermr commented 4 years ago

I have renamed @Shruthi-M's tables to tables/plant/import1.0.* Sorry if this inconveniences anyone.

Shruthi-M commented 4 years ago

Greetings! I have uploaded a file - details.xlsx. This contains the following data:

pid
Normalized name
scientificName
GBIF key
wiki_id
IF_ACCEPTED _NAMES
IF_SYNONYMS
Common_names
synonyms

Columns 6 and 7

I have also uploaded another document called Documentation (details), which contains the code used to obtain this data.

ANALYSIS: The following cases need a review:

a) if a taxon is neither accepted nor a synonym, the name needs review
b) if the scientificName column contains the entry "Plantae"
c) if the entries in the column "Normalized name" are marked in red
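Cases (a) and (b) above could be flagged automatically once the table is available as TSV. The sketch below makes several assumptions that are not confirmed in this thread: the column names `pid`, `scientificName`, `IF_ACCEPTED_NAMES` and `IF_SYNONYMS` as header strings, and that an empty cell or "NA" means "no value". Case (c) cannot be checked this way because cell colour does not survive export to plain text.

```python
import csv
import io

# Assumption: an empty cell or "NA" means the value is missing.
EMPTY_MARKERS = {"", "NA"}


def review_reasons(row, accepted_col="IF_ACCEPTED_NAMES", synonym_col="IF_SYNONYMS"):
    """Return the list of reasons why one table row needs manual review."""
    reasons = []
    accepted = (row.get(accepted_col) or "").strip()
    synonym = (row.get(synonym_col) or "").strip()
    if accepted in EMPTY_MARKERS and synonym in EMPTY_MARKERS:
        reasons.append("neither accepted nor synonym")
    if (row.get("scientificName") or "").strip() == "Plantae":
        reasons.append("scientificName is Plantae")
    return reasons


def rows_needing_review(tsv_text):
    """Map pid -> reasons for every row of a TSV table that needs review."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = {}
    for row in reader:
        reasons = review_reasons(row)
        if reasons:
            flagged[row["pid"]] = reasons
    return flagged
```

If red entries need the same treatment, one option would be to record the colour as an explicit extra column before exporting, so that the flag is visible to any tool.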

Shruthi-M commented 4 years ago

The above post is in tables/plant.

petermr commented 4 years ago

The details.xlsx table looks well designed and created. I need to check details - this will take a little time. The wiki_id table is presumably not required as the Wikidata column is already in "details", correct?

Shruthi-M commented 4 years ago

On Tue, 6 Aug 2019 at 04:56, petermr notifications@github.com wrote:

The details.xlsx table looks well designed and created. I need to check details - this will take a little time.

Thank you Sir

The wiki_id table is presumably not required as the Wikidata column is already in "details", correct?

Yes, a separate table is not required.

petermr commented 4 years ago

@Shruthi-M - can you put a file with brief descriptions of the column headings and the colours in the plant/ directory?

Also, are there non-ASCII characters? I suspect not, as plant names use ASCII and I don't think there are other requirements. Normalize dashes to hyphen-minus. There should be no quotes or apostrophes, but if there are, normalize them to " or '. Do not use smart quotes. Use TSV by default because you may need commas elsewhere. Spaces should be normal single spaces (char 32). Use a text editor, not Word. I'll have a look but I'm not too concerned. (The chemistry and bibliography are harder.)
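A quick way to test a TSV file against these rules is to report every offending position rather than fixing anything. The following is a minimal sketch; the function name `check_tsv` is hypothetical, and treating the tab as the only permitted character outside the range 32-126 is an assumption based on the choice of TSV.

```python
def check_tsv(text):
    """List (line_number, column_number, problem) for rule violations.

    Rules checked: every character except the tab separator must be in
    the range 32-126, and there must be no runs of two or more spaces.
    Line and column numbers start at 1.
    """
    problems = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        for column_number, char in enumerate(line, start=1):
            if char == "\t":
                continue  # field separator
            if not 32 <= ord(char) <= 126:
                problems.append((line_number, column_number, f"U+{ord(char):04X}"))
        if "  " in line:
            problems.append((line_number, line.index("  ") + 1, "double space"))
    return problems
```

An empty result means the text passes. A no-break space inside a plant name, for instance, shows up as `U+00A0` with its line and column, which makes it easy to find in a text editor.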

EmanuelFaria commented 4 years ago

These are all good points Peter,

I’ll be taking care of this as the final step before Gita and you get a last look before import.

With what little time Shruthi has on this project, getting the data to be true and correct should be her main focus. If she can do this without endangering the chances of having true, correctly spelled data, that’s great. But it is ultimately unnecessary because Gita has made me responsible for that.

Meanwhile…

GO! Shruthi GO! We’re cheering you on to the finish line!!

Manny

Emanuel Faria Founder | Formulator | President emanuel@verriclear.com VERRICLEAR NATURAL SKIN ESSENTIALS LTD. Nature + Science = Success!™
North America: www.verriclear.com http://www.verriclear.com/ South America: www.verriclear.com.br http://www.verriclear.com.br/

petermr commented 4 years ago

On Tue, Aug 6, 2019 at 8:31 AM Manny notifications@github.com wrote:

These are all good points Peter,

I’ll be taking care of this as the final step before Gita and you get a last look before import.

Thanks so much!

With what little time Shruthi has on this project, getting the data to be true and correct should be her main focus.

Absolutely agreed.

If she can do this without endangering the chances of having true, correctly spelled data, that’s great. But ultimately unnecessary because Gita has made me responsible for that.

It is MUCH easier now. By resolving against GBIF and Wikipedia/Wikidata we don't have to worry about spelling because they take care of it. So GBIF=2685484 Wikidata=Q146992 species=Abies alba is ALL we have to know for the first entry. Everything else can be looked up.

"GBIF, what is the preferred taxonomic authority for Abies alba?" "Abies alba Mill."

"Wikidata, what is the common name for Q146992 in Portuguese?" "abeto-prateado"

In particular those two authorities work closely together. They will automatically update when:

a species is reclassified (genus, family)
a new synonym is found
a new authority is added

Also you can automatically ask: "What is the IUCN status of Q146992?" "Least concern"
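The questions above map onto two public web services. The sketch below uses only the Python standard library; it assumes the GBIF species match endpoint (`/v1/species/match`) and Wikidata's `Special:EntityData` JSON export, and it reads the Portuguese name from the entity's label, which is an assumption (the common name may instead be stored under the "taxon common name" property). The User-Agent string and function names are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

GBIF_MATCH = "https://api.gbif.org/v1/species/match"
WIKIDATA_ENTITY = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def gbif_match_url(name):
    """URL that asks GBIF to match a scientific name."""
    return f"{GBIF_MATCH}?{urlencode({'name': name})}"


def wikidata_entity_url(qid):
    """URL of the JSON export for one Wikidata item."""
    return WIKIDATA_ENTITY.format(qid=qid)


def fetch_json(url):
    """Download and decode a JSON document (needs network access)."""
    request = Request(url, headers={"User-Agent": "EssOilDB-example/0.1"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def gbif_summary(match):
    """Pick the fields of a GBIF match result that the plant table uses."""
    keys = ("usageKey", "scientificName", "matchType", "status")
    return {key: match.get(key) for key in keys}


def wikidata_label(entity_json, qid, language):
    """Label of an item in one language, or None if there is none."""
    labels = entity_json["entities"][qid]["labels"]
    return labels.get(language, {}).get("value")
```

With network access, `gbif_summary(fetch_json(gbif_match_url("Abies alba")))` and `wikidata_label(fetch_json(wikidata_entity_url("Q146992")), "Q146992", "pt")` would answer the first two questions; the results depend on the live services and are not reproduced here.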

In this way the things that EssoilDB has to maintain are:

a register of imported articles (bibliography) - Ambarish is doing this
a register of plant species (Shruthi has done this!)
a register of compounds (Ambarish is doing this)
locations (not well advanced)

then:

an import mechanism (PMR)
import checking - yet to be developed but uses core tables
data=> core plant/compound/parts/location/ tables
a search engine (separate from core) using:
- core tables
- plant synonyms from Wikidata, GBIF
- chemical structure search (from CDK - they will be happy to advise)
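As one possible reading of the search component, plant synonyms can be folded into a lookup that always resolves to the core identifier, keeping the search index separate from the core tables. This is a hypothetical sketch: the identifier `EP1` and the function names are made up for illustration, and the names would in practice come from Wikidata and GBIF.

```python
def build_name_index(names_by_plant):
    """Map every known name (case-insensitive) to the plant IDs that use it."""
    index = {}
    for plant_id, names in names_by_plant.items():
        for name in names:
            index.setdefault(name.casefold(), set()).add(plant_id)
    return index


def search(index, query):
    """Return the sorted plant IDs whose accepted name or synonym equals the query."""
    return sorted(index.get(query.strip().casefold(), ()))


# Illustrative data: EP1 is a made-up identifier for Abies alba.
example_index = build_name_index({"EP1": ["Abies alba", "abeto-prateado"]})
```

Because the index is derived data, it can be rebuilt whenever the authorities add a synonym, without touching the core plant table.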

This design is implicit in the poster which should be an initial guide

I think this is a great time to design in the features that you would find useful. It's a relatively small knowledgebase so systems such as NoSQL or Tidyverse should be considered. Also I want to store a LOT more of the original papers if that would be useful.

Exciting!

I'd very much like to talk again over Skype. I think just you and me if Gita is busy.

Meanwhile…

GO! Shruthi GO! We’re cheering you on to the finish line!!

:D

petermr commented 4 years ago

@Shruthi-M I have found your *.docx file and this looks very good. Am reading it.

petermr commented 4 years ago

Using Word documents for docs on Github is not normally a good idea, for several reasons.

Word can introduce spurious characters, especially line ends, smart quotes etc.
Github is designed for code, Word is not.

In particular, displaying screenshots of code can be very frustrating for people who want to use it. They have to retype it and will make mistakes. People want to cut and paste and run. (Same goes for the species/output.) [Screenshots can be useful for tutorials and web pages, but the original should always be available.]

Can you put the code in an R format (notebook-like)? That's the best way.

Shruthi-M commented 4 years ago

Sure, I will look into this.

EmanuelFaria commented 4 years ago

Thanks Peter,

Everything below is great news. If I were the one responsible for deciding final spelling among all the versions and accepted typos on Google, I’d pull my hair out. (What’s left of it.)

Regarding locations, I’ve started this but ran into some trouble trying to parse State/Prov, City, Town, Region names from the original single text field into separate fields for each. I’ve emailed the owner of a world-wide database for help, but got no response. If I (or preferably Manish) can figure out how we could use such a database to automatically compare words in our Locations table against the World table, and drop each into the right field, that would be a time-saving miracle, not to mention taking away the fear of getting something wrong.
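The comparison described above could look roughly like this. It is only a sketch under strong assumptions: the free-text field is comma-separated, and `GAZETTEER` is a tiny hand-written stand-in for the world-wide place-name table, which would have to be loaded from a real source. Anything that cannot be placed is kept in an "unresolved" list for a human to check rather than guessed.

```python
# Stand-in for a real place-name table: lowercase name -> field it belongs to.
GAZETTEER = {
    "brazil": "country",
    "minas gerais": "state_province",
    "belo horizonte": "city",
}


def parse_location(text, gazetteer=GAZETTEER):
    """Split a free-text location into fields using a place-name lookup."""
    record = {"unresolved": []}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        field = gazetteer.get(part.casefold())
        if field is None or field in record:
            # Unknown name, or a second candidate for the same field.
            record["unresolved"].append(part)
        else:
            record[field] = part
    return record
```

Real location strings will need more than exact matching (alternative spellings, names that are both a city and a state), so the unresolved list is where most of the manual effort would go.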

I have a bit more to do on the updated Compound (AND PLANT) activities table (because lots of journal articles talk about plant oils having activities, without naming specific constituents). All the current IDs will be preserved, and I have a plan to make it easy to connect current entries that list more than one activity in the same record field.

Looking forward to getting the cleanups done for the team in a precise yet quick manner.

I have a list of things I’ve found and stored in a “clean up” database, so I can copy and paste the non-space spaces and other anomalies.

Whenever you’re ready, send me your list, and I’ll go through them all, methodically (checklist-style), and turn to you if anything strange happens before uploading for your final once-over.

I’d love to chat with you too!

I’m in the middle of some extremely tedious, but very important work for the next couple of days, but perhaps Thursday or Friday? Keep in mind I’m in Brazil, so we can work out a good time for both of us.

If you use an iPhone, I use this app to quickly find good times to meet across two or more time zones at once: https://apps.apple.com/ca/app/timescroller-time-zone-utility/id288013812

Talk soon on skype! (Skype name: mannyrules … I was feeling good about myself that day, and didn’t know my account name would end up being my public user name haha)

Good day or night to you, wherever you are.

Manny

petermr commented 4 years ago

On Tue, Aug 6, 2019 at 4:22 PM Manny notifications@github.com wrote:

Thanks Peter,

Everything below is great news. If I were the one responsible for deciding final spelling among all the versions and accepted typos on Google, I’d pull my hair out. (What’s left of it.)

EssoilDB1.0 is a finite task. I suspect we don't need to tidy up the whole of the long tail. EssoilDB2.0 will be wonderfully different.

Regarding locations, I’ve started this but ran into some trouble trying to parse State/Prov, City, Town, Region names from the original single text field into separate fields for each. I’ve emailed the owner of a world-wide database for help, but no response. If I (or preferably Manish) can figure out how we could use such a database to automatically compare words in our Locations table against the World table, and drop it in the right field, that would be a time-saving miracle — not to mention taking the fear of getting something wrong.

The Open community - Wikipedia and others - have some solutions here. I'll tweet it.

I have a bit more to do on the updated Compound (AND PLANT) activities table (because lots of journal articles talk about plant oils having activities, without naming specific constituents).

Let's talk about this. I was under the impression that the activities in E1.0 were inserted from external sources and not from the paper. But I may be wrong. If it is extracting them from the paper we need to talk.

All the current IDs will be preserved, and I have a plan to make it easy to connect current entries that list more than one activity in the same record field.

We really need Gita's view on this.

I have a list of things I’ve found and stored in a “clean up” database, so I can copy and paste the non-space spaces and other anomalies.

The database is small enough it fits in Github easily. EssOilDB/v1.0/info_c.tsv is only 38 Mbyte.

Whenever you’re ready, send me your list, and I’ll go through them all, methodically (checklist-style), and turn to you if anything strange happens before uploading for your final once-over.

Ambarish is/has_been working on this. In any case all the data is on Github so we don't need to send it.

I’d love to chat with you too!

I’m in the middle of some extremely tedious, but very important work for the next couple of days, but perhaps Thursday or Friday? Keep in mind I’m in Brazil, so we can work out a good time for both of us.

I have some ideas about Open Science in LatAm which I'll explain later.

If you use an iPhone, I use this app to quickly find good times to meet across two or more time zones at once: https://apps.apple.com/ca/app/timescroller-time-zone-utility/id288013812

Talk soon on skype! (Skype name: mannyrules … I was feeling good about myself that day, and didn’t know my account name would end up being my public user name haha)

I shall be in Edinburgh Thu and Friday. I am happy to try times in the UK in the afternoon and evening.

What I'd like for V2.0 is some use cases. I can't guarantee that they would all be supported. However, I would be optimistic about experimental methodology for extraction. Activities will be harder.

Shruthi-M commented 4 years ago

Greetings! I have added a new file at EssOilDB/tables/plant called details.txt. I have added it in .txt format (not .xlsx) as requested. I can separately send the .xlsx file as well (if needed). I noticed that some changes needed to be made in the synonyms column, and I have made them. I will send the text version of the R code soon.

Thank you.