gilienv / EssOilDB

Restructuring of Essential Oil Database
Apache License 2.0

Disambiguation of plant names using GBIF #82

Open Shruthi-M opened 4 years ago

Shruthi-M commented 4 years ago

I submitted the entire set of plant names (before clean-up) to the GBIF species lookup tool (https://www.gbif.org/en/tools/species-lookup), which allows multiple names to be searched at once. I have uploaded the results to the repository as gbif_result.csv. The default column headings are as follows:

  1. occurrenceId
  2. verbatimScientificName (user-submitted name)
  3. scientificName (name existing in the database)
  4. key (unique number assigned to the particular species on GBIF)
  5. matchType (3 levels of result - EXACT, FUZZY, HIGHERRANK)
    • EXACT means the name exactly matches with the entry in the database
    • FUZZY indicates entries that may be mis-spelt
    • HIGHERRANK implies that the specific epithet of the entry is not being recognized (in other words, only genus is recognized)
  6. confidence (expressed in terms of percentage)
  7. status (can be ACCEPTED, SYNONYM or DOUBTFUL)
    • DOUBTFUL Treated as accepted, but doubtful whether this is correct.
    • SYNONYM A general synonym, the exact type is unknown.
  8. rank (the highest rank recognized)
  9. kingdom
  10. phylum
  11. class
  12. order
  13. family
  14. genus
  15. species
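
A minimal sketch of how the result file could be summarised by matchType and status, assuming it is a comma-separated file with exactly the headings listed above. The two sample rows are the Ocimum rows quoted later in this thread; the function name `summarise` is only illustrative.

```python
import csv
import io
from collections import Counter

# Two rows quoted later in this thread; the real input would be gbif_result.csv.
# The header line assumes the default GBIF column headings listed above.
SAMPLE = '''occurrenceId,verbatimScientificName,scientificName,key,matchType,confidence,status,rank,kingdom,phylum,class,order,family,genus,species
,"Ocimum sanctum","Ocimum sanctum L.","2927101","EXACT","99","SYNONYM","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"
,"Ocimum tenuiflorum","Ocimum tenuiflorum L.","2927100","EXACT","99","ACCEPTED","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"
'''


def summarise(handle):
    """Count rows per (matchType, status) pair."""
    reader = csv.DictReader(handle)
    return Counter((row["matchType"], row["status"]) for row in reader)


counts = summarise(io.StringIO(SAMPLE))
for (match_type, status), n in sorted(counts.items()):
    print(match_type, status, n)
```

For the real file, `summarise(open("gbif_result.csv", newline="", encoding="utf-8"))` would be the equivalent call, assuming the file is UTF-8.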
petermr commented 4 years ago

Looks great. Well done for organising the columns. We will need something like this for chemistry. I will look in detail when I am on my laptop.


petermr commented 4 years ago

Shruthi, this is excellent. We should preserve this table. Then we will normalize. We should classify the results into the major groups. Initial comments:

occurrenceId verbatimScientificName scientificName key matchType confidence status rank kingdom phylum class order family genus species

occurrenceId // these were all blank, so we can drop this
verbatimScientificName // this is our initial raw data and must be preserved. Let's use GBIF terminology where possible, so keep this column name
scientificName // the preferred name. Can include synonyms. We should not use this if there is a species
key // this is the most important column and gives us all the normalized information we need
matchType // yes, we should keep this because it helps understand non-normalized species
confidence // what's the lowest? I think we can drop this later
status // useful for non-normalized names
rank // useful for non-normalized names
kingdom phylum class order family genus // probably keep. GBIF seems to map unknown species to genus.
species // the key normalization

Abies alba Abies alba Mill. 2685484 EXACT 99 ACCEPTED SPECIES Plantae Tracheophyta Pinopsida Pinales Pinaceae Abies Abies alba
Acacia nuperrima Acacia nuperrima Baker f. 2980107 EXACT 100 ACCEPTED SPECIES Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Acacia Acacia nuperrima
Acacia nuperrima Acacia nuperrima Baker f. 2980107 EXACT 100 ACCEPTED SPECIES Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Acacia Acacia nuperrima

^^ duplicates. Why? we can get rid of these immediately
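
A minimal sketch of removing exact duplicate rows while keeping the original order. It assumes each row has already been split into fields; the three sample rows are abbreviated to the first three columns of the rows shown above, and `deduplicate` is a hypothetical helper name.

```python
def deduplicate(rows):
    """Drop exact repeats, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(row)  # rows must be hashable to be remembered
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# verbatimScientificName, scientificName, key (other columns omitted for brevity)
rows = [
    ("Abies alba", "Abies alba Mill.", "2685484"),
    ("Acacia nuperrima", "Acacia nuperrima Baker f.", "2980107"),
    ("Acacia nuperrima", "Acacia nuperrima Baker f.", "2980107"),
]
unique_rows = deduplicate(rows)
print(len(rows), "->", len(unique_rows))
```

Only rows that are identical in every field are dropped, so two different verbatim names that map to the same key would both survive.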

Achillea albicaulis Achillea albicaulis C.A.Mey. 3120384 EXACT 99 SYNONYM SPECIES Plantae Tracheophyta Magnoliopsida Asterales Asteraceae Achillea Achillea tenuifolia

This is a single synonym but we should use the species name "Achillea tenuifolia" for future matching, not the scientificName "Achillea albicaulis". As always the key is the critical column.

,"Ocimum sanctum","Ocimum sanctum L.","2927101","EXACT","99","SYNONYM","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"
,"Ocimum tenuiflorum","Ocimum tenuiflorum L.","2927100","EXACT","99","ACCEPTED","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"

These are synonyms but have different keys. So in our normalized table there should only be the ACCEPTED. Normalization should be on "species"

Let's summarize and make a list of ACCEPTED species

SYNONYMS can be removed if there is an ACCEPTED species. A SYNONYM without an ACCEPTED equivalent should be normalized on the species.

Everything else should be separated out as we will have to discuss it.
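
The rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: each row is a dict keyed by the GBIF column names, and the function name `classify` and its return shape are hypothetical. The sample rows are the Ocimum and Achillea rows quoted above, reduced to the columns the rules use.

```python
def classify(rows):
    """Split GBIF rows into accepted species, synonym mappings and leftovers."""
    accepted = {}      # species name -> GBIF key of the ACCEPTED row
    synonyms = {}      # verbatim name -> species it should be normalized to
    needs_review = []  # everything else (e.g. DOUBTFUL, no species resolved)

    for row in rows:
        status, species = row["status"], row["species"]
        if status == "ACCEPTED" and species:
            accepted[species] = row["key"]
        elif status == "SYNONYM" and species:
            synonyms[row["verbatimScientificName"]] = species
        else:
            needs_review.append(row)

    # Synonyms whose species has no ACCEPTED row still normalize on "species".
    orphan_species = sorted(set(synonyms.values()) - set(accepted))
    return accepted, synonyms, orphan_species, needs_review


rows = [
    {"verbatimScientificName": "Ocimum sanctum", "key": "2927101",
     "status": "SYNONYM", "species": "Ocimum tenuiflorum"},
    {"verbatimScientificName": "Ocimum tenuiflorum", "key": "2927100",
     "status": "ACCEPTED", "species": "Ocimum tenuiflorum"},
    {"verbatimScientificName": "Achillea albicaulis", "key": "3120384",
     "status": "SYNONYM", "species": "Achillea tenuifolia"},
]
accepted, synonyms, orphan_species, needs_review = classify(rows)
print(accepted, orphan_species)
```

Here "Achillea tenuifolia" ends up in `orphan_species`: it is the normalized name, but its own GBIF key would still have to be looked up separately.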

Well done.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr commented 4 years ago

We are going to need unique identifiers for all these accepted species, e.g. PL123. We need a column for the EssoilDB plant key.

petermr commented 4 years ago

@Shruthi-M - this is so important for all of us! Everyone should read this thread. I'll annotate @Shruthi-M's table and add actions. We should extract the discussion here onto *.md pages as well. There are fundamental issues which apply to compounds as well, @ambarishK. I hadn't seen this clearly until I started on the poster. At present we are only dealing with converting the existing EssoilDB 1.0 (E1.0) to E2.0 (i.e. not worrying about ingesting new data into either).

= Origin of data =

It's critical to review exactly where the data comes from. After talking with @gilienv yesterday I believe that:

== ACTION ==

We have to agree and then document what is in EssoilDB 1.0

Shruthi-M commented 4 years ago

Sir, I am working on your previous guidelines. I will try to separate the synonyms and the accepted names using the clues you have mentioned. A final list of accepted species will be prepared soon (in the coming week).

ISSUES BEING FACED:

  1. Callistemon sp. [pid: 299] - The literature reports 7 varieties of this species and our database has data about only one variety - “Blackdown tableland”.
  2. Kunzea ambigua [pid: 879] - The literature from which this is taken was analyzed. It was found that the data corresponding to this entry relate to “prostrate form, B”, and the article reports three more varieties which are not included in the database.
  3. Eryngium sp. nov. [pid: 560] - The literature reports 2 varieties - “1” and “2” under this. In EssoilDB 1.0, these 2 varieties are not separated.
  4. Astartea sp. nov. [pid: 179] and Mikania sp. nov. [pid: 1074] - These could not be resolved further.
  5. There are 11 binomial names that are shown to be DOUBTFUL.
  6. Binomials without the author's name are not accepted by GBIF and other open source databases. More than half of the binomials have more than one author. I have referred to the journal and chosen the right author wherever I could find a discrepancy. Do the binomials have to be separated from or retained along with their authors (as this plays a crucial role in a bibliography database)? This was raised earlier and not resolved completely. It would be really kind of you if you could give me more clarity about this.
  7. A final list of accepted species (i.e. not synonyms) has to be prepared.
  8. This list of accepted names has to be rechecked against the respective journal articles to ensure that each has the right assigned author. NOTE: The assignment of author was done by the GBIF web program. Hence, this step is necessary.
petermr commented 4 years ago

Thanks so much Shruthi. One feature of data is that there is always a "long tail" (https://en.wikipedia.org/wiki/Long_tail): a few items that can't be easily processed. The most important thing at present is to resolve the largest chunks of names as efficiently as possible. I'll try to highlight a strategy today based on the very useful output from GBIF you created. If there is a species that occurs only once and we can't resolve it, compared with one that occurs 10 times and we can, we prioritise the latter.

P.


petermr commented 4 years ago

I shall add comments on your very useful output (which can be viewed directly in table form on Github): https://github.com/gilienv/EssOilDB/blob/master/tables/plant/gbif_result.tsv

Note that there is no EssoilDB ID for each row so I shall refer to this table in unsorted fashion. This is why we need an ID!

petermr commented 4 years ago

After deduplication here are my comments.

Vachellia caven is not mentioned in V1.0 so we lookup Vachellia caven in GBIF to give https://www.gbif.org/species/3795588. We then use Vachellia caven (GBIF 3795588) as the accepted name with Acacia caven (GBIF 2979244) as a synonym. ACTION agree this strategy.
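
A rough sketch of that strategy, assuming the public GBIF name-matching endpoint `https://api.gbif.org/v1/species/match` is used for the lookup (the thread itself only mentions the web tool, so the endpoint choice is an assumption). The helpers `match_url` and `link_synonym` are hypothetical names; the two GBIF keys are the ones quoted above. No network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint; the thread only refers to the GBIF web lookup tool.
GBIF_MATCH = "https://api.gbif.org/v1/species/match"


def match_url(name, kingdom="Plantae"):
    """Build the lookup URL for one name; fetching and parsing are left out."""
    return GBIF_MATCH + "?" + urlencode({"name": name, "kingdom": kingdom})


def link_synonym(accepted_name, accepted_key, synonym_name, synonym_key):
    """Record an accepted name together with the synonym found in V1.0."""
    return {
        "accepted": (accepted_name, accepted_key),
        "synonyms": [(synonym_name, synonym_key)],
    }


url = match_url("Vachellia caven")
entry = link_synonym("Vachellia caven", 3795588, "Acacia caven", 2979244)
print(url)
print(entry)
```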

recommendation

petermr commented 4 years ago

list of problem species

Shruthi has created a report listing a number of problems with names. She has actually gone back to the original papers. I suggest she copies the data here.

I have also found some problems which seem to be different, and add some suggestions.

species with unusual synonyms or mapping onto more than one species.

Requires hand editing

Achillea depressa
Achillea stricta
Achillea tanacetifolia
Aloysia triphylla
Anthemis altissima
Artemisia coerulescens
Artemisia fragrans
Artemisia gallica
Artemisia herba-alba
Athrotaxis taxifolia
Cedrus liobani
Chenopodium ambrosioides
Cinnamomum fragrans
Cinnamomum zeylanicum
Coleus Aromaticus
Dracocephalum speciosum
Echinophora chysantha
Eclipta indica
Eryngium caeruleum
Eucalyptus viridiflora
Eugenia nitida
Eugenia ovalifolia
Eugenia rotundifolia
Lavandula hybrida
Lindera strychnifolia
Lippia gracillis
Mentha gracilis
Micromeria dalmatica
Nepeta fissa
Ocimum adscendens
Oenanthe divaricata
Origanum basilicum
Origanum micranthum
Pinus laricio
Pluchea purpurascens
Polymnia sonchifolia
Satureja viminea
Senecio farfarifolius
Stachys lanata
Tanacetum elburensis
Thymus capitatus
Thymus caucasicus
Thymus ciliates
Thymus hirtus

hybrids

Probably best represented at genus level

Citrus reticulata x Citrus sinensis
Citrus latifolia Tanaka x Citrus aurantifolia Swingle
Citrus paradisi x Citrus. reticulata
Citrus unshiu x Citrus nobilis
Eucalyptus citriodora x E.torelliana
Lavandula luisieri x Lavandula stoechas

and these are probably hybrids (assume the non-Unicode char is the 'times' symbol).

Mentha •À_ piperita
Mentha•À_longifolia•À_L.
Peperomia•À_pellucida•À_L.

genus

These entries are only interpretable at genus level.

Astartea sp. nov.
Calamintha var.darensis
Callistemon sp.
Eryngium sp nov.
Eryngium spp.
Eugenia sp.
Hypericum 'Hidcote'
Kunzea sp.
Mentha spp.
Mikania sp.nov.
Origanum spp.
Persea
Xanthostemon spp.
Renealmia spp.
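
A minimal sketch of how such genus-level entries might be picked out automatically. The pattern is an assumption derived only from the examples in this list (a bare genus, "sp."/"spp." with or without "nov.", a "var." with no species, or a quoted cultivar), so it may well miss other forms.

```python
import re

# Genus alone, or genus followed by sp/spp, "var." or a quoted cultivar name.
GENUS_ONLY = re.compile(r"^[A-Z][a-z]+(\s+(spp?\b\.?.*|var\..*|'.*'))?\s*$")


def is_genus_only(name):
    """True if the entry carries no usable specific epithet."""
    return bool(GENUS_ONLY.match(name.strip()))


examples = [
    "Astartea sp. nov.", "Calamintha var.darensis", "Callistemon sp.",
    "Eryngium sp nov.", "Eryngium spp.", "Hypericum 'Hidcote'",
    "Mikania sp.nov.", "Persea",
]
for name in examples + ["Abies alba", "Mentha spicata"]:
    print(name, is_genus_only(name))
```

The word boundary after "sp"/"spp" is what keeps epithets that merely start with those letters, such as "spicata", from being treated as genus-level entries.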

typos

Species require lowercase specific name.

Stachys Corsica
Tordylium Ketenoglui
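
This kind of correction could be sketched as below, assuming the second word of the entry really is the specific epithet. The helper name `fix_epithet_case` is illustrative only.

```python
def fix_epithet_case(name):
    """Capitalize the genus and lowercase the second word of a binomial."""
    parts = name.split()
    if len(parts) < 2:
        return name  # bare genus: nothing to fix
    genus, epithet, rest = parts[0], parts[1], parts[2:]
    # Anything after the epithet (author, variety) is left untouched.
    return " ".join([genus.capitalize(), epithet.lower()] + rest)


for raw in ["Stachys Corsica", "Tordylium Ketenoglui"]:
    print(raw, "->", fix_epithet_case(raw))
```

It should not be applied blindly to genus-level entries, where the second word may be an author abbreviation or a cultivar name rather than an epithet.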

unknown species

Lomatopodium khorassanicum
Serotinocarpum insignis
petermr commented 4 years ago

Shruthi, Are you able to create a table of frequencies of plants? Then we could start the disambiguation with the most frequent problems. You would have to find the unique ids for each profile, extract the plant by joining the tables and then sort.

It would be useful statistics as well. P.
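
A minimal sketch of the join-and-count step, assuming a profile table whose rows carry a plant id (pid) and a plant table mapping pid to name. The table layouts and the profile rows below are invented for illustration; only the three pids and names come from this thread.

```python
from collections import Counter

# Hypothetical miniature tables; real EssoilDB 1.0 column names may differ.
plants = {299: "Callistemon sp.", 879: "Kunzea ambigua", 560: "Eryngium sp. nov."}
profiles = [  # (profile id, plant id); counts here are made up
    (1, 879), (2, 879), (3, 299), (4, 879), (5, 560),
]


def plant_frequencies(profiles, plants):
    """Join profiles to plant names and return (name, count), most frequent first."""
    counts = Counter(plants[pid] for _, pid in profiles)
    return counts.most_common()


frequencies = plant_frequencies(profiles, plants)
for name, n in frequencies:
    print(n, name)
```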

petermr commented 4 years ago

Shruthi, when you have disambiguated (most of) the plant species, can you look up their IDs in Wikidata? I wrote a simple tool in Feb for the workshop, but it was a bit slow - it had to look them up one by one. There may be better tools now - I can ask...
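
One possible way to avoid one-by-one lookups is a single SPARQL query that resolves many GBIF keys at once. The sketch below only builds the query text; it assumes Wikidata's "GBIF taxon ID" property (P846) and that the query would be sent to the Wikidata Query Service. This is not the workshop tool mentioned above, and `wikidata_query` is a hypothetical helper.

```python
def wikidata_query(gbif_keys):
    """Build one SPARQL query that maps many GBIF keys to Wikidata items."""
    values = " ".join('"{}"'.format(key) for key in gbif_keys)
    return (
        "SELECT ?gbif ?item WHERE { "
        "VALUES ?gbif { " + values + " } "
        "?item wdt:P846 ?gbif . }"
    )


# GBIF keys quoted earlier in this thread.
query = wikidata_query(["2685484", "2927100"])
print(query)
```

Keys with no matching item would simply be absent from the result, which corresponds to the "not in Wikidata" case.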

Shruthi-M commented 4 years ago

Sir, presently I am adding the authors to the variations columns. This is very time-consuming as there are a lot of entries. I had a small discussion with Gitanjali ma'am today and we decided to add common names, synonyms, GBIF key and the scientific name (with author) - all under one separate column titled "SYNONYM". I am currently working on this. As I have only 10 days of my training left and I have to start writing my final report, I will not be able to give more inputs apart from working on the new column.

Thank you for your guidance.

petermr commented 4 years ago

I can understand there is a lot to do.

On the "SYNONYM" column: what is its purpose? Is it for searching? In that case it can be generated automatically from the GBIF identifier when needed.

Understood about your remaining time. I will mail Gita.

It is a pleasure to work with you.

P.

Shruthi-M commented 4 years ago

Greetings! I have uploaded a file named essoildb.plantdata (2) on the repository. This has the following columns: [Please note: This is not the plant table. This is being used only for modifications.]

  1. pid - as per EssoilDB 1.0
  2. pname - as existing in EssoilDB 1.0
  3. scientificName (gbif) - results obtained from GBIF
  4. Normalized name
  5. Details - about the author, subspecies, variety, etc.
  6. pfid
  7. phid
  8. Error
  9. kingdom
  10. phylum
  11. class
  12. order
  13. family
  14. genus
  15. species
  16. Synonym - this column just gives the name of the synonymous species along with the GBIF key of the name - existing in our database. I will be adding the synonyms, common names and scientific names of all the plants to this column. Each of these will be separated by a comma.

The entries that are modified/ need modification are in red.

  • The hybrids are yet to be resolved

petermr commented 4 years ago

Thanks. Good to see this is a separate table. Will look later today.

Shruthi-M commented 4 years ago

I have uploaded the file containing the wiki-id as wiki_id.xlsx onto the repository. "NA" implies that the name does not exist in wikidata.

petermr commented 4 years ago

Many thanks!


petermr commented 4 years ago

@Shruthi-M This is wonderful! You have done a good job. Could you please add:

[Although you may write this in the report the people using the plant data may not have access, so make sure the doc is in the directory].

I have renamed the major table to ingestion and created a TSV version.

petermr commented 4 years ago

UNIQUE IDENTIFIERS for plants. Now is the time to freeze the number of entries being imported from V1.0. There are 1838 plant entries and you have generated a unique ID for each record. This ID must always be associated with the same record. If records are deleted we NEVER reuse that identifier. I think the identifiers should have one or more leading letters. This has several advantages:

So I suggest:

The question is whether we create identifiers of fixed length, e.g. EP0001234. Since Wikidata and others don't, I suggest we DON'T worry about length.
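
A minimal sketch of an identifier scheme with these properties: a letter prefix, no fixed length, and no reuse after deletion. The class name `PlantIdRegistry` and the in-memory storage are assumptions for illustration; the "EP" prefix is taken from the example above.

```python
class PlantIdRegistry:
    """Hands out identifiers such as EP1, EP2, ... and never reissues one."""

    def __init__(self, prefix="EP"):
        self.prefix = prefix
        self._last = 0   # highest number ever issued, never decreased
        self._live = {}  # identifier -> record currently in the table

    def add(self, record):
        self._last += 1
        ident = "{}{}".format(self.prefix, self._last)
        self._live[ident] = record
        return ident

    def delete(self, ident):
        # Deleting retires the identifier; the counter is not rewound.
        del self._live[ident]

    def __contains__(self, ident):
        return ident in self._live


registry = PlantIdRegistry()
first = registry.add("Abies alba")
second = registry.add("Acacia nuperrima")
registry.delete(second)
third = registry.add("Ocimum tenuiflorum")
print(first, second, third)
```

In a real table the counter would have to be persisted alongside the data so that the "never reuse" guarantee survives a restart.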

EmanuelFaria commented 4 years ago
Manny>> Before re-importing into the database, I'd like to get a shot at eliminating any invisible characters and other anomalies, please.

PMR>> Absolutely!!

The characters should ONLY be Unicode 32-126. We will test for that. All other characters must be mapped onto these.
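
A minimal sketch of such a test, assuming it is applied to individual cell values rather than whole TSV lines (a tab, code 9, would otherwise be reported). The function name `non_ascii_positions` is illustrative.

```python
def non_ascii_positions(text):
    """Return (index, character, code point) for each character outside 32-126."""
    return [
        (i, ch, ord(ch))
        for i, ch in enumerate(text)
        if not 32 <= ord(ch) <= 126
    ]


clean = non_ascii_positions("Abies alba")
flagged = non_ascii_positions("Mentha \u00d7 piperita")  # contains a times sign
print(clean)
print(flagged)
```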

(BTW when replying to Github issues, try to eliminate all copy of previous posts, signatures, routing etc.)

petermr commented 4 years ago

I have renamed @Shruthi-M's tables to tables/plant/import1.0.* Sorry if this inconveniences anyone.

Shruthi-M commented 4 years ago

Greetings! I have uploaded a file - details.xlsx. This contains the following data:

  1. pid
  2. Normalized name
  3. scientificName
  4. GBIF key
  5. wiki_id
  6. IF_ACCEPTED _NAMES
  7. IF_SYNONYMS
  8. Common_names
  9. synonyms

Columns 6 and 7

I have also uploaded another document called Documentation (details) which contains the code used during the process of obtaining the same.

ANALYSIS: The following cases need a review:

  a) if a taxon is neither accepted nor a synonym, it implies that the name needs review
  b) if the scientificName column contains the entry as "Plantae"
  c) if the entries in the column "Normalized name" are marked in red
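
Cases a) and b) could be checked mechanically along these lines. This is a sketch under assumptions: the accepted/synonym flags are taken to be already converted to booleans (how columns 6 and 7 encode them is not stated here), and `review_reasons` is a hypothetical helper. Case c) depends on cell colour and cannot be read from a plain text export.

```python
def review_reasons(scientific_name, is_accepted, is_synonym):
    """List the reasons, if any, why a row of details.xlsx needs review."""
    reasons = []
    if not is_accepted and not is_synonym:
        reasons.append("neither accepted nor synonym")  # case a)
    if scientific_name.strip() == "Plantae":
        reasons.append("scientificName is Plantae")     # case b)
    return reasons


print(review_reasons("Abies alba Mill.", True, False))
print(review_reasons("Plantae", False, False))
```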
Shruthi-M commented 4 years ago

The above post is in tables/plant.

petermr commented 4 years ago

The details.xlsx table looks well designed and created. I need to check details - this will take a little time. The wiki_id table is presumably not required as the Wikidata column is already in "details", correct?

P.


Shruthi-M commented 4 years ago

On Tue, 6 Aug 2019 at 04:56, petermr notifications@github.com wrote:

The details.xlsx table looks well designed and created. I need to check details - this will take a little time.

Thank you Sir

The wiki_id table is presumably not required as the Wikidata column is already in "details", correct?

Yes, a separate table is not required.


petermr commented 4 years ago

@Shruthi Mohan shruthibgr@gmail.com - can you put a file with brief descriptions of the column headings and the colours in the plant/ directory?

Also, are there any non-ASCII characters? I suspect not, as plant names use ASCII and I don't think there are other requirements. Normalize dashes to hyphen-minus. There should be no quotes or apostrophes, but if there are, normalize them to " or '. Do not use smart quotes. Use TSV by default because you may need commas elsewhere. Spaces should be normal single spaces (char 32). Use a text editor, not Word. I'll have a look but I'm not too concerned. (The chemistry and bibliography are harder.)
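As a rough illustration of these normalization rules, here is a minimal Python sketch. The function names and the exact set of dash and quote code points are assumptions chosen for the example; the real data may contain other variants, so the checker at the end is meant to surface whatever is left over.

```python
import re

# Dash-like characters mapped to hyphen-minus (U+002D). Assumed set.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"

TRANSLATION = {ord(ch): "-" for ch in DASHES}
TRANSLATION.update({
    0x2018: "'", 0x2019: "'", 0x201A: "'", 0x201B: "'",  # smart single quotes
    0x201C: '"', 0x201D: '"', 0x201E: '"', 0x201F: '"',  # smart double quotes
})


def normalize_cell(text):
    """Apply the dash, quote and space rules to one table cell."""
    text = text.translate(TRANSLATION)
    # Collapse any whitespace run (tabs, no-break spaces, ...) to one char-32 space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def non_ascii_chars(text):
    """Return the distinct non-ASCII characters still present, sorted."""
    return sorted({ch for ch in text if ord(ch) > 127})


if __name__ == "__main__":
    raw = "Abies\u00a0alba  \u2013 \u201cMill.\u201d"
    cleaned = normalize_cell(raw)
    print(cleaned)
    print(non_ascii_chars(cleaned))
```

Running `non_ascii_chars` over every cell after normalization gives a short list of characters that still need a decision by hand.
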


EmanuelFaria commented 4 years ago

These are all good points Peter,

I’ll be taking care of this as the final step before Gita and you get a last look before import.

With what little time Shruthi has on this project, getting the data to be true and correct should be her main focus. If she can do this without endangering the chances of having true, correctly spelled data, that’s great. But it is ultimately unnecessary because Gita has made me responsible for that.

Meanwhile…

GO! Shruthi GO! We’re cheering you on to the finish line!!

:D

Manny

Emanuel Faria Founder | Formulator | President emanuel@verriclear.com VERRICLEAR NATURAL SKIN ESSENTIALS LTD. Nature + Science = Success!™
North America: www.verriclear.com http://www.verriclear.com/ South America: www.verriclear.com.br http://www.verriclear.com.br/



petermr commented 4 years ago

On Tue, Aug 6, 2019 at 8:31 AM Manny notifications@github.com wrote:

These are all good points Peter,

I’ll be taking care of this as the final step before Gita and you get a last look before import.

Thanks so much!

With what little time Shruthi has on this project, getting the data to be true and correct should be her main focus.

Absolutely agreed.

If she can do this without endangering the chances of having true, correctly spelled data, that’s great. But ultimately unnecessary because Gita has made me responsible for that.

It is MUCH easier now. By resolving against GBIF and Wikipedia/Wikidata we don't have to worry about spelling because they take care of it. So GBIF=2685484 Wikidata=Q146992 species=Abies alba is ALL we have to know for the first entry. Everything else can be looked up.

"GBIF, what is the preferred taxonomic authority for Abies alba?" "Abies alba Mill."

"Wikidata, what is the common name for Q146992 in Portuguese?" "abeto-prateado"

In particular those two authorities work closely together. They will automatically update when:

Also you can automatically ask: "What is the IUCN status of Q146992?" "Least concern"
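To make the "store three identifiers, look up the rest" idea concrete, here is a minimal Python sketch. The `PlantRef` class and `label_in` helper are hypothetical names invented for this example. The two URL patterns reflect my understanding of the public GBIF species API and the Wikidata entity-data endpoint; they are not taken from this thread and should be checked against the current documentation before use. No network call is made here.

```python
from dataclasses import dataclass

# Assumed public endpoints; verify against current GBIF and Wikidata docs.
GBIF_SPECIES_API = "https://api.gbif.org/v1/species/{key}"
WIKIDATA_ENTITY = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


@dataclass(frozen=True)
class PlantRef:
    """The minimum a plant record would need to store (hypothetical type)."""
    species: str
    gbif_key: int
    wikidata_id: str

    def gbif_url(self):
        """URL from which the GBIF record for this key could be fetched."""
        return GBIF_SPECIES_API.format(key=self.gbif_key)

    def wikidata_url(self):
        """URL from which the Wikidata entity JSON could be fetched."""
        return WIKIDATA_ENTITY.format(qid=self.wikidata_id)


def label_in(entity_json, qid, lang):
    """Read a label in one language from Wikidata entity JSON, or None."""
    labels = entity_json.get("entities", {}).get(qid, {}).get("labels", {})
    return labels.get(lang, {}).get("value")


if __name__ == "__main__":
    abies = PlantRef(species="Abies alba", gbif_key=2685484, wikidata_id="Q146992")
    print(abies.gbif_url())
    print(abies.wikidata_url())
```

Fetching either URL (for example with `urllib.request`) and passing the parsed Wikidata JSON to `label_in(data, "Q146992", "pt")` would be the programmatic form of the Portuguese common-name question above.
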

In this way the things that EssoilDB has to maintain are:

then:

This design is implicit in the poster which should be an initial guide

I think this is a great time to design in the features that you would find useful. It's a relatively small knowledgebase so systems such as NoSQL or Tidyverse should be considered. Also I want to store a LOT more of the original papers if that would be useful.

Exciting!

I'd very much like to talk again over Skype. I think just you and me if Gita is busy.


petermr commented 4 years ago

@Shruthi-M I have found your *.docx file and this looks very good. Am reading it.

petermr commented 4 years ago

Using Word documents for docs on GitHub is not normally a good idea for several reasons.

  • Word can introduce spurious characters, especially line ends, smart quotes etc.
  • GitHub is designed for code, Word is not.

In particular, displaying screenshots of code can be very frustrating for people who want to use them. They have to retype them and will make mistakes. People want to cut and paste and run. (Same goes for the species/output.) [Screenshots can be useful for tutorials and web pages but the original should always be available.]

Can you put the code in an R format (notebook-like)? That's the best way.

Shruthi-M commented 4 years ago

Using Word documents for docs on GitHub is not normally a good idea for several reasons.

  • Word can introduce spurious characters, especially line ends, smart quotes etc.
  • GitHub is designed for code, Word is not.

In particular, displaying screenshots of code can be very frustrating for people who want to use them. They have to retype them and will make mistakes. People want to cut and paste and run. (Same goes for the species/output.) [Screenshots can be useful for tutorials and web pages but the original should always be available.]

Can you put the code in an R format (notebook-like)? That's the best way.

Sure, I will look into this.

EmanuelFaria commented 4 years ago

Thanks Peter,

Everything below is great news. If I were the one responsible for deciding final spelling among all the versions and accepted typos on Google, I’d pull my hair out. (What’s left of it.)

Regarding locations, I’ve started this but ran into some trouble trying to parse State/Prov, City, Town and Region names from the original single text field into separate fields for each. I’ve emailed the owner of a world-wide database for help, but have had no response. If I (or preferably Manish) can figure out how we could use such a database to automatically compare words in our Locations table against the World table, and drop each one in the right field, that would be a time-saving miracle, not to mention taking away the fear of getting something wrong.
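One way the comparison against a world table could work is a simple gazetteer lookup. This is a sketch under stated assumptions: the locations are comma-separated, and the world table can be reduced to a mapping from lower-cased place name to field type. The `TOY_GAZETTEER` contents, the field names and the `split_location` function are all hypothetical; a real gazetteer would also need to handle ambiguous names and alternate spellings.

```python
FIELDS = ("country", "state_prov", "city")

# Toy stand-in for a world table: lower-cased name -> field type.
TOY_GAZETTEER = {
    "brazil": "country",
    "minas gerais": "state_prov",
    "belo horizonte": "city",
}


def split_location(raw, gazetteer):
    """Split a free-text location into fields using a gazetteer lookup.

    Parts that are unknown, or whose field is already filled, are kept in
    'unmatched' so nothing is silently discarded.
    """
    result = {field: None for field in FIELDS}
    result["unmatched"] = []
    for part in raw.split(","):
        name = " ".join(part.split())  # trim and collapse inner whitespace
        if not name:
            continue
        kind = gazetteer.get(name.casefold())
        if kind in FIELDS and result[kind] is None:
            result[kind] = name
        else:
            result["unmatched"].append(name)
    return result


if __name__ == "__main__":
    print(split_location("Belo Horizonte, Minas Gerais , Brazil, roadside", TOY_GAZETTEER))
```

Keeping an explicit `unmatched` list means the uncertain cases can be reviewed by hand instead of being guessed.
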

I have a bit more to do on the updated Compound (AND PLANT) activities table (because lots of journal articles talk about plant oils having activities, without naming specific constituents). All the current IDs will be preserved, and I have a plan to make it easy to connect current entries that list more than one activity in the same record field.

Looking forward to getting the cleanups done for the team in a precise yet quick manner.

I have a list of things I’ve found and stored in a “clean up” database, so I can copy and paste the non-space spaces and other anomalies.

Whenever you’re ready, send me your list, and I’ll go through them all methodically (checklist-style), and turn to you if anything strange happens before uploading for your final once-over.

I’d love to chat with you too!

I’m in the middle of some extremely tedious, but very important work for the next couple of days, but perhaps Thursday or Friday? Keep in mind I’m in Brazil, so we can work out a good time for both of us.

If you use an iPhone, I use this app to quickly find good times to meet across two or more time zones at once: https://apps.apple.com/ca/app/timescroller-time-zone-utility/id288013812

Talk soon on Skype! (Skype name: mannyrules … I was feeling good about myself that day, and didn’t know my account name would end up being my public user name haha)

Good day or night to you, wherever you are.

Manny


petermr commented 4 years ago

On Tue, Aug 6, 2019 at 4:22 PM Manny notifications@github.com wrote:

Thanks Peter,

Everything below is great news. If I were the one responsible for deciding final spelling among all the versions and accepted typos on Google, I’d pull my hair out. (What’s left of it.)

EssoilDB1.0 is a finite task. I suspect we don't need to tidy up the whole of the long tail. EssoilDB2.0 will be wonderfully different.

Regarding locations, I’ve started this but ran into some trouble trying to parse State/Prov, City, Town, Region names from the original single text field into separate fields for each. I’ve emailed the owner of a world-wide database for help, but no response. If I (or preferably Manish) can figure out how we could use such a database to automatically compare words in our Locations table against the World table, and drop it in the right field, that would be a time-saving miracle — not to mention taking the fear of getting something wrong.

The Open community - Wikipedia and others - have some solutions here. I'll tweet it.

I have a bit more to do on the updated Compound (AND PLANT) activities table (because lots of journal articles talk about plant oils having activities, without naming specific constituents).

Let's talk about this. I was under the impression that the activities in E1.0 were inserted from external sources and not from the paper. But I may be wrong. If it is extracting them from the paper we need to talk.

All the current IDs will be preserved, and I have a plan to make it easy to connect current entries that list more than one activity in the same record field.

We really need Gita's view on this.

I have a list of things I’ve found and stored in a “clean up” database, so I can copy and paste the non-space spaces and other anomalies.

The database is small enough it fits in Github easily. EssOilDB/v1.0/info_c.tsv is only 38 Mbyte.

Whenever you’re ready, send me your list, and I’ll go through them all methodically (checklist-style), and turn to you if anything strange happens before uploading for your final once-over.

Ambarish is/has_been working on this. In any case all the data is on Github so we don't need to send it.

I’d love to chat with you too!

I’m in the middle of some extremely tedious, but very important work for the next couple of days, but perhaps Thursday or Friday? Keep in mind I’m in Brazil, so we can work out a good time for both of us.

I have some ideas about Open Science in LatAm which I'll explain later.

If you use an iPhone, I use this app to quickly find good times to meet across two or more time zones at once: https://apps.apple.com/ca/app/timescroller-time-zone-utility/id288013812

Talk soon on skype! (Skype name: mannyrules … I was feeling good about myself that day, and didn’t know my account name would end up being my public user name haha)

I shall be in Edinburgh Thu and Friday. I am happy to try times in the UK in the afternoon and evening.

What I'd like for V2.0 is some use cases. I can't guarantee that they would all be supported. However, I would be optimistic about experimental methodology for extraction. Activities will be harder.


Shruthi-M commented 4 years ago

Greetings! I have added a new file at EssOilDB/tables/plant called details.txt. I have added it in the .txt format (not .xlsx) as requested. I can separately send the .xlsx file as well (if needed). I noticed that some changes needed to be made in the synonyms column, and I have made them. I will send the text version of the R code soon.

Thank you.