gilienv / EssOilDB

Restructuring of Essential Oil Database
Apache License 2.0

Normalizing_Plant_name #42

Open vinitamehlawat opened 5 years ago

vinitamehlawat commented 5 years ago

Investigate Ross Mounce's recommendation for R-based tool to resolve plant names.

gilienv commented 5 years ago

Please use the R taxize package. The package is at https://cran.r-project.org/web/packages/taxize/index.html and the tutorial is at https://cran.r-project.org/web/packages/taxize/vignettes/name_cleaning.html

gilienv commented 5 years ago

This has already been completed by @ambarishK. The results are with Manish. Please check and close this issue.

Shruthi-M commented 5 years ago

With reference to the data in the attached file (plant_synonym_match.xlsx), I have identified the following problems so far. Entries 6 and 7 are the same species but have different synonyms. Entries 15 and 16 refer to the same plant, with a small difference in spelling; entry 16 seems to be the form that appears in the research articles, while only entry 15 has synonyms. Entries 45 and 46 seem to be the same but carry different information. The same problem occurs with entries 29 and 30.
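Pairs like entries 15 and 16, which differ only by a small spelling variation, could be flagged automatically before manual review. The following is a minimal sketch using Python's standard `difflib` module; the function name `near_duplicates`, the 0.9 threshold and the sample names are assumptions for illustration and are not taken from the attached file.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.9):
    """Return (name_a, name_b, ratio) for distinct names that look alike."""
    pairs = []
    for a, b in combinations(names, 2):
        # Compare case-insensitively so capitalisation differences are ignored.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if a != b and ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs


# Illustrative names only; the second one is a deliberately misspelled variant.
sample = ["Ocimum basilicum", "Ocimum basillicum", "Mentha piperita"]
flagged = near_duplicates(sample)
for a, b, ratio in flagged:
    print(f"possible duplicate: {a!r} vs {b!r} (similarity {ratio})")
```

A flagged pair is only a candidate: two genuinely different species can have very similar names, so each pair still needs to be checked by someone who knows the plants.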

Shruthi-M commented 5 years ago

I am sending the file with the modifications: essoildb.plantdata (1).xlsx. I have not changed the pid of any plant, as I thought it could be linked to other data. The file containing the deleted names, along with the reasons for the deletions, is also attached: Deletions.docx.

There are some issues on which I need clarification:

  1. pid 560, 179, 920 and 1074 have "sp. nov" after the genus in their binomial names. I have not moved these terms to the "variations" column.
  2. I have also not moved the abbreviations/names of authors (e.g. L., Swingle) to the "variations" column.

As I am not sure what should be done about these, I kindly request you to get back to me.

I have run the taxize package (gnr_resolve) on the data in R, and I have attached a sample of 100 taxize results as well as the complete results: sample_taxize_result.xlsx, taxize_table.xlsx. The results need further analysis (e.g. the information retrieved from the data source "Catalogue of Life" shows variations in the binomial names). The tool did not return results for all the names (1804 results were obtained when 1809 names were loaded).

ambarishK commented 5 years ago

The spelling of plant species entries needs to be corrected.

  1. Please do not discard misspelled entries. Instead, correct them using a wiki search.
  2. If an entry contains only the genus or only the species epithet instead of the full plant species name, add the missing part to complete the entry.
  3. If plant species differ at the sub-species level, please do not discard them. Remove them from the list only if the sub-species has a different author.

I have a list of 642 entries that did not generate search results because of spelling errors, so their spelling needs to be corrected.

petermr commented 5 years ago

Thank you for this clear and careful analysis. This is a significant improvement to the quality of the plants database.

The question of whether subspecies and hybrids should be treated explicitly is a hard one. It's certainly possible that both of these have different phytochemical profiles. On the other hand, the nomenclature in the articles is probably variable and uncontrolled. I think what you have is a good start. We'll talk next time. (Similar ambiguities occur with chemistry.) We have to remember that authors have different standards of reporting names.

The authority label (e.g. L.) should not be part of the name. It's unlikely that the identity of the plant will change (although there can be ambiguity occasionally).
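One way to separate the authority label from the name is sketched below. It is only an illustration under stated assumptions: the function name `split_name` and the output keys are hypothetical, the rule "capitalised genus followed by an all-lowercase epithet" is a simplification, and infraspecific ranks (subsp., var.) and hybrid markers are not handled and would need a decision from the plant scientists.

```python
import re


def split_name(raw):
    """Split a raw entry into a bare name and the remaining text.

    Assumption: the name is a genus followed by an all-lowercase epithet;
    everything after that (e.g. "L.", "Swingle") goes to "variations".
    """
    tokens = raw.split()
    if not tokens:
        return {"name": "", "variations": ""}
    # An epithet is taken to be lowercase letters only (hyphens allowed).
    if len(tokens) >= 2 and re.fullmatch(r"[a-z-]+", tokens[1]):
        return {"name": " ".join(tokens[:2]), "variations": " ".join(tokens[2:])}
    # No epithet found (e.g. "Genus sp. nov."): keep the genus only.
    return {"name": tokens[0], "variations": " ".join(tokens[1:])}


for entry in ["Ocimum tenuiflorum L.", "Citrus aurantiifolia (Christm.) Swingle"]:
    print(split_name(entry))
```

Keeping the removed text in a separate field, rather than deleting it, means the original entry can always be reconstructed if a split turns out to be wrong.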

I believe that spelling errors should be corrected. In some cases we may also need to disambiguate older synonyms (Ocimum sanctum vs O. tenuiflorum). Our job is to make the best estimate of what the plant was but not.

On Fri, Jun 28, 2019 at 11:49 AM Shruthi-M notifications@github.com wrote:

I am hereby sending the file with the modifications. I have not changed the pid of any plant as I thought it could be linked to other data. essoildb.plantdata (1).xlsx https://github.com/gilienv/EssOilDB/files/3338676/essoildb.plantdata.1.xlsx The file containing the deleted names along with the reasons for deletions is also attached below. Deletions.docx https://github.com/gilienv/EssOilDB/files/3338691/Deletions.docx There are some issues for which I need some clarifications. I have listed them below:

  1. pid - 560, 179, 920, 1074 -> have "sp. nov" after the genus in their binomial names. I have not shifted these terms to the column titled "variations".

See Wikipedia. https://en.wikipedia.org/wiki/Species_nova. I think it can be dropped.

  2. Also, I have not moved the abbreviations/ names of authors (eg: L., Swingle) to the column titled "variations".

I think we should.

As I am not sure of what has to be done regarding these, I kindly request you to get back to me about the same. I have run the package taxize (gnr_resolve) for the data on R. I have attached the sample of 100 taxize results and the complete taxize results also. This result needs further analysis (eg.: The information retrieved from the data source - "Catalogue of Life" has resulted in variations in the binomial names). The tool did not show the results of all the names (1804 results were obtained when 1809 names were loaded). sample_taxize_result.xlsx https://github.com/gilienv/EssOilDB/files/3338787/sample_taxize_result.xlsx taxize_table.xlsx https://github.com/gilienv/EssOilDB/files/3338788/taxize_table.xlsx

I will have a look and mail back.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/gilienv/EssOilDB/issues/42?email_source=notifications&email_token=AAFTCS6JRMFDPVCKXRG45GDP4XUCPA5CNFSM4H2QXJ62YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYZYBHY#issuecomment-506691743, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFTCSZTEHQJ2QHAC346CF3P4XUCPANCNFSM4H2QXJ6Q .

-- Peter Murray-Rust Reader Emeritus in Molecular Informatics Unilever Centre, Dept. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

petermr commented 5 years ago

On Fri, Jun 28, 2019 at 1:38 PM Ambarish Kumar notifications@github.com wrote:

The spelling of plant species entries needs to be corrected.

  1. Please do not discard misspelled entries. Instead, correct them using a wiki search.
  2. If an entry contains only the genus or only the species epithet instead of the full plant species name, add the missing part to complete the entry.
  3. If plant species differ at the sub-species level, please do not discard them. Remove them from the list only if the sub-species has a different author.

I have a list of 642 entries that did not generate search results because of spelling errors, so their spelling needs to be corrected.

Please leave this policy to the plant scientists; we discuss it regularly.


petermr commented 5 years ago

On Fri, Jun 28, 2019 at 11:49 AM Shruthi-M notifications@github.com wrote:

I am hereby sending the file with the modifications. I have not changed the pid of any plant as I thought it could be linked to other data. essoildb.plantdata (1).xlsx https://github.com/gilienv/EssOilDB/files/3338676/essoildb.plantdata.1.xlsx The file containing the deleted names along with the reasons for deletions is also attached below. Deletions.docx https://github.com/gilienv/EssOilDB/files/3338691/Deletions.docx There are some issues for which I need some clarifications. I have listed them below:

I have run the package taxize (gnr_resolve) for the data on R. I have attached the sample of 100 taxize results and the complete taxize results also. This result needs further analysis (eg.: The information retrieved from the data source - "Catalogue of Life" has resulted in variations in the binomial names). The tool did not show the results of all the names (1804 results were obtained when 1809 names were loaded). sample_taxize_result.xlsx https://github.com/gilienv/EssOilDB/files/3338787/sample_taxize_result.xlsx taxize_table.xlsx https://github.com/gilienv/EssOilDB/files/3338788/taxize_table.xlsx

I have looked briefly at this - it seems that almost everything matches. It may be useful to record those entries with a serious problem that cannot be resolved. Looks promising.
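Recording the entries that could not be resolved can be done by comparing the submitted names with the names that appear in the results. The sketch below assumes the taxize results have been exported as rows with a column holding the name that was submitted; the column name `user_supplied_name`, the function name `unresolved` and the sample data are assumptions for illustration, not the contents of the attached files.

```python
def unresolved(submitted, result_rows, key="user_supplied_name"):
    """Return submitted names that have no row at all in the results."""
    # Normalise whitespace and case so trivial differences do not hide a match.
    matched = {row[key].strip().lower() for row in result_rows if row.get(key)}
    return [name for name in submitted if name.strip().lower() not in matched]


# Illustrative data only; the third submitted name is a deliberate misspelling.
submitted = ["Ocimum tenuiflorum", "Ocimum basilicum", "Ocimum basillicum"]
rows = [
    {"user_supplied_name": "Ocimum tenuiflorum", "matched_name": "Ocimum tenuiflorum L."},
    {"user_supplied_name": "Ocimum basilicum", "matched_name": "Ocimum basilicum L."},
]

missing = unresolved(submitted, rows)
print(missing)
```

The returned list could be saved alongside the database so that the problem entries stay visible instead of being silently dropped.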


Shruthi-M commented 5 years ago

This is the file containing the complete taxize results: taxize_result_2.xlsx. After reviewing the results, I noticed that Serotinocarpum insignis is the only taxon that is not recognized by the taxize package. I checked whether the taxon exists in The Plant List database (one of the databases not used by taxize) using another R package, Taxonstand, which also returned FALSE. This is the article from which the name is taken: https://www.tandfonline.com/doi/pdf/10.1080/10412905.2004.9698667?needAccess=true. I was unable to find any mention of this taxon in any other article or data source.

petermr commented 5 years ago

Well done.

I am not an expert but it looks as if they have got this garbled. We should keep the binomial, but note in "details" field that this species cannot be identified.

Make sure that your work is documented.

Do you have issues you want to talk about? We can arrange a telcon, maybe tomorrow?

And I think that we can start thinking about the chemistry. I can't remember what software or services were recommended.

On Tue, Jul 2, 2019 at 10:57 AM Shruthi-M notifications@github.com wrote:

This is the file containing the complete taxize results: taxize_result_2.xlsx https://github.com/gilienv/EssOilDB/files/3349356/taxize_result_2.xlsx After reviewing the results, I noticed that the plant - Serotinocarpum insignis is the only taxon which is not being recognized by the taxize package. I verified the taxon's existence with The Plant List database (one of the databases not used by taxize) using another R package - Taxonstand, which also yielded the result as FALSE. This is the article from which it is taken - https://www.tandfonline.com/doi/pdf/10.1080/10412905.2004.9698667?needAccess=true . I was unable to find the mention of this taxon in any other article or datasource.


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

Shruthi-M commented 5 years ago

On Tue, 2 Jul 2019 at 15:58, petermr notifications@github.com wrote:

Well done.

Thank you Sir.

I am not an expert but it looks as if they have got this garbled. We should keep the binomial, but note in "details" field that this species cannot be identified.

Ok. I shall do accordingly.

Make sure that your work is documented.

Yes Sir, all the work is being documented.

Do you have issues you want to talk about? We can arrange a telcon, maybe tomorrow?

No Sir. As of now, the process is proceeding smoothly.

And I think that we can start thinking about the chemistry. I can't remember what software or services were recommended.

Sure. We can start this step. I am not aware of any software or service. Would it be possible to have a telcon to discuss the software and the tasks that have to be performed next?

On Tue, Jul 2, 2019 at 10:57 AM Shruthi-M notifications@github.com wrote:

This is the file containing the complete taxize results: taxize_result_2.xlsx https://github.com/gilienv/EssOilDB/files/3349356/taxize_result_2.xlsx After reviewing the results, I noticed that the plant - Serotinocarpum insignis is the only taxon which is not being recognized by the taxize package. I verified the taxon's existence with The Plant List database (one of the databases not used by taxize) using another R package - Taxonstand, which also yielded the result as FALSE. This is the article from which it is taken - https://www.tandfonline.com/doi/pdf/10.1080/10412905.2004.9698667?needAccess=true . I was unable to find the mention of this taxon in any other article or datasource.
