gilienv / EssOilDB

Restructuring of Essential Oil Database
Apache License 2.0

Import1.0 bibliography #90

Open petermr opened 5 years ago

petermr commented 5 years ago

@ambarishK has created and uploaded a table of the bibliography from V1.0. I have moved this to

EssoilDB/tables/bibliography/import1.0.csv

and I have exported as

EssoilDB/tables/bibliography/import1.0.tsv

NOTE: There are encoding problems, and some of the titles and authors are corrupted. However, I suggest that if we can recover the DOI, we can retrieve the title from Crossref when we need it, and that we should accept Crossref's title/authors.
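As a rough illustration of the Crossref route, the sketch below builds the Crossref REST API URL for a DOI and extracts the first title from the JSON response. It assumes `curl` and `jq` are installed; `crossref_url` and `crossref_title` are hypothetical helper names, not scripts in the repository.

```shell
#!/bin/sh
# Sketch only: crossref_url and crossref_title are hypothetical helpers.

# Build the Crossref "works" URL for one DOI.
crossref_url() {
    printf 'https://api.crossref.org/works/%s\n' "$1"
}

# Fetch the Crossref record and print the first title it holds (needs curl and jq).
crossref_title() {
    curl -s "$(crossref_url "$1")" | jq -r '.message.title[0]'
}

# Show the URL that would be requested for one DOI from the sample.
crossref_url "10.1080/10412905.1998.9700889"
```

Calling `crossref_title "10.1080/10412905.1998.9700889"` would then print Crossref's version of the title, which could replace a corrupted one in the table.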

petermr commented 5 years ago

A sample of 100 entries https://github.com/gilienv/EssOilDB/blob/master/tables/bibliography/sample.tsv looks like:

title | author | DOI_link | DOI | vol | JOURNAL | profile_c
(Z)-ë_-Ocimene from... | Joseph J. Brophy ... | https://doi.org/10.1080/10412905.1998.9700889 | 10.1080/10412905.1998.9700889 | VOL. 10, 229-233 (Mar/Apr 1998) | Journal of Essential Oil Research | JEHflau1998Lea#JEHmoau1998Lea
1, 8-Cineole-Caryophyllene ... | Danute Mockute, ... | https://doi.org/10.1080/10412905.2004.9698708 | 10.1080/10412905.2004.9698708 | VOL. 16, 236-238 (May/June 2004) | Journal of Essential Oil Research | JETseli2004Aer
1,10-beta-Epoxy-6-oxofura ... royleanus DC. | ... | https://doi.org/10.1080/10412905.2011.9700434 | 10.1080/10412905.2011.9700434 | VOL. 23, 102-104 (Jan/Feb 2011) | Journal of Essential Oil Research | JESrokeutin2011Lea#JESrokeutin2011Ste#JESrokeutin2011Flo#JESrokeutin2011Aer

Looks useful.

ACTION Need a unique ID for each row. Format EBib0001234
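One possible way to assign such IDs is sketched below. It assumes a tab-separated file with a header row and takes the format to be "EBib" followed by a seven-digit zero-padded row number, as in the example EBib0001234; `add_ids` is a hypothetical helper name.

```shell
#!/bin/sh
# add_ids (hypothetical): read a TSV on stdin and prefix each row with an ID column.
add_ids() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == 1 { print "id", $0; next }          # header row gets the column name
         { printf "EBib%07d\t%s\n", NR - 1, $0 }   # data rows are numbered from 1
    '
}

# Small demonstration with two made-up rows.
printf 'title\tDOI\nFirst paper\t10.1000/a\nSecond paper\t10.1000/b\n' | add_ids
```

Numbering by row position means the IDs are only stable if the row order is fixed before they are assigned.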

ACTION Remove columns. In the production table (not this one), the DOI link and profile_c will be redundant. The title, authors and journal will be retrieved from Crossref or another authority.
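Dropping the two redundant columns could look like the sketch below. It assumes the seven-column, tab-separated layout of the sample (title, author, DOI_link, DOI, vol, JOURNAL, profile_c); `drop_redundant_columns` is a hypothetical name.

```shell
#!/bin/sh
# drop_redundant_columns (hypothetical): keep columns 1, 2, 4, 5 and 6 of a TSV,
# that is, drop DOI_link (column 3) and profile_c (column 7).
drop_redundant_columns() {
    cut -f1,2,4-6
}

# Demonstration on the header row of the sample.
printf 'title\tauthor\tDOI_link\tDOI\tvol\tJOURNAL\tprofile_c\n' | drop_redundant_columns
```

`cut` splits on tabs by default, so this only applies to the TSV export, not to a CSV with quoted commas.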

petermr commented 5 years ago

It will be useful to resolve the DOIs in EuropePMC to see how many of these are OpenAccess.

I have used the EPMC API to retrieve metadata for each of the bibliographic entries (1402). See: https://github.com/gilienv/EssOilDB/tree/master/tables/bibliography/epmc

The script: https://github.com/gilienv/EssOilDB/tree/master/tables/bibliography/epmc/epmcopen.sh uses curl to retrieve metadata.

#! /bin/sh
# Retrieve Europe PMC metadata for each DOI, pausing 1 s between requests.
# Each URL is quoted so that the shell does not treat "&" as a background operator.

sleep 1
curl -k -o 10.1002_ffj.1019.xml "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=10.1002/ffj.1019&format=xml"
sleep 1
curl -k -o 10.1002_ffj.1047.xml "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=10.1002/ffj.1047&format=xml"
sleep 1
curl -k -o 10.1002_ffj.1048.xml "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=10.1002/ffj.1048&format=xml"
...
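Rather than listing one curl command per DOI, the same requests could be generated from a list of DOIs. The sketch below assumes one DOI per line on standard input; `epmc_filename`, `epmc_url` and `fetch_all` are hypothetical helper names, and unlike the script above it does not pass `-k`, so TLS certificates are verified.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers that generate the per-DOI requests.

# Output file name: the DOI with "/" replaced by "_", plus ".xml".
epmc_filename() {
    printf '%s.xml\n' "$(printf '%s' "$1" | tr '/' '_')"
}

# Europe PMC search URL in the same form as used by epmcopen.sh.
epmc_url() {
    printf 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=%s&format=xml\n' "$1"
}

# Read one DOI per line on stdin and fetch each, pausing 1 s between requests.
fetch_all() {
    while IFS= read -r doi; do
        [ -n "$doi" ] || continue      # skip empty lines
        sleep 1
        curl -s -o "$(epmc_filename "$doi")" "$(epmc_url "$doi")"
    done
}

# Show the file name and URL for one DOI without contacting the server.
epmc_filename "10.1002/ffj.1019"
epmc_url "10.1002/ffj.1019"
```

With a file of DOIs (say a hypothetical `dois.txt`), `fetch_all < dois.txt` would then download one metadata file per entry.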

For each paper there is a metadata file *.xml which can be interrogated for the phrase:

<isOpenAccess>Y</isOpenAccess>
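A minimal sketch of that interrogation, assuming the metadata files sit together in one directory: `list_open_access` (a hypothetical name) prints the files that contain the open-access flag.

```shell
#!/bin/sh
# list_open_access DIR (hypothetical): print the *.xml files in DIR that
# contain the Europe PMC open-access flag.
list_open_access() {
    grep -l '<isOpenAccess>Y</isOpenAccess>' "$1"/*.xml
}
```

For example, `list_open_access epmc | wc -l` would count the open-access entries if the files are in a directory named `epmc`.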

In V1.0 there are very few OA articles, and we'll download them. But in the wider world there are many more that can go into V2.0.

ambarishK commented 5 years ago

Sir, please go through the bibliography table, now with a unique ID and with the DOI_link and profile_c columns removed.

petermr commented 5 years ago

This is not a bibliography; it is a list of titles. It's not useful. Where are the DOIs? At this stage we should retain the rest of the fields in this table: journal, authors, pages, year.

On Mon, Aug 5, 2019 at 9:32 AM Ambarish Kumar notifications@github.com wrote:

Sir, Please go through the [bibliography](https://github.com/gilienv/EssOilDB/blob/master/tables/bibliography/bibliography050819.csv) table with uniqueID and removed columns - DOI_link and profile_c.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

ambarishK commented 5 years ago

Sir, the following columns are in the bibliography table.

petermr commented 5 years ago

Thank you. This looks fine. Have you checked for duplicates? And are all the characters Unicode/UTF-8?
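A duplicate check could be sketched as follows. It assumes a tab-separated export of the table and that the caller knows which column holds the DOI; `duplicate_values` is a hypothetical helper name.

```shell
#!/bin/sh
# duplicate_values COLUMN (hypothetical): read a TSV on stdin and print every
# value that occurs more than once in the given column.
duplicate_values() {
    cut -f"$1" | sort | uniq -d
}

# Demonstration: the DOI in column 2 appears twice.
printf 'EBib1\t10.1000/a\nEBib2\t10.1000/b\nEBib3\t10.1000/a\n' | duplicate_values 2
```

An empty result means no value in that column is repeated. A header row is harmless here because it occurs only once.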

ambarishK commented 5 years ago

Sir, the character encoding is not Unicode/UTF-8.

e.g.:

  1. Analysis of Essential Oils from Wild and Domesticated Plants of Glechoma sardoa Bég.
  2. (Z)-β-Ocimene from Two Species of Homoranthus (Myrtaceae).

petermr commented 5 years ago

I get

EBib00050,Analysis of Essential Oils from Wild and Domesticated Plants of Glechoma sardoa Bég
EBib0001,(Z)-β-Ocimene

when displayed in Textmate.

This may be a problem of Excel and not the file itself. By default Excel does not use UTF-8 - you have to find how to import, e.g. https://www.nextofwindows.com/how-to-display-csv-files-with-unicode-utf-8-encoding-in-excel .

My first real error is:

EBib00078,Antiaflatoxigenic and antioxidant activity of an essential oil from Ageratum conyzoides L.,"Rajaram P Patil, Mansingraj S Nimbalkar, Umesh U Jadhav, Vishal V Dawkarc and Sanjay P Govindwarc",�10.1002/jsfa.3857,"VOL.90,608–614(2010)",Journal of Sci Food Agri

However, the way to solve this is probably to import the titles and authors from Crossref or to hand-edit. DO NOT USE WINDOWS SOFTWARE (Word, Notepad, Excel) as it commonly defaults to non-UTF-8 encodings.
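To find candidate rows without opening the file in an editor, one option is to list every line that contains a byte outside printable ASCII. This is only a sketch: `non_ascii_lines` is a hypothetical name, and legitimate characters such as β or é are reported too, so the output is a list to review rather than a list of errors.

```shell
#!/bin/sh
# non_ascii_lines (hypothetical): print "line-number:text" for each input line
# containing a byte that is neither printable ASCII nor whitespace.
non_ascii_lines() {
    LC_ALL=C grep -n '[^[:print:][:space:]]'
}

# Demonstration: only the second line contains a non-ASCII character (é).
printf 'plain ASCII title\nGlechoma sardoa B\303\251g\n' | non_ascii_lines
```

Setting `LC_ALL=C` makes grep compare single bytes, so the result does not depend on the terminal's locale.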

petermr commented 5 years ago

Checking that there are 1752 entries. Are there any ambiguities?

ambarishK commented 5 years ago

No, sir. There is no ambiguity in mapping DOIs to title, author and journal using the Crossref API.

Yes, there are 1752 entries.

petermr commented 5 years ago

Thank you. I agree that the bibliography is now close to finalized. The top priority is now to create a profile table and link to the others.

ambarishK commented 5 years ago

Yes sir.

ambarishK commented 5 years ago

Sir,

Only two records have a diamond mark at the beginning of their DOI. They can be hand-edited.

EBib00078 | Antiaflatoxigenic and antioxidant activity of an essential oil from Ageratum conyzoides L. | Rajaram P Patil, Mansingraj S Nimbalkar, Umesh U Jadhav, Vishal V Dawkarc and Sanjay P Govindwarc | �10.1002/jsfa.3857 | VOL.90,608–614(2010) | Journal of Sci Food Agri
EBib000288 | Chemical Composition of Artemisia absinthium L. from Greece | A. Basta, O. Tzakou, M. Couladis & M. Pavlović | �10.1080/10412905.2007.9699291 | VOL. 19, 316-318 (July/Aug 2007) | Journal of Essential Oil Research

petermr commented 5 years ago

Thank you. Do you know what the problematic characters are? Are they printing or non-printing? My guess is that they will be spaces or punctuation. (They may occur in other files.)
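One way to answer this is to dump the bytes of the affected field and then strip whatever precedes the DOI. The sketch below assumes the field is available on standard input; `show_bytes` and `strip_doi_prefix` are hypothetical names. The diamond is often U+FFFD, the Unicode replacement character (bytes ef bf bd in UTF-8), but the dump would show what is really in the file.

```shell
#!/bin/sh
# show_bytes (hypothetical): print the input as hexadecimal bytes.
show_bytes() {
    od -An -tx1
}

# strip_doi_prefix (hypothetical): remove any leading non-digit bytes that
# come before the "10." at the start of a DOI; clean DOIs pass through unchanged.
strip_doi_prefix() {
    LC_ALL=C sed 's/^[^0-9]*\(10\.[0-9]\)/\1/'
}

# Demonstration with U+FFFD (octal 357 277 275) in front of a DOI from the thread.
printf '\357\277\27510.1002/jsfa.3857\n' | show_bytes
printf '\357\277\27510.1002/jsfa.3857\n' | strip_doi_prefix
```

If the dump shows a different byte, such as a Latin-1 non-breaking space, the same stripping still applies because it removes any non-digit bytes before the DOI.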

Which table did you create the bibliography from? That can be a basis for the Profile table. Is it on GitHub?

ambarishK commented 5 years ago

Sir,

These are extra characters introduced into the DOI. When I look these titles up on the web, the articles found do not contain any of the problematic characters appearing here.

e.g.:

J Sci Food Agric. 2010 Mar 15;90(4):608-14. doi: 10.1002/jsfa.3857.
Antiaflatoxigenic and antioxidant activity of an essential oil from Ageratum conyzoides L.
Patil RP, Nimbalkar MS, Jadhav UU, Dawkar VV, Govindwar SP

Chemical Composition of Artemisia absinthium L. from Greece
A. Basta, O. Tzakou, M. Couladis & M. Pavlović
Pages 316-318 | Received 01 Oct 2005, Accepted 01 Feb 2006, Published online: 28 Nov 2011
https://doi.org/10.1080/10412905.2007.9699291

The bibliography information was extracted from the plant info table.

ambarishK commented 5 years ago

Bibliography table with unique records.

Records are made unique based on title value.
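Deduplicating on one column can be sketched as below, assuming a tab-separated export and that the caller passes the number of the title column; `unique_by_column` is a hypothetical name and not necessarily how the table above was produced.

```shell
#!/bin/sh
# unique_by_column COLUMN (hypothetical): read a TSV on stdin and keep only the
# first row seen for each distinct value in the given column.
unique_by_column() {
    awk -F '\t' -v col="$1" '!seen[$col]++'
}

# Demonstration: the third row repeats the title of the first and is dropped.
printf 'EBib1\tTitle A\nEBib2\tTitle B\nEBib3\tTitle A\n' | unique_by_column 2
```

The comparison is an exact string match, so titles that differ only in case, spacing or a corrupted character would be kept as separate records. Matching on the DOI column instead may be more robust.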

petermr commented 5 years ago

Thank you. This will need linking into the Profile table at some stage.

On Mon, Aug 12, 2019 at 12:16 PM Ambarish Kumar notifications@github.com wrote:

[Bibliography](https://github.com/gilienv/EssOilDB/blob/master/tables/bibliography/bibliographyFinal050819uniqueTitle.csv) table with unique records.
