rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library
http://biostor.org

Add articles for Contributions in Science ISSN 0459-8113 #45

Closed rdmpage closed 7 years ago

rdmpage commented 7 years ago

Lots of articles already in BioStor, but there doesn't seem to be an easily accessible list of articles from this journal.

suwiding commented 7 years ago

@rdmpage The Natural History Museum of Los Angeles County is a BHL member. I or another member of the EABL project team will contact them directly to see if they have complete article metadata to give us. Where did you get the metadata for the articles that you already have? Just curious and trying to learn...

rdmpage commented 7 years ago

Hi Susan, the metadata for the articles comes from several sources:

  1. My BioNames project has a lot of article data derived from http://organismnames.com. For those articles I generate a RIS file and then run that through BioStor. This data lacks authors, which I then add manually.
  2. I've scraped http://www.nhm.org/site/research-collections/research-tools/publications for a list of titles, and I'm now writing code to compute page ranges from BHL. For cases where there is more than one article in the same BHL item and each article has the same starting page number (the hardest case for BioStor), I'm adding BHL PageIDs to the records. I'm also adding authors and dates from the BHL scans.

Rod

rdmpage commented 7 years ago

@suwiding To see what progress I've made so far, here's a Google Docs spreadsheet I'm using to extract articles: https://docs.google.com/spreadsheets/d/1d2FOxKiNnGDf2rt36-x-iIuCtj-tJZln96hZjrkgGiY/edit?usp=sharing See also http://direct.biostor.org/issn/0459-8113

marissakings commented 7 years ago

Hi @rdmpage, here are the citations for Contributions in Science articles not already in BHL. We used the Notes field to list the starting page ID number and the Database Provider field to indicate that NHMLAC is the contributor. Please let me know if there are any issues with the file. Thanks!

Corrected_Contributions in Science Citations.txt: https://github.com/rdmpage/biostor/files/1149790/Corrected_Contributions.in.Science.Citations.txt

rdmpage commented 7 years ago

@marissakings Many thanks for this. A couple of minor things. There doesn't seem to be a field for the journal, something like

JO  - Contributions...

Also, ideally the first and last pages would be in separate fields, so that the first page is

SP  - first page
EP  - last page

I can tweak my code to handle both pages in the SP field, and I know that some programs put both pages in the SP field.
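As an illustration of that kind of tweak, here is a minimal Python sketch that splits a combined SP value into separate SP and EP fields. It is not BioStor's actual code: the function names `split_page_range` and `normalise_record` are hypothetical, and the record is assumed to be a simple dict of RIS tags to values.

```python
import re


def split_page_range(sp_value):
    """Split an SP value such as '1-8' into (start, end) strings.

    Returns (start, None) when only one page is present.
    """
    # Accept a hyphen or an en dash between the two page numbers.
    match = re.fullmatch(r"\s*(\w+)\s*(?:[-\u2013]\s*(\w+))?\s*", sp_value)
    if match is None:
        raise ValueError(f"Unrecognised page value: {sp_value!r}")
    return match.group(1), match.group(2)


def normalise_record(fields):
    """Return a copy of a tag -> value dict with SP and EP separated."""
    out = dict(fields)
    # Only touch records that have SP but no explicit EP.
    if "SP" in out and "EP" not in out:
        start, end = split_page_range(out["SP"])
        out["SP"] = start
        if end is not None:
            out["EP"] = end
    return out


print(normalise_record({"SP": "1-8"}))
```

Records that already carry an EP field are left untouched, so files from programs that write both fields pass through unchanged.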

None of these issues is a show stopper, so I'll look at adding these articles as soon as I can. Many thanks for putting together this list.

rdmpage commented 7 years ago

@marissakings I've started to add some of these articles, see http://biostor.org/issn/0459-8113/year/1963. I've discovered another "gotcha", namely where the article starts. Some articles have the cover as page 1; others start numbering the article a couple of pages in. It looks like the file always has the BHL PageID for the cover, which may or may not be the start of the article. BioStor relies on the page numbering matching the actual pages we need to extract, so this can lead to some problems. I think I can work around this by offsetting the PageID (the N1 field) where needed. I'll let you know when I've processed all the articles in the file.
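The PageID offset described above could be sketched roughly as follows. This is only an illustration, not BioStor's code: `article_start_page_id` and its arguments are hypothetical, and it assumes the PageIDs of the BHL item are already available as a list in scan order. Looking the cover up in that list avoids assuming that PageIDs are consecutive integers.

```python
def article_start_page_id(item_page_ids, cover_page_id, pages_before_article):
    """Return the PageID on which the article itself starts.

    item_page_ids        -- PageIDs of the BHL item in scan order (assumed input)
    cover_page_id        -- PageID recorded in the N1 field (the cover)
    pages_before_article -- number of scans between the cover and page 1
    """
    # Find the cover in the item, then step forward by the offset.
    index = item_page_ids.index(cover_page_id) + pages_before_article
    if index >= len(item_page_ids):
        raise IndexError("Offset runs past the end of the item")
    return item_page_ids[index]


# Made-up PageIDs: the cover is 100 and the article starts two scans later.
page_ids = [100, 101, 102, 105, 106]
print(article_start_page_id(page_ids, 100, 2))
```

An offset of 0 covers the case where the cover is itself page 1 of the article.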

marissakings commented 7 years ago

Oh dear. It sounds like the citations aren't as consistent as I thought they were. I've also found after reviewing what is already in BHL that I left two articles off of the file I sent you (volumes 507 and 508). Since there are gaps with the Journal and Start/End Pages fields, would it be best if I just re-uploaded a corrected version of the text file with all of the citations?

rdmpage commented 7 years ago

@marissakings No need to redo everything, I can add those two manually.

suwiding commented 7 years ago

@rdmpage I feel the need to confess that most of the problems are due to 'help' that I provided to @marissakings . I apologize for the problems.

rdmpage commented 7 years ago

@suwiding No worries, the journal seems unable to make up its mind where to start its pagination, so it makes things "interesting". Having the file @marissakings sent is a big help (having the BHL PageIDs saves lots of time), and I'm slowly working through it to add all the articles.

marissakings commented 7 years ago

@suwiding I wouldn't have known where to start if not for all of your help, especially with Python! @rdmpage, I can upload a text file with the two missing citations if that would help?

rdmpage commented 7 years ago

@marissakings Yes, having the two missing citations would be great.

marissakings commented 7 years ago

@rdmpage Here are the citations for 507-508.

CIS Citations 507_508.txt

rdmpage commented 7 years ago

@marissakings and @suwiding No doubt a closer look will uncover some issues, but I think we're pretty much done. Thanks for all your help, it made a huge difference having the metadata available.

marissakings commented 7 years ago

Thanks for making this happen @rdmpage!

For future additions (NHMLAC has a few more publications that may get added, and we'd be creating records from scratch again), I just wanted to confirm all of the fields that we would need to have entered. I've attached a sample file for the CIS article "Notes on a Brazilian mouse, Blarinomys breviceps (Winge)" by Abravaya and Matson. We're currently using the free web version of EndNote to generate the text files in RIS format, and there isn't an End Page field, so I put both the start and end page in the Start Page field. I'm also guessing that if both the start pages and end pages are in the SP field, we wouldn't need to include the start page in the Notes field?

Sample RIS.txt

rdmpage commented 7 years ago

@marissakings Just to be clear, the SP field has the start and end pages for the printed version

SP  - 1-8

and the SE field (which I've not seen before) has the start and end BHL PageIDs of the corresponding pages.

SE  - 52335475-52335482

If so, this would be fine for BioStor.
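A quick sanity check on such a record is to confirm that the printed page range and the PageID range span the same number of pages. The sketch below is only an assumption about how one might do this, not part of BioStor: `parse_range` and `ranges_agree` are hypothetical names, pages are assumed to be plain integers, and PageIDs within an article are assumed to be consecutive, which BHL does not guarantee. A mismatch is therefore a hint to look closer, not proof of an error.

```python
def parse_range(value):
    """Parse 'start-end' (or a single number) into a pair of ints."""
    start, _, end = value.partition("-")
    start = int(start)
    # A single value means a one-page range.
    end = int(end) if end else start
    return start, end


def ranges_agree(sp_value, se_value):
    """True if the SP page range and SE PageID range have the same length."""
    sp_start, sp_end = parse_range(sp_value)
    se_start, se_end = parse_range(se_value)
    return (sp_end - sp_start) == (se_end - se_start)


# The values quoted in the comment above.
print(ranges_agree("1-8", "52335475-52335482"))
```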

Depending on how detailed you want to be, adding month and day to the PY field would be handy. Taxonomists in particular like precise dates.

marissakings commented 7 years ago

@rdmpage - Ah, ok - EndNote has different fields for pages and start pages - attached is a screenshot of the core fields we used when creating a record. If using both the SP and SE fields is fine for BioStor, I'll make a note to do that in the future. I'll make a note to add both day and month as well if that information is handy.

sample ris_endnote

marissakings commented 7 years ago

Hi @rdmpage, it looks like BHL harvested the remaining articles overnight. I spotted a few that didn't get uploaded - should I create a new text file with those citations to re-upload?

There was also one strange sorting issue - volume 41 is being displayed between volumes 46 and 48 in BHL (47 is one of the missing volumes). Do you or @suwiding know if this is a problem with the citation or something on BHL's end?

suwiding commented 7 years ago

@marissakings I fixed the ordering problem with volume 41 in BHL using the Admin Dashboard. I don't know why it happened.