rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library
http://biostor.org

Index The Journal of the East Africa and Uganda Natural History Society #24

Open trosesandler opened 8 years ago

trosesandler commented 8 years ago

Rod,

BHL has recently digitized some items for the Journal of the East Africa and Uganda Natural History Society (see http://www.biodiversitylibrary.org/bibliography/14163#/summary), and we'd like to article-ize this content. The publisher, East Africa Natural History Society, gave us article citations for this content back in the Citebank days. I'm hoping we can pass the citations on to you to index via BioStor. One of the challenges with the journal is the inconsistent numbering of volumes and issues. Also, many of the volumes were bound together, which means there will be multiple page 1s in a single item. I recall you saying this makes it more challenging for BioStor. I have attached some sample data for your review, just to see if it contains enough data to do the matching. Let me know your thoughts. JEANH_1910_1918.xlsx

Trish

rdmpage commented 8 years ago

I've already done some articles for this journal; I think I made a start by harvesting the BHL part metadata and using that. If the data lacks pagination that will make things a bit more tedious, but it is still doable. The journal's history looks complicated; it is described here:

Depew, L., & Berghe, E. V. (1994, July). Journal History. Journal of East African Natural History. East African Natural History Society. http://doi.org/10.2982/0012-8317(1994)83[97:jh]2.0.co;2

0012-8317%281994%2983%5B97%3Ajh%5D2%2E0%2Eco%3B2.pdf

rdmpage commented 8 years ago

@trosesandler I've had another look at the spreadsheets and they look good (the default iPad viewer mangled things so that I couldn't see the page numbers). I could certainly use these to add extra articles to BioStor.

rdmpage commented 8 years ago

@trosesandler BHL already has a lot of the metadata associated with the PDFs it has for this journal, so I can probably also just use that.

trosesandler commented 8 years ago

Hi Rod

Yes the journal made several name changes

BHL has only digitized the first one from 1910-1942. You said you already did some articles for this journal but I don't see those showing up in BHL so could you send me a link to what is complete?

The metadata I sent you was what the publisher sent to me several years ago when we had uploaded the PDFs via Citebank. Since much of the volume and issue data was missing in the columns I added that but it can also be parsed from the filename. Sounds like you are able to grab the metadata from the PDFs so you don't need me to send you spreadsheets right? Just wanted to clarify.

Trish

rdmpage commented 8 years ago

@trosesandler Actually BHL has pretty much everything up until BioOne started publishing the journal, including Utafiti:

I'll take whatever metadata is available, so maybe you could send me whatever spreadsheets you have, and I'll also work with the PDF-related metadata.

trosesandler commented 8 years ago

Actually you're right, I didn't realize BHL had digitized almost all of it. In that case I'm attaching the full spreadsheet (JEANH_import_test.xlsx) that was given to me several years ago. At that time I did some normalization on the filename and other tweaks. More recently I filled in the volume and issue columns since they were pretty empty. I added them by looking at the filenames and the actual content online. If you are able to parse the volume and issue values from the filename that might be faster. Otherwise I can add them manually if that saves you some time.

trosesandler commented 8 years ago

Hi Rod, Just checking in to see how the spreadsheets were working for you. Were you able to make use of them for article-izing the content?

rdmpage commented 8 years ago

@trosesandler Making a slow start. I've imported the Excel spreadsheet into Google Docs and have extracted start and end pages from the column Pagination_in_host. There's going to be some manual work involved to find the articles :(
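
For anyone repeating the page-extraction step outside Google Docs, here is a minimal Python sketch. The actual layout of the values in Pagination_in_host is not shown in this thread, so the formats handled below (a hyphenated range such as "97-104", or a single page such as "97", optionally preceded by text like "pp.") are assumptions, and `parse_pagination` is a hypothetical helper name, not part of BioStor.

```python
import re
from typing import Optional, Tuple

# Assumed formats: "97-104", "pp. 97-104", or a single page "97".
# Hyphen, en dash (\u2013) and em dash (\u2014) are accepted as range separators.
_PAGE_RANGE = re.compile(r"(\d+)\s*(?:[-\u2013\u2014]\s*(\d+))?")


def parse_pagination(value: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (start_page, end_page), or None if no page number is found."""
    match = _PAGE_RANGE.search(value or "")
    if match is None:
        return None
    start = int(match.group(1))
    # A single page number is treated as a one-page article.
    end = int(match.group(2)) if match.group(2) else start
    return start, end


if __name__ == "__main__":
    for cell in ["97-104", "pp. 12 - 15", "7", ""]:
        print(repr(cell), "->", parse_pagination(cell))
```

The pattern takes the first number it finds as the start page, so cells that put a volume or issue number before the page range would need a stricter pattern or manual review.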

trosesandler commented 8 years ago

yep I figured because of the way things were bound. If I can help with any of the manual part let me know.

Trish

rdmpage commented 8 years ago

@trosesandler I've sent you an email with an editable link to the spreadsheet I'm using to do the mapping https://docs.google.com/spreadsheets/d/1czlyL-WAApnxBxZLX2kIq1OmqIQ5ICVKY5C-8wLjjf4/edit?usp=sharing

Progress (both automated and manual) is here: http://biostor.org/issn/0012-8317/year/1970

trosesandler commented 8 years ago

Rod

Thanks for sharing this with me. This does look like a lot of manual work! In order for me to help with some of it I need to understand a few things.

1) What is the relationship between the spreadsheet and the progress page? In some cases they are in sync but in others they are not, e.g. the article "A New four-toed mongoose from Kenya, Bdeogale Crassicauda Nigrwscens, ssp. nov." shows as being completed on the progress page, but in the spreadsheet the BHL page id is blank.
2) How many of the articles for which we have BHL page ids were done manually, and how many automatically? I'm trying to understand why it succeeds sometimes and fails others.
3) At what point would it be useful for me to manually find the page ids? I wasn't sure which of the articles you've tried to automatically match so far and which failed.

thanks!

rdmpage commented 8 years ago

Hi Trish,

Sorry for the lack of documentation :O

  1. The spreadsheet has no relation to the progress page. I haven't had a chance to add a column saying which ones I've already found.
  2. In the spreadsheet all the PageIDs were found manually. The last volume has multiple articles with the same start page, which is problematic for BioStor at the moment. If an article's starting page within an item is unique, or the OCR text is clear, BioStor can usually match the article. Sometimes the same page number (e.g., "1") appears more than once in the same item, but only one of those pages is actually numbered (e.g., one is labelled "Page 1", the other is blank). These cases are bad, because BioStor will try to match to the numbered page.
  3. I'll look at adding PageIDs for the known articles, so that any blanks are PageIDs that we need to find. At that point you could add some manually if you like. I'm not expecting you to do this, just wanted you to see what's involved in cases where the mapping between articles and BHL is less than straightforward.
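
The uniqueness problem in point 2 can be sketched roughly as follows. This is not BioStor's actual matching code, just an illustration of the check described above. It assumes each item is available as a list of (PageID, printed page label) pairs, with `None` for pages that carry no printed number; the function names and the PageIDs in the example are hypothetical.

```python
from typing import List, Optional, Tuple

# One scanned item: (PageID, printed page label or None if unnumbered).
ItemPages = List[Tuple[int, Optional[str]]]


def candidate_page_ids(item_pages: ItemPages, start_page: int) -> List[int]:
    """PageIDs in the item whose printed label equals the article's start page."""
    wanted = str(start_page)
    return [
        page_id
        for page_id, label in item_pages
        if label is not None and label.strip() == wanted
    ]


def classify_match(item_pages: ItemPages, start_page: int) -> Tuple[str, List[int]]:
    """Label a lookup as 'unique', 'ambiguous' or 'missing'."""
    candidates = candidate_page_ids(item_pages, start_page)
    if len(candidates) == 1:
        # May still be wrong if another copy of this page is unnumbered.
        return "unique", candidates
    if not candidates:
        return "missing", candidates
    return "ambiguous", candidates


if __name__ == "__main__":
    # Made-up PageIDs: two bound volumes in one item, second "page 1" unnumbered.
    item = [(1001, "1"), (1002, "2"), (1003, None), (1004, "2")]
    for page in (1, 2, 5):
        print(page, classify_match(item, page))
```

In the made-up item above, start page 1 comes back as "unique" even though PageID 1003 could be the unnumbered page 1 of a second bound volume, which is exactly the bad case: an automatic match would pick the numbered page. Rows classified as "ambiguous" or "missing" are the ones that need a manual PageID.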

trosesandler commented 8 years ago

Rod, OK, then I will wait until you've added the PageIDs to the spreadsheet, and then I'm happy to help with the manual work. Just let me know. For the EABL project this is my primary role: to figure out how we can increase our article-ization of BHL content. Whether I share citations with you and you are able to automate it, or I do it manually, both get us towards that goal.

trosesandler commented 8 years ago

Hi Rod
Just checking to see how the article-ization is coming for this journal and where I can be of assistance.

rdmpage commented 8 years ago

@trosesandler I'm swamped at the moment so haven't made any more progress on this.