rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library
http://biostor.org

Metadata for non-Hindawi Psyche articles #65

Open rdmpage opened 7 years ago

rdmpage commented 7 years ago

Issue from @jar398 originally posted as https://github.com/rdmpage/bionames/issues/27

The Hindawi scans (and the Hindawi DOIs) of Psyche cover about 95% of the corpus. I've prepared a CSV file that gives metadata for about 250 additional papers, ones that were not available to Hindawi when they scanned, but that are all available in BHL.

Notes on format

  • see header row for column meaning
  • there is some HTML markup: <i> for italics, and entities such as &eacute; for special characters
  • authors are separated by semicolon (but do not treat semicolon as a separator in the title column)
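
The format notes above could be handled along these lines. This is only a sketch: the column names `authors` and `title` are assumptions (the real names are in the file's header row), the sample row is invented for illustration, and `read_articles` and `clean` are hypothetical helpers, not part of BioStor.

```python
import csv
import html
import io
import re

# Matches the opening and closing italic tags mentioned in the format notes.
ITALIC_TAG = re.compile(r"</?i>")

def clean(text):
    """Drop <i> tags and decode named HTML entities into Unicode characters."""
    return html.unescape(ITALIC_TAG.sub("", text)).strip()

def read_articles(csv_text, author_col="authors", title_col="title"):
    """Parse the metadata CSV; column names are assumed, check the header row."""
    articles = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row = dict(row)
        # Semicolons separate authors, but only in the author column.
        row[author_col] = [clean(a) for a in row[author_col].split(";") if a.strip()]
        # The title is cleaned but never split, so its semicolons survive.
        row[title_col] = clean(row[title_col])
        articles.append(row)
    return articles

# Invented sample data, not taken from the real file.
sample = (
    "authors,title\n"
    '"Doe, J.; Roe, A.","Notes on <i>Formica</i>; a r&eacute;sum&eacute;"\n'
)
articles = read_articles(sample)
```

Splitting on the semicolon only within the author column keeps any semicolon in a title intact, which is the case the third note warns about.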

I tried to attach the file to this issue, but github keeps saying "something went really wrong and we can't process that file". I've placed it here for you to pick up:

http://mumble.net/~jar/tmp/psyche-metadata-for-bionames.csv

jar398 commented 7 years ago

I don't know whether you've processed this or not, but I fixed some whitespace problems, did one or two other tweaks, and renamed the file (biostor instead of bionames). The changes are pretty minor, but the earlier version lacked authors for three or four articles. Here's the new URL:

http://mumble.net/~jar/tmp/psyche-metadata-for-biostor.csv

rdmpage commented 7 years ago

@jar398 More Psyche articles added, but a few @BioDivLibrary volumes lack page numbers, so the rest will need to be done manually (sigh).

jar398 commented 6 years ago

I now have BHL page lists for all the non-Hindawi articles (244 of them). I spent a fair amount of time on this, so while there are probably some mistakes, I don't think there are many.

Interested? It’s a csv file keyed by volume + first page + last page. I can put the information in some other format if doing so would be helpful. It’s intended to be used along with the metadata file I told you about earlier.

rdmpage commented 6 years ago

@jar398 Happy to take a look and see how easy it would be to add these. CSV should be fine.

jar398 commented 6 years ago

OK. Page lists are here: http://mumble.net/~jar/tmp/article-pages.csv Columns are volume, first page, last page, first 10 characters of title (needed in a few cases), semicolon-separated page list, number of 'guessed' pages, number of 'missing' pages. 'Guessed' pages are those that are included on the assumption that if volume page n maps to BHL page i, then volume page n+1 probably maps to BHL page i+1. This rule is only used if the BHL metadata file doesn't have a page with volume page number n+1 indicated.
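
For anyone checking the file, the 'guessed' page rule might look roughly like the sketch below. It is not the script that produced the CSV. It assumes that volume page numbers are integers, that `known` maps volume page numbers to BHL page ids taken from the BHL metadata, that guesses may chain from one guessed page to the next, and that 'missing' means a page with neither a BHL entry nor a usable predecessor. The function name `map_pages` is hypothetical.

```python
def map_pages(first_page, last_page, known):
    """Map volume pages first_page..last_page to BHL page ids.

    known: dict of volume page number -> BHL page id (from BHL metadata).
    Returns (bhl_ids, guessed_count, missing_count).
    """
    bhl_ids = []
    guessed = 0
    missing = 0
    previous = None  # BHL id assigned to the preceding volume page, if any
    for n in range(first_page, last_page + 1):
        if n in known:
            # BHL metadata names this page, so no guessing is needed.
            current = known[n]
        elif previous is not None:
            # Rule from the comment: page n-1 -> i implies page n -> i+1.
            current = previous + 1
            guessed += 1
        else:
            # Assumption: nothing to extrapolate from, so count it as missing.
            current = None
            missing += 1
        if current is not None:
            bhl_ids.append(current)
        previous = current
    return bhl_ids, guessed, missing

# Invented example: volume pages 10-14, with BHL metadata lacking 12 and 14.
ids, guessed, missing = map_pages(10, 14, {10: 500, 11: 501, 13: 503})
page_list = ";".join(str(i) for i in ids)  # semicolon-separated, as in the CSV
```

A row whose guessed or missing count is non-zero is a natural candidate for manual checking against the BHL scans.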

I only found one volume (14) in this set where the BHL volume page numbers are systematically wrong, and I corrected for this. There could be others. I also left out duplicate pages in volume 56.

I made a few small improvements to the metadata file that I gave you before: http://mumble.net/~jar/tmp/psyche-metadata-for-biostor.csv

jar398 commented 6 years ago

Any progress on this? I think it's important to get this information into BHL, since the only place it exists now is the CEC Psyche web site, which is not exactly archival.

On another subject, I'd be interested in adding your BHL page URLs to my master TOC. What do you recommend? All I really need is a table with DOIs and starting page ids, although some redundant metadata would be nice for consistency checking.

rdmpage commented 6 years ago

@jar398 I’m swamped at the moment so no, no progress yet.

(Sent by email in reply to Jonathan A Rees's comment of 8 Aug 2018, 16:33.)