rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library

Kirtlandia, BHL bib 121359 #27

Closed: suwiding closed this issue 7 years ago

suwiding commented 7 years ago

Hi Rod, I apologize for not being in touch sooner. We got complete article metadata for Kirtlandia from the library at the Carnegie Museum of Natural History. I did a lot of scrubbing of the data in OpenRefine, including some rudimentary standardization of the author names. I then exported the metadata from OpenRefine as a TSV file, loaded it into EndNote, and exported it from EndNote as an RIS file. Finally, I turned it into a zip file because that seems to be a requirement for attaching the metadata to a GitHub issue. I noticed that a single Kirtlandia article was already defined in BioStor and BHL, so I removed it from the attached metadata.

Please let me know if this metadata is all right or if you can recommend ways to improve it. The other thing that occurs to me is that the EABL team could share a cloud-based EndNote library with you and you could grab metadata from there.

As always, thanks very, very much.
Susan Lynch

Attachment: kirtlandia6.txt.zip
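For readers who want to skip the EndNote round trip, the TSV-to-RIS step described above can be sketched directly. This is only a minimal illustration, not the workflow actually used in this issue: the column names (`authors`, `title`, `journal`, and so on) and the `tsv_to_ris` function are hypothetical, and a real OpenRefine export would need its own header mapping.

```python
import csv
import io

# Hypothetical column names mapped to standard RIS tags.
# A real export would use whatever headers the spreadsheet has.
TAG_FOR_COLUMN = [
    ("title", "TI"),
    ("journal", "JO"),
    ("volume", "VL"),
    ("issue", "IS"),
    ("start_page", "SP"),
    ("end_page", "EP"),
    ("year", "PY"),
]


def tsv_to_ris(tsv_text, author_sep=";"):
    """Convert tab-separated article rows to RIS text, one record per row."""
    lines = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        lines.append("TY  - JOUR")
        # One AU line per author; authors are assumed to be separated by author_sep.
        for author in (row.get("authors") or "").split(author_sep):
            if author.strip():
                lines.append(f"AU  - {author.strip()}")
        # Emit the remaining fields, skipping empty or absent columns.
        for column, tag in TAG_FOR_COLUMN:
            value = (row.get(column) or "").strip()
            if value:
                lines.append(f"{tag}  - {value}")
        lines.append("ER  - ")
        lines.append("")
    return "\n".join(lines)
```

Calling `tsv_to_ris` on the text of a TSV file returns a string that can be written out as a `.ris` file. Skipping empty fields keeps the records free of blank tags.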

rdmpage commented 7 years ago

@suwiding Thanks for this. I'll take a look ASAP. Access to a cloud-based EndNote library might be useful.

I'm wondering whether we could also look at having a GitHub repository for RIS files, a little like the one for CSL files (https://github.com/citation-style-language/styles). I am accumulating lots of RIS files for BioStor and BioNames-related work, and need a place to store them. I used to use Mendeley, but it started to be a memory hog, and it's not an open solution. I could create a simple repository of RIS files, and then we could either:

(a) give BHL folks access so they can add RIS files by committing them
(b) BHL could "fork" the repository, add files and/or modify existing files, then issue a "pull" request to have those changes added to the repository.

By doing this we keep everything in the open and anyone can take part. Plus I get to store all these RIS files in a place where others might find them useful.

rdmpage commented 7 years ago

@suwiding Preliminary extraction here: http://biostor.org/issn/0075-6245/year/2010

Haven't had a chance to check that everything is there.

suwiding commented 7 years ago

@rdmpage I took a close look at the results. 83 articles in the original list were defined and 43 weren't. I created another RIS file containing only the missing articles. As an experiment, I added the BHL starting page number to the notes field in the RIS metadata. You can tell me whether or not this is useful.

I'm delighted that so many articles were defined but would like to understand why some articles weren't processed successfully. Perhaps there's something we can do on the BHL side to increase the success rate. I noticed that the page-level metadata in BHL for Kirtlandia doesn't identify implicit page numbers, and I wonder if this creates problems for the BioStor code. If this is a problem, the EABL team can pursue it by editing the page-level metadata ourselves or by asking the contributing library to edit it.

I'll attach another zip containing RIS metadata for the missing articles.

Attachment: Kirtlandia_missing1.txt.zip
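The "missing articles only" RIS file described above could be produced along these lines. This is a rough sketch under stated assumptions, not the procedure actually followed: records are matched on exact (case-insensitive) title, which is a simplification, and the function names, the `bhl_page_ids` mapping, and the wording of the `N1` note are all hypothetical.

```python
import re

# An RIS tag line: two-character tag, two spaces, hyphen, optional space, value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Split RIS text into records; each record is a list of (tag, value) pairs."""
    records, current = [], []
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # skip blank lines and anything that is not a tag line
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":
            records.append(current)
            current = []
        else:
            current.append((tag, value))
    return records


def title_of(record):
    """Return the first TI or T1 value in a record, or an empty string."""
    return next((v for t, v in record if t in ("TI", "T1")), "")


def missing_records(records, defined_titles, bhl_page_ids=None):
    """Keep records whose title is not already defined; optionally add an N1 note."""
    defined = {t.casefold() for t in defined_titles}
    bhl_page_ids = bhl_page_ids or {}  # hypothetical mapping: title -> BHL page id
    missing = []
    for record in records:
        title = title_of(record)
        if title.casefold() in defined:
            continue
        record = list(record)  # copy so the input records are left unchanged
        if title in bhl_page_ids:
            record.append(("N1", f"BHL start page: {bhl_page_ids[title]}"))
        missing.append(record)
    return missing


def write_ris(records):
    """Serialize records back to RIS text."""
    lines = []
    for record in records:
        lines.extend(f"{tag}  - {value}" for tag, value in record)
        lines.append("ER  - ")
        lines.append("")
    return "\n".join(lines)
```

Chaining `parse_ris`, `missing_records`, and `write_ris` gives a new RIS file restricted to the undefined articles. In practice, matching on journal, volume, and start page would likely be more robust than matching on title alone.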

rdmpage commented 7 years ago

@suwiding Thanks for taking the time to check this. I confess I was being a little lazy, as I just ran an automated script without much manual fussing. As you suspect, the problem seems to be the lack of "Page 1" in the BHL metadata (page numbering often starts with page 2). This causes problems for BioStor. It can cope if there's only one "page 1" in an item, but not if there are multiple missing page 1's.
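To make the "missing page 1" problem concrete, here is a small illustrative heuristic. It is not BioStor's code. It assumes an item's page labels are available as a list in scan order, with `None` for pages that carry no number, and it treats an unlabelled page directly followed by page "2" as an implicit page 1. Both function names are hypothetical.

```python
def infer_implicit_first_pages(labels):
    """Fill in "1" for an unlabelled page that is directly followed by page "2"."""
    inferred = list(labels)
    for i in range(len(inferred) - 1):
        if inferred[i] is None and inferred[i + 1] == "2":
            inferred[i] = "1"
    return inferred


def candidate_start_pages(labels, start_page):
    """Return scan positions whose (possibly inferred) label equals start_page."""
    wanted = str(start_page)
    inferred = infer_implicit_first_pages(labels)
    return [i for i, label in enumerate(inferred) if label == wanted]
```

When `candidate_start_pages` returns more than one position for page 1, the item contains several articles that each start on an implicit page 1. Some extra signal, such as volume, issue, or title matching, is then needed to choose between them, which may be why items with multiple missing page 1's are harder to process automatically.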

rdmpage commented 7 years ago

@suwiding OK, I think we've got them all now...

suwiding commented 7 years ago

@rdmpage I believe that all articles are defined now! Thank you! I noticed one small problem, which you may or may not want to fix depending on how fussy you are. For BioStor article 193169, there are extraneous double quotes in the title. I didn't notice this when I worked on the metadata. Sorry. I'm getting a much better understanding of what I need to pay attention to.

rdmpage commented 7 years ago

Great, thanks for checking. I've fixed the double quotes.