rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library
http://biostor.org

Define lacking articles in Zoologica, New York Zoological Society #28

Closed: suwiding closed this issue 7 years ago

suwiding commented 8 years ago

@rdmpage My enthusiasm about the definition of articles in Kirtlandia inspired me to finish the work that I was doing on Zoologica. I got complete article metadata from the librarians at the Wildlife Conservation Society and cleaned it up. I also removed all citations that are already defined in BHL in order to prevent duplication. I did this by pulling all articles defined in Zoologica in BHL in EndNote format using the BHL APIs and then importing those citations into EndNote. I could spot the missing articles fairly easily once everything was in EndNote. Again, there's an RIS file inside the zip. zoologica_nodups.txt.zip
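The duplicate check described above was done by eye in EndNote. As a rough sketch only, the same comparison could be scripted by matching citations on volume, start page, and year. The RIS tag names used as dictionary keys, the helper names, and the second sample record are assumptions made for illustration; only the Beebe citation appears in this thread.

```python
import re

def citation_key(record):
    """Build a comparison key from volume, start page, and year."""
    def digits(value):
        # Keep only the first run of digits so "139-149" and "139" agree.
        match = re.search(r"\d+", str(value or ""))
        return match.group(0) if match else ""
    return (digits(record.get("VL")), digits(record.get("SP")), digits(record.get("PY")))

def missing_articles(candidates, already_defined):
    """Return candidates whose key does not appear among already_defined."""
    seen = {citation_key(r) for r in already_defined}
    return [r for r in candidates if citation_key(r) not in seen]

# Hypothetical sample data: the second library record and the BHL record are made up.
library_list = [
    {"TI": "Racket formation in tail-feathers of the motmots",
     "VL": "1", "SP": "139", "PY": "1910"},
    {"TI": "An example article already in BHL",
     "VL": "1", "SP": "1", "PY": "1907"},
]
bhl_list = [
    {"TI": "An Example Article Already in BHL",
     "VL": "1", "SP": "1-20", "PY": "1907"},
]

for record in missing_articles(library_list, bhl_list):
    print(record["TI"])
```

Matching on volume, start page, and year rather than on titles sidesteps differences in capitalisation and punctuation, but it would treat two articles that start on the same page as one, so the output still needs a manual check.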

rdmpage commented 8 years ago

@suwiding I've run this file and it adds a lot more articles :) There will be issues with some of the articles, especially missing plates, which I haven't had time to fix. If you want to see a visualisation of coverage, the old BioStor has a nice display (http://direct.biostor.org/issn/0044-507X). One day I hope to port this to the new BioStor.

suwiding commented 8 years ago

@rdmpage I'm happy that so many new articles were defined, but I'd like to achieve complete article definition. What's the best way to handle the articles that weren't defined? Here's one example: Beebe, William. Racket formation in tail-feathers of the motmots. Zoologica 1 (5) 139-149. 1910. Would an RIS file including only the missing articles, with the BHL starting page number in the notes field (or another field), help?

rdmpage commented 8 years ago

@suwiding Yes, a list of the missing references with BHL pageID in the notes field would certainly make things easier. A problem with Zoologica is that there's another journal with the same name in BHL (published in Germany), which complicates things. In some cases, such as the Beebe paper, the metadata doesn't match the article, see http://biodiversitylibrary.org/page/31057840
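To illustrate what "BHL PageID in the notes field" could look like in practice, here is a minimal sketch of reading an RIS record and pulling the PageID out of the `N1` (notes) tag. This is not BioStor's actual import code. The function names are hypothetical, the PageID `12345678` is a made-up placeholder, and the sketch assumes the note holds either a bare number or a BHL page URL.

```python
import re

# An RIS line is a two-character tag, two spaces, a hyphen, then the value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

def parse_ris(text):
    """Parse RIS text into a list of dicts mapping each tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = TAG_LINE.match(line.rstrip())
        if not match:
            continue  # this sketch skips blank and continuation lines
        tag, value = match.groups()
        if tag == "ER":  # end of record
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value.strip())
    if current:  # tolerate a final record with no ER line
        records.append(current)
    return records

def page_id(record):
    """Return the first run of digits in the N1 notes as an int, or None."""
    for note in record.get("N1", []):
        match = re.search(r"\d+", note)
        if match:
            return int(match.group(0))
    return None

# The N1 value below is a placeholder, not the real PageID for this article.
sample = """TY  - JOUR
AU  - Beebe, William
TI  - Racket formation in tail-feathers of the motmots
JO  - Zoologica
VL  - 1
IS  - 5
SP  - 139
EP  - 149
PY  - 1910
N1  - 12345678
ER  - 
"""

for record in parse_ris(sample):
    print(record["TI"][0], page_id(record))
```

Because a PageID identifies one scanned page in BHL, it also avoids the ambiguity caused by the two journals named Zoologica.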

suwiding commented 8 years ago

OK. I'll get a list together but this will take a few days. Thanks.


rdmpage commented 8 years ago

Great. I realise that this is a tedious process. I’d love to find the time to put together a nice interface for the article extraction process and metadata editing. I have two tools I use to talk to the old BioStor database, including a simple graphical tool for choosing pages when pagination is discontinuous (e.g., plates) but this isn’t really ready for prime time - it generates SQL that I have to cut and paste :O. Oh for a week or two to do nothing else but this task.


suwiding commented 8 years ago

I got help from some lovely volunteers at NYBG. Attached is a zip file of 83 articles that were not defined in BioStor and BHL on the previous attempt. The BHL starting page number is in the notes field of the RIS metadata. Zoologica_missing_83.txt.zip

rdmpage commented 8 years ago

@suwiding Great, thanks for this. Looking at the file the RIS fields are a little different to what I'm used to (e.g., month and day stored in the "Y2" field), so I'll have to tweak my code a little. What software was used to create this file?

suwiding commented 8 years ago

I initially received the article metadata from the Bronx Zoo library as an EndNote file with an extension of enl. I'm not sure what version of EndNote the references were created in. I loaded it into EndNote X7 for Mac and after the import the data that you're looking at, e.g. '08/17', the month and date of publication, appeared in the EndNote field labeled 'Access Date'. I did a lot of work on the references in EndNote and finally exported from EndNote to an RIS file. I didn't do any customization of the RIS data during the export. (Maybe I should do some customization.) I'm sorry to say that I didn't pay much attention to this field either at import or export time. I don't know if the librarian who entered the references into EndNote put MM/DD into the Access Date field intentionally or if this is an artifact of how the data was manipulated. I'm happy to manipulate the data in EndNote and then export it again if that would help.

suwiding commented 8 years ago

I looked at some RIS documentation at http://referencemanager.com/sites/rm/files/m/direct_export_ris.pdf. It makes sense to me that the PY field be in YYYY format and the DA field be in YYYY/MM/DD format. Also, the data that I had in the Access Date field made no sense to me, so I got rid of it. Here is a file with all of those changes made. zoo_change_date.txt.zip
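The date cleanup described here was done in EndNote, but it could also be scripted. The sketch below is one possible approach, not what was actually run. It assumes each record is a dictionary keyed by RIS tag, and that an MM/DD value in `Y2` is the month and day of publication, which the thread leaves uncertain. The function name and the pairing of '08/17' with the year 1910 are illustrative only.

```python
import re

def normalise_dates(record):
    """Return a copy with PY as YYYY and, when possible, DA as YYYY/MM/DD."""
    fixed = dict(record)
    year_match = re.search(r"\d{4}", fixed.get("PY", ""))
    month_day = re.fullmatch(r"\s*(\d{1,2})/(\d{1,2})\s*", fixed.get("Y2", ""))
    if year_match:
        fixed["PY"] = year_match.group(0)
    if year_match and month_day:
        month, day = (int(part) for part in month_day.groups())
        # Only rewrite values that look like a plausible month and day.
        if 1 <= month <= 12 and 1 <= day <= 31:
            fixed["DA"] = f"{fixed['PY']}/{month:02d}/{day:02d}"
            del fixed["Y2"]  # assumption: Y2 held nothing but the MM/DD value
    return fixed

# Illustrative input only: the year is not taken from the actual file.
print(normalise_dates({"PY": "1910", "Y2": "08/17"}))
```

Values in `Y2` that do not look like MM/DD are left untouched, so anything unexpected can still be reviewed by hand.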

rdmpage commented 8 years ago

@suwiding I've run the file and added the additional articles. I've not checked for articles with missing plates, but I'm hoping that we now have pretty complete coverage for this journal. Thanks again for the RIS files with BHL PageIDs, they make things much easier.

suwiding commented 8 years ago

@rdmpage The last group looks good. The only thing missing now is volume 30. The digitized volume isn't in BHL yet. I'll continue to monitor this and will pass the corresponding articles to you after the content is added to BHL.

rdmpage commented 8 years ago

@suwiding Great, I’ll close this issue and we can revisit it when volume 30 is scanned.


rdmpage commented 7 years ago

@suwiding Volume 30 is now in BHL, do you have the metadata for the articles in this volume?