rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library
http://biostor.org

Phytologia 0031-9430 and a question #32

Closed suwiding closed 7 years ago

suwiding commented 7 years ago

@rdmpage I want to be sure that there's no duplication of effort between you and the team that I work with when gathering citations/references. (We did some work to create article level metadata for Quaestiones Entomologicae and I just noticed that you pulled most if not all of the articles for that publication in from elsewhere. Did you pull the citations/references in from Wikidata? If so would you be willing to share the technique that you used with me?) I got what I think is a complete list of citations for Phytologia from the Index of American Botanical Literature. We're happy to fill the article definition gaps for this publication but don't want to invest the time if you're on the verge of filling the gaps in a different way. Are you working on Phytologia or should we go ahead and identify the gaps? Thanks for letting me know.

rdmpage commented 7 years ago

@suwiding I screen scraped Quaestiones Entomologicae, which gave me most of the details except for the last page, which I computed from the starting page of the next article, then did some manual editing. Like most scraping, the techniques are general but the code is site-specific.
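A minimal sketch of the last-page computation described above. This is not the actual scraping code; the function name `fill_end_pages`, the dictionary keys `spage` and `epage`, and the `shared_page` switch are all assumptions made for illustration. It assumes the articles come from a single volume with continuous pagination.

```python
from typing import Dict, List


def fill_end_pages(articles: List[Dict], shared_page: bool = False) -> List[Dict]:
    """Return copies of the articles with a missing 'epage' filled in.

    Hypothetical helper: the end page of an article is inferred from the
    starting page of the next article in the same volume.
    """
    ordered = sorted(articles, key=lambda a: a["spage"])
    filled = []
    for i, article in enumerate(ordered):
        article = dict(article)  # do not modify the caller's data
        article.setdefault("epage", None)
        if article["epage"] is None and i + 1 < len(ordered):
            next_start = ordered[i + 1]["spage"]
            if shared_page:
                # The article ends on the page where the next one begins.
                article["epage"] = next_start
            else:
                # The article ends on the page before the next one begins.
                article["epage"] = max(article["spage"], next_start - 1)
        # The last article has no successor, so its end page stays None
        # and has to be filled in by hand.
        filled.append(article)
    return filled
```

The last article in a volume cannot be completed this way, which is one reason some manual editing is still needed.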

I've done a bit of work on Phytologia but feel free to identify gaps. I've been using Web of Science and screen scraping to get metadata.

There are a couple of ways we could avoid duplication. One approach would be for you and/or your team to open an issue here for each journal you're working on. That way we track progress, and I know not to duplicate what you're doing.

Another might be to have a centralised repository of article-level metadata where we can store RIS files. We could use issues to say "I'm working on this journal", and use GitHub to add data as and when it's ready (either by contributing to the repository, or by "forking" the repository and issuing pull requests). I've started a repository for some RIS files here: https://github.com/rdmpage/journal-articles-ris This is a tiny fraction of the data I've accumulated over time; I hope to populate it more heavily when I get the chance.

If you've ever used something like http://trello.com then that is another way we could work together. There's a Trello "clone" that I've linked to this repository: https://waffle.io/rdmpage/biostor We could use this to move issues for journals into columns such as "in progress", etc.

Open to suggestions of other ways to coordinate things (I've pretty much always been a team of one). If you want to chat about this I'm happy to Skype or FaceTime.

trosesandler commented 7 years ago

I'd prefer we stick with github for now, since we've already started using it, and see if it has enough functionality to communicate the status of what we're working on. I second the idea of us opening an issue for each journal we are working on. Susan and I had been waiting to do that until after we acquired and normalized the citations, but I think it makes sense to do it as soon as we start working on a journal so we aren't duplicating your efforts, Rod. Rod, could you also open tickets when you begin working on journals? So far I have been attaching the files to the journal issue when they are ready, but it sounds like you want us to post them to https://github.com/rdmpage/journal-articles-ris? Should we then link the record in the repository back to the issue?

rdmpage commented 7 years ago

OK, let's stay with github. Opening an issue for each journal makes sense. Adding RIS files to https://github.com/rdmpage/journal-articles-ris makes sense as well, that way we can grow an archive that may be useful for other tasks.

rdmpage commented 7 years ago

@trosesandler Oh, and if a journal has DOIs, is in JSTOR, in SciElo, or is published using PKP then there are ways to get article-level metadata that is likely to be better quality than WoS.

trosesandler commented 7 years ago

Ok then I'll need some pointers on how to go about checking for that and downloading them. I don't have access to JSTOR and I'm not sure what PKP is

rdmpage commented 7 years ago

Even if you don’t have a license to access JSTOR articles you should still be able to browse it. Often a Google search for an article title plus “JSTOR” will tell you if the journal is in JSTOR. You can also browse by title: http://www.jstor.org/action/showJournals?browseType=title

PKP, sorry, I was in a rush. That is the Public Knowledge Project, which produces Open Journal Systems (OJS), used by lots of smaller journals (and some large ones such as Zootaxa, see http://biotaxa.org/Zootaxa ). The interface is pretty easy to recognise, even if it is customised. OJS has an OAI-PMH interface that can be queried, so one can (usually) extract metadata, albeit often in a coarse Dublin Core format.
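A rough sketch of what querying such an OAI-PMH interface can look like, using only the Python standard library. `ListRecords` and `oai_dc` are part of the OAI-PMH standard; the helper names `list_records_url` and `parse_records` are made up for illustration, and the endpoint address has to be found for each journal (the URL in the usage note is a placeholder, not a real endpoint).

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url: str, set_spec: Optional[str] = None) -> str:
    """Build a ListRecords request asking for unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text: str) -> List[Dict[str, List[str]]]:
    """Collect the Dublin Core fields of every record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI + "record"):
        fields: Dict[str, List[str]] = {}
        for element in record.iter():
            if element.tag.startswith(DC) and element.text:
                name = element.tag[len(DC):]
                fields.setdefault(name, []).append(element.text.strip())
        if fields:  # deleted records carry no metadata
            records.append(fields)
    return records
```

For example, `list_records_url("https://example.org/index.php/journal/oai")` gives a URL whose response can be passed to `parse_records`. Large result sets are paged with a `resumptionToken`, which this sketch does not follow. The "coarse" part shows up in fields such as `source`, where journal, volume, and pages tend to be packed into one string that still needs parsing.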

trosesandler commented 7 years ago

Ok that makes sense. What about DOIs? Is the best way to verify if it has DOIs to search for the journal in CrossRef? When I do that for Raptor Research it does bring up some articles (4,022,471 results) but it's not clear to me how to further filter the results or how to download the citations in bulk as you can do in WOS.

rdmpage commented 7 years ago

If you find an article in http://search.crossref.org from a journal, then you can add that info to the github issue. I have code that can harvest metadata for individual journals based on knowing their ISSN.
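The harvesting code itself isn't shown here. As an illustration only, one way to list a journal's articles by ISSN is the public CrossRef REST API, which exposes `/journals/{issn}/works` and supports paging with a `cursor` parameter. The function names below are assumptions, and whether a given journal (Phytologia, ISSN 0031-9430, is used as the example) has any records in CrossRef has to be checked.

```python
import json
from typing import Dict, Iterator
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API = "https://api.crossref.org"


def journal_works_url(issn: str, rows: int = 100, cursor: str = "*") -> str:
    """URL for one page of the works CrossRef holds for a journal."""
    query = urlencode({"rows": rows, "cursor": cursor})
    return f"{API}/journals/{quote(issn)}/works?{query}"


def harvest(issn: str, rows: int = 100) -> Iterator[Dict]:
    """Yield every work for the ISSN, following CrossRef's deep-paging cursor."""
    cursor = "*"
    while True:
        with urlopen(journal_works_url(issn, rows, cursor)) as response:
            message = json.load(response)["message"]
        items = message.get("items", [])
        if not items:
            break  # an empty page means everything has been fetched
        yield from items
        cursor = message["next-cursor"]
```

Looping over `harvest("0031-9430")` would give one dictionary per article, with keys such as `DOI`, `title`, `volume`, and `page` when the publisher has deposited them. Filtering by ISSN this way avoids the millions of loosely matching hits a free-text search returns.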

rdmpage commented 7 years ago

Is there a list somewhere of the journals BHL wants articles for? That way I could look ahead and see what ones I may already have metadata for.

crowleyb commented 7 years ago

BHL does not have a list but @trosesandler and Susan Lynch have been working on getting articles for a number of in-copyright titles digitized through the grant project they are working on called "EABL" - perhaps they can answer...

suwiding commented 7 years ago

@rdmpage I want to give you the status of the Phytologia citations. I have slightly fewer than 1300 citations ready to pass to you. They are all, or almost all, unique citations, i.e. they don't duplicate articles already defined in BioStor and BHL. Since these citations include an RIS DP field (database provider) and this is the first time we've tried passing this data, we want to do it when Mike is around. Therefore, the plan is to pass about 20 citations as a test immediately after the holidays. If all goes well (and I expect it to!) the remaining citations will be passed soon afterwards.

rdmpage commented 7 years ago

@suwiding OK, sounds like a plan :)

suwiding commented 7 years ago

The attached zip file contains all of the articles currently lacking in Phytologia volume 1. (There are 17 in all.) The notes field contains the BHL start page number and the DP field contains 'Index of American Botanical Literature'. In this publication, the end of one article and the beginning of the next article share a page. Does this pattern cause problems for the BioStor code? phytologia_lacking_vol1.txt.zip
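For readers who haven't seen the format, here is a hedged sketch of how one such RIS record could be written out. The RIS tags are standard (`N1` is the notes field, `DP` the database provider), but the function name `to_ris`, the dictionary keys, and the idea of putting the bare page id alone in `N1` are assumptions; the attached files may lay out the notes field differently.

```python
from typing import Dict


def to_ris(article: Dict) -> str:
    """Serialise one article as a RIS record (hypothetical field names)."""
    lines = [("TY", "JOUR")]
    lines += [("AU", author) for author in article.get("authors", [])]
    tag_for_key = (
        ("TI", "title"),
        ("JO", "journal"),
        ("SN", "issn"),
        ("VL", "volume"),
        ("SP", "spage"),
        ("EP", "epage"),
        ("PY", "year"),
        ("N1", "bhl_page_id"),        # notes: BHL id of the starting page
        ("DP", "database_provider"),  # source that supplied the citation
    )
    for tag, key in tag_for_key:
        value = article.get(key)
        if value is not None:
            lines.append((tag, str(value)))
    lines.append(("ER", ""))  # end-of-record marker
    return "\n".join(f"{tag}  - {value}" for tag, value in lines) + "\n"
```

Calling `to_ris` with `journal="Phytologia"`, `issn="0031-9430"`, `volume=1`, a placeholder `bhl_page_id`, and `database_provider="Index of American Botanical Literature"` produces a record whose `DP` line is the value meant to show up as an extra contributor. When two articles share a page, the `EP` of the first simply equals the `SP` of the second.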

rdmpage commented 7 years ago

@suwiding Thanks, I've added these to BioStor. Fingers crossed the contributor field will be imported into BHL...

suwiding commented 7 years ago

@rdmpage Mike asked me to send another smallish batch of Phytologia citations through BioStor into BHL. Attached are citations for the articles missing from volume 2 of Phytologia. There are 29 new citations in the file.

phytologia_missing_v2.txt.zip

suwiding commented 7 years ago

@rdmpage We need to process another smallish batch of citations that includes an additional contributor (the DP field in the incoming RIS file). It's the missing articles from Phytologia volume 3, BHL item 46705.
phytologia_missing_v3.txt.zip

rdmpage commented 7 years ago

@suwiding OK, these are being added as I type this.

suwiding commented 7 years ago

@rdmpage Here's a smallish batch for newly added issues. Starting pageids supplied but no additional contributor (database provider). Phytologia_v98_v99.txt.zip

suwiding commented 7 years ago

@rdmpage Here are gap fill articles for Phytologia through v. 40. There are 407 in all. Database Provider (DP) is provided so there should be an extra contributor in BHL. Thanks!!

phytologia_feb_gaps.txt.zip

rdmpage commented 7 years ago

Many thanks @suwiding I've added the last two sets. Just noticed that 2015 (volume 97, http://www.biodiversitylibrary.org/item/201069 ) has no articles. Is that on the to do list? If not, I could look at extracting those from the web site.

suwiding commented 7 years ago

@rdmpage I've got the articles. Several pages of content were missing from this issue so I requested a replacement. I'll upload the articles as soon as the content is complete, hopefully soon. Thanks!

rdmpage commented 7 years ago

@suwiding Ah, I see.

suwiding commented 7 years ago

@rdmpage I finally have the remaining articles from Phytologia including volume 97, 2015. There are 778 articles in all. The citations include database provider (Index of American Botanical Literature) and the BHL starting pageid in the notes field of the RIS. Please ingest with the version of the code that recognizes the DP field in the RIS. Thanks very much!

phytologia_gaps_mar_17.txt.zip

rdmpage commented 7 years ago

@suwiding I've just added them, many thanks!