mozillascience / global-sprint-2016

repo for planning of Global Sprint 2016, June 2-3
http://mozillascience.org/global-sprint-2016
Creative Commons Zero v1.0 Universal

Tool to assess accessibility of a list of references #50

Open abbycabs opened 8 years ago

abbycabs commented 8 years ago

[ Project Lead ] @JGDove99
[ GitHub Repo ]
[ Track ] Looking to inspire someone to build such a tool
[ Level ] Advanced
[ Timezone ]

Description

Use-case: An author of a new scholarly article has assembled a list of references, possibly in Zotero or other bibliographic tool, possibly just in Word. This author is a supporter of Open Access and so wants to assess which of her/his referenced sources are going to be accessible to readers of her/his new article.

Possible approach: I understand that Google Scholar does not have an API that would allow one to build a tool that uses Google Scholar to search for open versions of each of the articles in a list of references. And apparently Google Scholar will detect and then prevent attempts to use screen-scraping to do so. But perhaps one of the other search engines like Bing or Baidu or something I've not heard of could do this.

Limiting the search to just articles with DOIs is not sufficient for this purpose, and CrossRef will usually just lead one to the "article of record" (often behind a paywall). For the purposes of this use-case, the author will want to know which of her/his referenced sources are accessible to an end-user free on the open web. Availability of any version of the article (so-called "green" OA) should be acceptable. Possible places the article might reside include repositories (institutional, subject, or government), departmental or personal websites, academia.edu, ResearchGate, etc.

The author, armed with information about which of their referenced sources are still behind paywalls, may want to contact those authors to urge them to provide open versions of their articles, so that readers of the new article will have access to the referenced sources.

Open source code from the OA Button, or from http://dissem.in, might be useful for some, but not all, of this task. So might code from http://doai.io.
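One small, testable piece of the use-case above is pulling DOIs out of a pasted reference list before checking each one against an OA resolver. A minimal sketch (the helper name and regex are illustrative, not taken from any of the tools mentioned; references without DOIs would need a title-based metadata search instead):

```javascript
// Hypothetical helper: extract DOIs from a pasted reference list.
// The pattern follows the commonly recommended form for modern DOIs
// (10.<registrant>/<suffix>); it will not catch every legacy DOI.
function extractDois(referenceText) {
  const doiPattern = /10\.\d{4,9}\/[^\s"<>]+/g;
  const matches = referenceText.match(doiPattern) || [];
  // Strip trailing punctuation that often clings to DOIs in reference lists.
  return matches.map(d => d.replace(/[.,;)]+$/, ''));
}
```

This would be the first stage of the tool: everything it finds can be fed to CrossRef or doai.io, and everything it misses falls into the harder no-DOI search problem discussed below.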


Want to Contribute?

Join us at the Global Sprint June 2-3. Leave a comment in this issue to let the project lead know you're interested in contributing during #mozsprint 2016!


Note to the Project Lead

Congrats, John! This is your official project listing for the Mozilla Science Global Sprint 2016. To confirm your submission, please complete the following:

Here are some exercises that will help your project be more inviting to new contributors. We hope you'll try to complete some of these as you prepare for #mozsprint.

If you complete all the exercises, your project will be eligible to be featured in our collection of open source science projects! Once you've finished this list, contact @acabunoc to submit your project for review.

abbycabs commented 8 years ago

Hey John! This sounds a lot like this project: https://github.com/mozillascience/global-sprint-2016/issues/13

JGDove99 commented 8 years ago

Similar, but taking it to a much broader level. #13 assesses what percentage of a reference list represents articles available on PubMed Central. That's one very important repository, but primarily focused on medical science subjects. #50 would work across all disciplines and will most likely involve use of a general search engine (probably not Google Scholar, because GS has no API and prevents screen-scraping) to find out if an article is anywhere on the web (repositories, publisher sites, author sites like academia.edu or ResearchGate, etc.), and needs to find "green" versions of the article if the article of record is not open.

aleimba commented 8 years ago

I like the idea. Maybe you could also approach OA publishers; a small OA icon in their reference lists (next to OA references) might be nice. But I don't know if that is feasible.

brucellino commented 8 years ago

First of all, a general comment - I don't think a single issue thread is going to be the right medium for this discussion, which is very interesting. Take that as you may :smiley_cat:

Secondly, have you considered using the DataCite APIs to implement this tool? Resolving objects via DOI and then parsing the metadata (which should usually be entirely consumable) to extract cited objects via DOI should be a respectable and implementable baseline study.
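The DataCite route suggested here can be sketched with the network call factored out: given a record in the JSON:API shape the DataCite REST API returns (`data.attributes.relatedIdentifiers`), pull out the DOIs of objects the document cites. Which `relationType` values to accept is an assumption in this sketch:

```javascript
// Hedged sketch: extract cited DOIs from a DataCite-style record.
// Assumes the JSON:API response shape of the DataCite REST API; the
// set of relation types treated as "cites" is illustrative.
function citedDois(record) {
  const related = (record.data && record.data.attributes &&
                   record.data.attributes.relatedIdentifiers) || [];
  return related
    .filter(r => r.relatedIdentifierType === 'DOI' &&
                 (r.relationType === 'Cites' || r.relationType === 'References'))
    .map(r => r.relatedIdentifier);
}
```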

JGDove99 commented 8 years ago

Andreas, I completely agree. Link resolvers, which were the breakthrough standard for ease-of-use access in the behind-the-firewall world of the 90s, have an API which can pre-test whether or not clicking on a link is going to reveal full-text, online access, so the library product can show a different icon if full-text is going to be available. [Saves enormous time when examining a reference list and you know you don't have time for any inter-library-loan requests.] What the user needs is a similar feature for links after a citation, so that they know which ones will resolve to free full-text access even before clicking on them. I've just started having conversations with publishers about this, but first showing that it can be done would be a big help.

-john


JGDove99 commented 8 years ago
  1. Not all citations have DOIs.
  2. Even for those that do, DOIs are under the control of the publisher of the article-of-record, and so for most traditional journals they rarely provide a path to a shared version of the article. 79% of scholarly publishers acknowledge that the author may share a version of their article, but they are not [yet] making it easy for researchers to find those shared articles. This is a topic I've brought up with CrossRef.

-john


bmkramer commented 8 years ago

Very interesting project - I'm running into this at the moment in a side project I'm doing, wanting to check a list of DOIs (in this case, Sci-Hub downloads) for green OA availability.

I won't be able to join this sprint project as I have my own to attend to :-) but will be following with interest.

One additional source (in addition to the dissem.in api and the OA button already mentioned) might be the DOAI initiative built on BASE: http://doai.io/

cvorland commented 8 years ago

You can scrape Google Scholar from a Chrome extension; I do it for mine, although if you load too frequently Google Scholar will still serve a captcha. https://chrome.google.com/webstore/detail/lazy-scholar/fpbdcofpbclblalghaepibbagkkgpkak It checks Google Scholar, PubMed, EuropePMC, and DOAI in search of free texts. Note: I've found that many of Google Scholar's "free texts" aren't always free, which is why I color the icon link yellow in the extension for lower confidence. I drop all the results into a database for each paper, which can be queried by DOI, PMID, title, etc., but the number of papers is limited at this point.

> What the user needs is a similar feature for links after a citation, so that they know which ones will resolve to free full-text access even before clicking on them.

This is what my extension does for PDFs, and you can be confident that full texts are free when linked on PubMed and EuropePMC, based on HTML tags.

JGDove99 commented 8 years ago

bmkramer, DOAI is really neat, and does a lot of what's needed. I notice that it's run by CAPSH, which also runs dissem.in, and is based on an academic search engine, BASE (https://www.base-search.net/), which I'd not heard of before. Thanks for this suggestion. -john dove


blahah commented 8 years ago

Great project! BASE (already mentioned) and CORE (don't think mentioned yet) will cover a lot of the ground required to get it working I think.

JGDove99 commented 8 years ago

Richard, I think http://dissem.in uses CORE as part of its scoring of an author's published record. So that portion of testing accessibility could build on their code.

For some use cases the search for "accessible" needs to cast a wide net. This means departmental websites and sites like ResearchGate and Academia.edu.

-john


blahah commented 8 years ago

@JGDove99 do you have a chat room where we can talk about this project? I have a load of ideas to offload before turning in for the evening :)

JGDove99 commented 8 years ago

How about WhatsApp? I'm at +1-781-964-2325.

-john


blahah commented 8 years ago

@JGDove99 yes dissem.in does use CORE.

I think what you're describing is very much within the scope of dissem.in actually, and also within the scope of Science Fair. I'll describe below how I would go about solving the problem...

Briefly, I'd make a node.js module that, given some metadata about a scholarly document, tries to find all sources for that document and evaluates whether they are freely accessible. Also, I would call it hypatia - because she was the last librarian at the Library of Alexandria 📚.

hypatia would be plugin-based, so each possible source of information or fulltexts would be a plugin. For example, you could start with the following plugins:

  - crossref
  - doai.io

The simplest case would work like this:

  1. First check the crossref API to get complete metadata about the document, if it's there.
  2. If the metadata is there, check whether the article is open access according to the license field.
  3. If it's open access, download the fulltext using the URL provided by crossref.
  4. If it's not open access, resolve the same identifier via the doai.io resolver.
  5. Compare the URL resolved via doai.io with the one given by crossref to see if they are the same. If not, you've got a free version via doai.io that wasn't in crossref.

That will cover more papers than any other single approach I think.
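The decision logic in steps 1-5 can be sketched with the network calls factored out: callers pass in whatever the crossref API and the doai.io resolver returned. The field names (`license`, `url`) are illustrative placeholders, not the exact crossref schema:

```javascript
// Sketch of the crossref + doai.io decision logic, assuming the caller
// has already fetched a (possibly null) crossref record and a (possibly
// null) URL resolved via doai.io for the same DOI.
function classifyAccess(crossrefRecord, doaiUrl) {
  // Step 2-3: open at the publisher according to the license field.
  if (crossrefRecord && crossrefRecord.license === 'open') {
    return { open: true, source: 'publisher', url: crossrefRecord.url };
  }
  // Steps 4-5: doai.io resolves somewhere other than the article of
  // record, so a green copy exists.
  if (doaiUrl && (!crossrefRecord || doaiUrl !== crossrefRecord.url)) {
    return { open: true, source: 'green', url: doaiUrl };
  }
  // Otherwise we only know about the (paywalled) article of record.
  return { open: false, source: null, url: crossrefRecord ? crossrefRecord.url : null };
}
```

Each plugin would populate one of these inputs, so adding sources later doesn't change the core logic.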

Then we could go on adding more plugins that would extend the reach of hypatia:

  - arxiv
  - figshare
  - ssrn
  - zenodo
  - CORE
  - BASE
  - dissem.in
  - microsoft academic graph
  - free-text web searches looking for author / departmental sites

Then hypatia could be integrated into any node.js project, or any web service.

All of this stuff fits well in the framework of tools we're developing at ContentMine, and at Science Fair, and what the OA button folks and Cottage Labs are working on.

Does it sound like this is the sort of thing you were imagining?

JGDove99 commented 8 years ago

Richard, yes, you're on the right track. I've already worked with the dissem.in team to take a citation list in marked-up format and score it for accessibility, and they are helping SJ Klein and myself on a project at MIT. However, there are some philosophical objections which keep them from handling accessibility for some of the use cases I have in mind. For example, they do not consider my nephew's article to be open. He shares it on Academia.edu. Google Scholar finds that article perfectly well. I tried DOAI with one of his articles and it fails to find the Academia.edu version as well.

I think we should arrange a Skype call in the next week or so, if you're able. It could be that there are ways ContentMine could help with this, or that we could build on each other's work.

For the Sprint, I am hoping there is someone who can explore using Bing, Baidu, or something else for the last of your bullet points.

I won't be available tomorrow.
Perhaps we can talk next week.

Thanks for your suggestions already.

-John


blahah commented 8 years ago

ps I have to go and do parenting duty - will catch up again in the morning :)

blahah commented 8 years ago

@JGDove99 I think I would agree with the dissem.in team that an article posted on Academia is not open unless it's under an open license. If it's just posted there without an open license, it's technically possible to read it, but it isn't open.

I'm interested in the problem of providing pragmatic access where open access is not available, but it's the long tail of the problem and has much lower payoff... eventually it's much easier to just use SciHub (which I am carefully not endorsing).

JGDove99 commented 8 years ago

There are use cases where the world is a better place when a curious mind is not unnecessarily fettered from reading something that an author wants them to have free access to.

Those use cases also present the opportunity to educate a scholar about better ways to share. My main scheme, and a couple of others easily imagined, allows both. SciHub is only STEM. The article my son needs (a four-page article published by Cambridge University Press in 1994, for which they want 35 euros [salaries for humanities instructors at Matej Bel University in Slovakia, where he teaches, are less than 1,000 euros a month]) is not in SciHub.

I think if we speak we'll find common ground, because I agree with you that either my nephew or academia.edu needs to change.

-john

PS: You might find this post interesting. It's about a lot more than SciHub. It has a good description of the use-case of an author assembling references for a paper to publish. And the Twitter trick using #icanhazPDF is one that I'd never heard of before. It successfully produced, by email, a copy of the Hungarian phonetics article for my son.

https://medium.com/@jamesheathers/why-sci-hub-will-win-595b53aae9fa#.1cf8ka2yi


bmkramer commented 8 years ago

@JGDove99 @blahah As a step towards my own use case (Sci-Hub data), but possibly also of interest here, I've now cobbled together a (crude) R script that queries the Dissemin API for a list of DOIs and returns a table with information on both OA availability (in so far as determined by Dissemin) and publisher policies regarding sharing preprint, postprint, and publisher versions (also from Dissemin, sourced from SHERPA/RoMEO).

Script and description here: Dissemin_API_R

And in reference to the discussion above, I have confirmed with the Dissemin/DOAI team that DOAI (not Dissemin) includes ResearchGate (but not Academia) in checking online availability: Twitter thread
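The tabulation step of a script like this can be sketched independently of the HTTP layer: given already-fetched per-DOI responses, reduce each to an availability row. The response shape assumed here (a `paper.pdf_url` field when a free copy is known) is an assumption about the Dissemin API, not verified in this sketch:

```javascript
// Hedged sketch: summarise OA availability for a batch of DOIs from
// Dissemin-style responses. Input: { doi: responseObject, ... }.
function summariseAvailability(resultsByDoi) {
  const rows = [];
  for (const [doi, response] of Object.entries(resultsByDoi)) {
    const paper = response && response.paper;
    rows.push({
      doi,
      // Treat a present pdf_url as "a free copy is known somewhere".
      oaAvailable: Boolean(paper && paper.pdf_url),
      pdfUrl: paper && paper.pdf_url ? paper.pdf_url : null,
    });
  }
  return rows;
}
```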