ropensci / unconf15

rOpenSci's San Francisco hackathon/unconf 2015
http://unconf.ropensci.org

Show that open software in a paper brings more impact #38

Open jure opened 9 years ago

jure commented 9 years ago

The idea was proposed yesterday by @npch, who wants to show that including software in your paper in an open manner (open license, link to repository) is correlated with that paper getting more citations, downloads, views, etc.

This would be a similar undertaking to what @hpiwowar and @tjv have done for open data: https://peerj.com/articles/175/

Since I'm currently also at the @SoftwareSaved Collaborations Workshop, I was thinking maybe it's an interesting hackathon project for both this unconf and the workshop. :beetle: :beetle: in one :facepunch:

ScottBGI commented 9 years ago

It would maybe be using overly broad brushstrokes, but would a similar approach to this paper, which tracked downstream outputs of patented vs non-patented genes, work for published proprietary vs non-proprietary (...hopefully open) software? http://www.nber.org/papers/w16213

Rather than HGP vs Celera, some institutions/funders have strong open source policies (Sanger/WT?) and others are more "Bayh-Dole Act" centric (Broad and others in the US?), so comparing their outputs might provide some useful data.

mfenner commented 9 years ago

Interesting, but maybe more work than what can be done in a hackathon. I guess the tricky part is to collect license information for software mentioned in papers - my guess is that for software not in code repositories that could be hard work.

For simpler questions, e.g. MIT/Apache vs. GPL licenses, or vs. no license, I have a dataset of software repos (originally from @jure) with citations from a variety of sources, including Europe PMC: http://software.lagotto.io. Focusing, for example, on the subset from GitHub (about 1,200 repos), it should be straightforward to get the license information.

An even simpler question would be to look at the stars/forks vs. type of license, or no license, or other metrics such as Facebook mentions.
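That "even simpler question" can be sketched as a small aggregation over repo metadata. A minimal sketch in Python, assuming input dicts shaped like GitHub API repository objects (the `license`/`spdx_id` and `stargazers_count` fields are the real GitHub API names; the sample data is made up):

```python
from collections import defaultdict

def stars_by_license(repos):
    """Group total star counts by SPDX license id.

    `repos` is a list of dicts shaped like GitHub API repo objects,
    e.g. {"stargazers_count": 10, "license": {"spdx_id": "MIT"}}.
    Repos with no detected license fall under "none".
    """
    totals = defaultdict(int)
    for repo in repos:
        lic = (repo.get("license") or {}).get("spdx_id") or "none"
        totals[lic] += repo.get("stargazers_count", 0)
    return dict(totals)

# Hypothetical sample data standing in for the ~1,200 GitHub repos.
sample = [
    {"stargazers_count": 120, "license": {"spdx_id": "MIT"}},
    {"stargazers_count": 40, "license": {"spdx_id": "GPL-3.0"}},
    {"stargazers_count": 15, "license": None},
    {"stargazers_count": 5, "license": {"spdx_id": "MIT"}},
]
print(stars_by_license(sample))  # -> {'MIT': 125, 'GPL-3.0': 40, 'none': 15}
```

The same grouping would work for forks or any other per-repo metric by swapping the counted field.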

npch commented 9 years ago

For me, the importance of showing that the number of citations correlates with mentions of software / openly licensed software is that the people we're trying to persuade with this argument don't recognise alt-metrics like forks/stars.

So saying that mentioning your software in the paper / openly licensing it gives you more stars doesn't make a persuasive argument, but saying it gives you more citations / faster citations is persuasive.

jure commented 9 years ago

@ScottBGI I guess it's a similar case, though I don't know how much of the methodology applies.

@mfenner I guess if the software is not in a code repository (and not attached in a .tar.gz or .zip file), it can't really count as open software? I imagine that's a contentious issue, but intuitively, there probably aren't a lot of pieces of software licensed openly and not available for download directly from the paper.

Speaking broadly:

  1. Identify a large sample of software publications, i.e. publications focused solely on presenting a piece of software (as a single data point, I have one like that too, with no source or repository included in the publication http://dx.doi.org/10.1021/ac1014832 and behind a paywall too)
  2. Determine a grading scale for each publication, something like: does it include a link to a repository/archive, does the repository/archive include information about a license, is the license one of the approved ones by http://opensource.org/licenses
  3. Get metrics for each paper (from software.lagotto.io, parascope.io, det.labs.crossref.org, etc.)
  4. ???

For the sake of easier methodology, and potentially a more persuasive initial argument as @npch points out, we could focus solely on citations and disregard the wealth of usage information out there, for now.
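Step 2's grading scale could be sketched as a simple scoring function. A toy version, assuming a publication record with hypothetical `repository_url` and `license` fields (the OSI-approved set below is a tiny, illustrative stand-in for the full list at http://opensource.org/licenses):

```python
# Tiny, non-exhaustive stand-in for the OSI-approved license list.
OSI_APPROVED = {"MIT", "Apache-2.0", "GPL-3.0", "BSD-3-Clause"}

def openness_grade(pub):
    """Score a software publication record on the proposed scale:
    one point each for (a) linking a repository/archive, (b) declaring
    a license, and (c) that license being OSI-approved.
    Field names here are hypothetical, not from any real dataset."""
    score = 0
    if pub.get("repository_url"):
        score += 1
    if pub.get("license"):
        score += 1
    if pub.get("license") in OSI_APPROVED:
        score += 1
    return score

print(openness_grade({"repository_url": "https://github.com/lh3/seqtk",
                      "license": "MIT"}))          # -> 3
print(openness_grade({"license": "proprietary"}))  # -> 1
```

A graded 0-3 scale like this would let the citation analysis test for a dose-response relationship rather than a binary open/closed split.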

jure commented 9 years ago

One way to identify software publications would be to follow the pattern `NAME: description including the keyword software`:

- MEGA2: molecular evolutionary genetics analysis software
- EMBOSS: the European molecular biology open software suite
- GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism
- Arlequin: a software for population genetics data analysis

etc.
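That pattern translates almost directly into a regular expression. A minimal sketch (the regex and its tolerance for false positives/negatives are my own assumptions, not a vetted classifier):

```python
import re

# "NAME: description containing the word 'software'" (case-insensitive)
SOFTWARE_TITLE = re.compile(
    r"^(?P<name>[^:]+):\s*(?P<desc>.*\bsoftware\b.*)$", re.IGNORECASE
)

titles = [
    "MEGA2: molecular evolutionary genetics analysis software",
    "EMBOSS: the European molecular biology open software suite",
    "GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism",
    "Arlequin: a software for population genetics data analysis",
    "A survey of population genetics methods",  # no match: no NAME prefix
]
for t in titles:
    m = SOFTWARE_TITLE.match(t)
    if m:
        print(m.group("name"))
# -> MEGA2, EMBOSS, GENEPOP (version 1.2), Arlequin
```

Run over a title corpus this would give a candidate set of software publications for step 1, to be filtered by hand afterwards.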

mfenner commented 9 years ago

@npch I completely agree, citations of software in the scholarly literature are much more important than social metrics such as stars or forks.

As a strawman I looked at the five repos in my set with the most citations using Europe PMC fulltext search (http://software.lagotto.io/sources/europe_pmc_fulltext):

jure commented 9 years ago

Your strawman shows a bit of the expected complexity. For example, https://github.com/lh3/wgsim/blob/master/wgsim.c is MIT licensed, but there is no LICENSE file in the repository or any mention of this in the README. The same goes for https://github.com/lh3/seqtk/blob/master/seqtk.c, also MIT licensed.

It's probably safe to infer that both of these are MIT licensed, which makes the 5 top cited examples from your dataset MIT licensed.
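When there is no LICENSE file, the inference above can be partly automated by scanning source-file headers for well-known license phrases. A crude heuristic sketch (the phrase list is illustrative, not exhaustive, and header text in the example is abridged):

```python
def infer_license_from_header(source, max_lines=50):
    """Crude fallback for repos without a LICENSE file: scan the first
    `max_lines` of a source file (e.g. wgsim.c, seqtk.c) for well-known
    license phrases. Heuristic only; the keyword list is illustrative."""
    head = "\n".join(source.splitlines()[:max_lines]).lower()
    if ("permission is hereby granted, free of charge" in head
            or "mit license" in head):
        return "MIT"
    if "gnu general public license" in head:
        return "GPL"
    if "apache license" in head:
        return "Apache"
    return None  # no recognisable license text found

header = """/* The MIT License
   Copyright (c) 2008 Genome Research Ltd (GRL).
   Permission is hereby granted, free of charge, ... */
int main(void) { return 0; }
"""
print(infer_license_from_header(header))  # -> MIT
```

A `None` result would feed into the "no license" bucket of the grading scale rather than being silently dropped.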

mfenner commented 9 years ago

Thanks @jure, I didn't dig deep enough for license information.

npch commented 9 years ago

I'm really interested in what effect mentioning software in a "traditional" paper has on that paper's citation profile.

e.g.

So I think what I'd like to try and do is:

mfenner commented 9 years ago

One nice test dataset is the OA corpus in Europe PMC. They have a nice fulltext search API, where you can for example check where the match was found (e.g. reference list vs. methods section). @njahn82 has done some nice work on this using R: https://github.com/njahn82/dvcs_epmc
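Querying that API can be as simple as building a fielded search URL. A minimal sketch against the Europe PMC REST search endpoint (the `BODY:` and `OPEN_ACCESS:` field syntax reflects Europe PMC's fielded search as I understand it; verify against their current docs before relying on it):

```python
from urllib.parse import urlencode

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def epmc_fulltext_query(term, open_access_only=True):
    """Build a Europe PMC REST search URL for a full-text term.
    Restricting to OPEN_ACCESS:y keeps results in the OA corpus,
    where section-level match information is available."""
    query = f'BODY:"{term}"'
    if open_access_only:
        query += " AND OPEN_ACCESS:y"
    return EPMC_SEARCH + "?" + urlencode({"query": query, "format": "json"})

url = epmc_fulltext_query("seqtk")
print(url)
# Fetch with e.g. requests.get(url).json() when online.
```

The JSON response can then be paged through to count mentions per software name for the citation analysis.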

ebressert commented 9 years ago

On the arXiv you can do a full-body search on all pre-prints and match the articles to systems that track citations. The full-body search would allow us to find the open source packages without having to download the articles or text. We just need the article headers and enough info to track their citation rates.

There are over a million articles on arXiv covering physics, mathematics, computer science, quantitative biology, quantitative finance and statistics. Should be a rich resource?

mfenner commented 9 years ago

@ebressert could you provide one or more links on how to do a fulltext search on arXiv? Probably I didn't look hard enough, but I couldn't find a good starting point.

ebressert commented 9 years ago

@mfenner On the landing page there's a search bar in the top right. Next to the search bar there's a drop-down menu where you can select "Full text". Check out the screenshot for a visual below.