ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

R Package citation tools #24

Open njtierney opened 7 years ago

njtierney commented 7 years ago

Citing R packages in papers and analyses should happen more than it does.

I think it'd be great to have a tool that collects all the references for all R packages used in an rmarkdown file, a script, or even an RStudio project. Perhaps this would collect the output of citation() and put it into a .bib file.
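As a rough illustration of the idea above, here is a minimal sketch that collects citation() output for a few packages into one .bib file using only base R. The helper name write_package_bib is hypothetical and exists only for this example. Entries that R generates automatically usually have an empty BibTeX key, so keys may need to be added by hand afterwards.

```r
# Sketch only: `write_package_bib` is a hypothetical helper, not an existing function.
write_package_bib <- function(pkgs, file = "packages.bib") {
  entries <- lapply(pkgs, function(pkg) {
    # citation() uses the package's CITATION file when there is one,
    # otherwise it builds an entry from the DESCRIPTION metadata.
    utils::toBibtex(utils::citation(pkg))
  })
  # Separate entries with a blank line and write them all to one file.
  lines <- unlist(lapply(entries, function(e) c(as.character(e), "")))
  writeLines(lines, file)
  invisible(file)
}

# Usage: base packages are always installed, so they make a safe demo.
bib_file <- write_package_bib(c("stats", "utils"),
                              file = tempfile(fileext = ".bib"))
```

For base packages such as stats, citation() returns the citation for R itself, so the demo file contains the R reference twice.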

@cboettig has done some really cool work on this here, and I've got a little github gist.

Perhaps this could also use some of the work in packup to find R packages etc.

jsta commented 7 years ago

I often see people citing R core instead of individual packages which I find very frustrating.

Pakillo commented 7 years ago

Good idea. I think in most cases people are just not aware they should cite packages as well as R core... So there's a lot of awareness work to do (e.g. when reviewing papers).

But for those who already want to cite packages, it's true it would help to make that as easy as possible. For people using R markdown, I think it's already quite straightforward, e.g. papaja has r_refs and cite_r functions to make a bibliography of all the packages used. There is also LoadandCite in the repmis package.

So there are some tools out there. But perhaps the process can be made even easier. Particularly for people using Word or similar to write papers (big majority by now). Something like a cite_packages function that scans your script for package calls/usage and produces a txt/Word/PDF file with formatted citations for all those packages (in the bibliography format of choice i.e. using CSL files). Something they can paste directly in their manuscript reference list... Would that help?

njtierney commented 7 years ago

Awareness is definitely an issue worth considering! I think that it just doesn't occur to so many people. Not sure how to go about this. One approach might be to write letters to journals of various fields, or perhaps get statistical societies such as ASA, SSA, and RSS to make official statements.

I hadn't seen those R packages before, they look really neat! :)

And yes, I reckon that's the idea - have a function that scans the script for packages, and this produces a txt/word/pdf with nice citations. And perhaps a .bib file as well.

It looks like papaja does some of what we need already, which is great! But maybe it could be useful/easier to think about if the R package that does this citation magic has only a couple of functions, and has one goal: Make citations of R packages easy.

Pakillo commented 7 years ago

Yes, I agree. And the building blocks are already out there: (i) scan scripts for packages used, and (ii) create .bib file and/or rendered text with formatted citations for copy-pasting into ms. Just need to put everything together. And increase awareness for package citation!

njtierney commented 7 years ago

Sounds great!

Just had an idea. This package will make it easy for users to cite R packages.

What if it also makes it easy for package builders to make their work more citable?

This might involve getting a DOI, or adding info to the inst/CITATION file or something?
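For context, an inst/CITATION file is ordinary R code that builds one or more bibentry() objects. The sketch below shows what such an entry might look like. The package name mypkg and all metadata in it are placeholders invented for illustration.

```r
# Sketch of what a hypothetical inst/CITATION file could contain.
# The package name, title, author, year, and URL are all placeholders.
entry <- utils::bibentry(
  bibtype = "Manual",
  title   = "mypkg: An Example Package",
  author  = utils::person("Jane", "Doe"),
  year    = "2017",
  note    = "R package version 0.1.0",
  url     = "https://example.org/mypkg"
  # A `doi` field can be added here once a DOI has been minted.
)

# Preview the entry as formatted text and as BibTeX.
print(entry, style = "text")
print(utils::toBibtex(entry))
```

Inside a real CITATION file the bibentry() call stands on its own. The assignment to entry is only there so the result can be inspected interactively.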

Just an idea.

noamross commented 7 years ago

Maybe one output could be a PR to devtools to add devtools::use_citation().

devtools::get_zenodo_doi() could be a related function (use_citation(get_zenodo_doi=TRUE)?) that would call https://github.com/ropensci/zenodo . That package needs some love to bring it up to date with the Zenodo API and finish it off. Perhaps that could be another unconf project.

njtierney commented 7 years ago

Sounds great! :)

cboettig commented 7 years ago

Interesting thread. A few issues I think are worth keeping in mind:

Not to derail the thread, but I'm slowly coming round to the view that within the context of papers, citing software papers rather than software directly is the most practical route. While I appreciate the importance of provenance identifying the correct version; from a reproducibility standpoint I really think we have to focus on publishing code (e.g. the R compendium approach provides a much more robust though not flawless way of tracking versions of packages used in code from an analysis), while from a credit standpoint I think that system is both fundamentally flawed and not likely to change quickly. Thus from the view of the idealist it doesn't make sense to me to piggyback on such a system as a good way to track credit or provenance in software use. From the viewpoint of the realist working within the constraints of the current system, I think promoting the use of software papers as 'buckets for collecting citations' is the most practical hack we have available. I'd be happy to be convinced otherwise!

Meanwhile, I'm deeply interested in out-of-the-box solutions to the credit and attribution issue; things like the transitive credit idea of @danielskatz & @arfon that might be built out from software metadata directly.

noamross commented 7 years ago

  1. CITATION contains the preferred citation of the package authors, and is often the relevant software paper, so in general, I think encouraging use of/pulling from CITATION makes sense.
  2. Indeed, I tend not to cite all the packages I use, but those that are core to my methodology. But I think it would be useful to have a project_citations() function that would grab all the citations for packages used in a project/repo, which the user could then choose from and then export appropriately.
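One way to support the "choose from" step is an overview of which packages ship their own CITATION file, i.e. have an author-preferred citation. This is a sketch under that assumption. The name citation_overview is hypothetical and is not the project_citations() function proposed above.

```r
# Sketch only: `citation_overview` is a hypothetical helper.
citation_overview <- function(pkgs) {
  data.frame(
    package = pkgs,
    # system.file() returns "" when the package has no CITATION file.
    has_citation_file = vapply(
      pkgs,
      function(p) nzchar(system.file("CITATION", package = p)),
      logical(1)
    ),
    # Number of entries citation() would return for the package.
    n_entries = vapply(
      pkgs,
      function(p) length(suppressWarnings(utils::citation(p))),
      integer(1)
    ),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

overview <- citation_overview(c("stats", "utils"))
print(overview)
```

A user could then filter overview by hand, keep only the packages that are core to the analysis, and export just those.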

danielskatz commented 7 years ago

The Software Citation Principles paper (https://doi.org/10.7717/peerj-cs.86) talks about this.

In particular, from the principles themselves:

Software should be considered a legitimate and citable product of research.

Software should be cited on the same basis as any other research product

Software citations should facilitate access to the software itself

Software citations should facilitate identification of, and access to, the specific version of software that was used.

And from the discussion section:

If a software paper exists and it contains results (performance, validation, etc.) that are important to the work, then the software paper should also be cited. We believe that a request from the software authors to cite a paper should typically be respected, and the paper cited in addition to the software.

[Regarding identifiers that are cited] For software, we recommend the use of DOIs as the unique identifier due to their common usage and acceptance, particularly as they are the standard for other digital products such as publications.

we recommend that the software identifier should resolve to a persistent landing page that contains metadata and a link to the software itself ... [as] currently offered by services such as figshare and Zenodo

So, if you've read this far, I would summarize this, plus a bit of my personal opinion as:

  1. Cite the software itself, and an archived version of the software, not the software on GitHub, nor a software paper.
  2. Cite the software that you actually use directly that is important to you, but not software that it uses (this is where transitive credit [http://doi.org/10.5334/jors.by] should come in), or common software that is not really important to your work
  3. If there's a software paper that you want to cite too, go ahead, but not instead of citing the software
  4. If there's a CITATION file, cite what it says too, but not instead....
  5. Regarding gathering citations across versions, that's where a Group ID (https://danielskatzblog.wordpress.com/2016/04/17/to-better-understand-research-communication-we-need-a-groid-group-object-identifier) should come in

noamross commented 7 years ago

We could do a few things to encourage the practices @danielskatz describes:

njtierney commented 7 years ago

Just wanted to add a couple of ideas I've had.

# automatically create a bib database for R packages
knitr::write_bib(c(
  .packages(), 'bookdown', 'knitr', 'rmarkdown'
), 'packages.bib')

You can see more info in the help file (?knitr::write_bib).

On that note, does anyone know if there is a way to access the HTML help file? For example, when you type ?pkg::fun, are the help files for all of CRAN stored somewhere?

njtierney commented 7 years ago

HTML Link to write_bib from rdocumentation

njtierney commented 7 years ago

Also, perhaps we can include links or a way to pull Citation Style Language (.csl) files from their amazing repo

MilesMcBain commented 7 years ago

Thanks for mentioning packup, but it's just a bit of fun that doesn't do a whole lot. I think the only useful thing there is going to be this regular expression which I hereby donate to this project for unlimited use of any kind.

Do you think we would want to execute this function against individual source files? It seems more likely that we want to find all package dependencies starting from the root of the project and working down. That's not so hard but there are always edge cases. We'll want to match require, requireNamespace, attach, and namespace::. Potentially also import::from and modules::import, conditional on their parent packages being detected. Wouldn't it be simpler if people would just put all this in some kind of DESCRIPTION file? :P
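To make the matching problem concrete, here is a minimal sketch of a regex-based scanner in base R. It is not the regular expression from packup. The names extract_group and find_packages are hypothetical, and including library() in the pattern is an assumption added for this example. It handles only the simple cases and ignores edge cases such as import::from or a "#" inside a string.

```r
# Sketch only: all function names here are hypothetical.

# Return the first capture group of `pattern` for every match in `text`.
extract_group <- function(pattern, text) {
  hits <- unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
  sub(pattern, "\\1", hits, perl = TRUE)
}

find_packages <- function(code) {
  # Crude comment removal; this also cuts at a "#" inside a string.
  code <- sub("#.*$", "", code)
  # library(pkg), require(pkg), requireNamespace("pkg")
  call_pattern <- "\\b(?:library|require|requireNamespace)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)"
  # pkg::fun and pkg:::fun
  ns_pattern <- "\\b([A-Za-z][A-Za-z0-9.]*):::?"
  sort(unique(c(
    extract_group(call_pattern, code),
    extract_group(ns_pattern, code)
  )))
}

example_code <- c(
  "library(ggplot2)",
  "require('dplyr')",
  "if (requireNamespace(\"knitr\", quietly = TRUE)) knitr::kable(head(mtcars))",
  "x <- stats::median(1:10)  # library(notused)"
)
found <- find_packages(example_code)
print(found)
```

To scan a whole project rather than one file, the same function could be applied to the lines read from every file returned by list.files(path, pattern = "\\.[Rr]$", recursive = TRUE).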

So it turns out declarative statements about the research namespace are quite useful in this situation, in addition to #22 .

njtierney commented 7 years ago

Sorry this is a bit rushed!

Summary/wrap up of the citation package/issue

Citing packages in R can be a bit tricky, in particular we discussed citation around the topics:

I see two distinct paths for this project at the unconf:

  1. A set of functions to assist the process of making citations for software and using citations for software
  2. A discussion and collation of how software can be cited and how it should be cited.

Personally, at the moment I am more interested in tackling the ideas around issue #5, but I still feel that this is an important topic, so would prefer not to close it.

I think that topic 1 is perhaps more suited to the unconf, and the idea of software provenance is much more strongly tied into issue #5. I have moved the relevant discussion on topic 2 into a google document, here.

Possible functions for this package

Regarding the possible functions that might be created, these might either be wrapped up in their own package, or perhaps even as a pull request to devtools or goodpractice.

We also discussed the idea of boosting signal of packages and citation, in particular, some ideas discussed were:

danielskatz commented 7 years ago

cc @mfenner @npch as co-leads of https://www.force11.org/group/software-citation-implementation-working-group

njtierney commented 7 years ago

OK, quick note here that @yihui has an awesome function in knitr.

knitr::write_bib(x = c("knitr", "ggplot2"), file = "test.bib")

This will generate a .bib file named "test.bib", with reference hooks written for knitr, and ggplot2 - automatically labelled as R-knitr and R-ggplot2.

Pakillo commented 6 years ago

Hi,

In some spare time I had a go and made a very basic prototype: https://github.com/Pakillo/grateful. Incomplete and probably buggy, but seems to work already.

Feel free to improve, modify, fork, or reuse if you find something useful in it

Cheers

maelle commented 6 years ago

See also https://github.com/dirkschumacher/thankr by @dirkschumacher

gmbecker commented 6 years ago

Very late to this discussion, but I do have some thoughts and wanted to give my two cents as I've thought about this a fair bit.

@cboettig I'm not sure I fully agree with what you said, although major parts of it align with my views. I agree that citing what I'm going to call "mechanical" dependencies of a package in a scientific paper context doesn't make much sense (with, imho, one exception which I'll get to at the end). On the other hand, I don't know that I agree about the wrapper package issue. Assumedly there is some reason you're using the wrapper rather than not, and that work seems, in at least some cases, worthy of citation in addition to citing the underlying work being wrapped.

Honestly, a murkier situation is actually that of your citation tools, which are "core" to the construction of the manuscript itself but completely divorced from the science being discussed. That case is pretty tough, and I think not citing the work is both unfortunate/problematic and ultimately justifiable.

I don't agree with the implication (if intended) that packages without corresponding papers shouldn't be cited. That seems to me like saying you wouldn't cite a dataset which only appears on figshare and isn't associated (yet, perhaps?) with a first-class publication. Maybe that is the case, but if so I'd disagree there as well... Also, as a practical matter, writing proper software research papers (JOSS is not that, but that is a discussion for a different time) is hard. I know because I write them for a living (well, part of it), but you don't have to take my word for it. I've heard estimates of software papers being the result of 2x (Luke Tierney, R-core + accomplished statistician) to 3x (Michael Kane, bigmemory author) the amount of work for an equivalent science/statistics paper. Expecting software authors to do multiple times the equivalent amount of work before we are willing to cite them seems pretty .... not great.

And finally, as an aside (this is quite long already, I know), I don't think we should ever be citing any R packages without also citing R itself. R both rose out of legitimate statistical computing research (Ihaka and Gentleman) and is the result of a ton of largely unappreciated subsequent research and maintenance work by R-core.

cboettig commented 6 years ago

Hi @gmbecker, I think we're basically on the same page actually. Basically the question boils down to one of audience: I'm all for continuing to advocate that those writing papers always cite the software they use, including wrappers, including base R, including meaningful dependencies, but (sadly) not including wonderful authoring or platform tools (rmarkdown, pandoc, citation tools, runtime platforms like docker or OS level things, etc). Citing the packages, and citing specific versions of the packages is and should be standard.

As either realist or cynic though, I think in our advocating of such goals we owe a greater acknowledgment of (A) the technical limitation that citation will never serve as a robust package management system for dependencies and (B) the social limitation that the citation-as-credit model is a badly flawed system to its core, which was co-opted rather than ever intended to provide metrics of credit and importance. Though citing software might go some way to improving the recognition for the value of software, but far from being the gold standard we should aspire to, it can at very best only hope to replicate the flawed system we have entrenched for papers.

I would much prefer our community to focus time & energy on envisioning a better system from the ground up with brand new ideas -- I think the transitive credit notion of @arfon and @danielskatz is a great example of this. Approaches we can compute programmatically and not rely on the foibles and politics of individual authors. The idealist in me would much rather put effort into this chance to avoid rather than repeat the problems of citations as metrics.

Meanwhile, the pragmatist in me recognizes that scientific software developers need credit now, and asking people to just cite software isn't going to do them justice. I agree entirely with you that writing software papers is hard, but I also believe that time is not wasted as it strengthens the software and the use of it. Again, I applaud the creative solutions in this space such as JOSS and others that seek to lower these barriers. If someone pursuing an academic career today asked me how to get credit for their efforts, I would still tell them to write a paper and ask people to cite that paper, merely as a citation bucket. People are still more likely to cite papers, and people are still more likely to pay attention to papers with high citation counts than they are to software. I think we all agree this is unfortunate but also our current reality.

To the extent that we want practical advice that acknowledges current social norms, I think the software paper is the most practical option. To the extent that we want to change the social & cultural norms for academic credit, why build on the buggy legacy code of citation when we have the opportunity for a ground-up rewrite?

danielskatz commented 6 years ago

I feel compelled to repeat/quote this:

Though citing software might go some way to improving the recognition for the value of software, but far from being the gold standard we should aspire to, it can at very best only hope to replicate the flawed system we have entrenched for papers.

Which I assume refers to https://openmetrics.jiscinvolve.org/wp/2017/11/leaving-gold-standard/ ...

gmbecker commented 6 years ago

@cboettig it does seem like we largely agree. I must admit I'm somewhat confused by the point (which has come up a few times in this thread) that citations are not useful as a reproducibility/provenance mechanism. I agree, completely, but I guess I'm wondering why that's relevant here? Are you aware of proposals that they be used as such? I'm not, and if I were I would more-or-less immediately discount them.

In my mind, citations are largely for one thing (with a few other notable consequences/corollaries to it): acknowledging research/work not done by the manuscript authors which meaningfully (i.e. scientifically) informed/contributed to the work being discussed. The secondary purposes of citations, imho, are to increase visibility of said other work, and to place the work discussed in a manuscript within a larger context.

The extension of the above to software seems, to me, to be something along the lines of the following simple test: Did you use the software to do or implement something discussed in the manuscript? If yes, citing is "necessary", if not, it isn't, though it still may be appropriate.

So if you present or discuss in text a GAM fit, you should cite the software you used to fit it. If you make reproducibility claims in the text, you should cite the software you used to guarantee them; if not, you don't need to. Etc.

I'm also a bit confused about the hard distinction being made between citing software "directly" and citing software papers. In my mind, the software either needs to be cited or it doesn't. If it does, and there's a paper, cite that; if not, cite the package/python module/commandline tool/whatever. I don't think a lack of a paper is a good reason to leave the citation off altogether. (I do agree with your assertion that the author should be cited in whatever manner s/he prefers, though in practice I think that will align almost perfectly with the above rule...)

danielskatz commented 6 years ago

Regarding what software to cite, you might want to read the first discussion section (What software to cite) in https://doi.org/10.7717/peerj-cs.86 (sorry, PeerJ doesn't have section numbers which makes this a bit confusing) for some thoughts

Maybe a question to illustrate the point is if you would cite Microsoft Word for a paper you were writing using Word? Or MacOS 10.13.1, if you were using a Mac to write the paper. My feeling is that you would not cite these, because they are not really important to the research results.

The tie to provenance and reproducibility is that you would need to state the OS you used, plus a lot of other details, to say how you did the research, even if these are not things you would feel a need to give credit for.

cboettig commented 6 years ago

@gmbecker

citations are not useful as a reproducibility/provenance mechanism. I agree, completely, but I guess I'm wondering why that's relevant here?

In making the case for citing software (e.g. Software Citation Principles Dan linked above, which I endorse), we say to cite the version of software because that is important for reproducibility, knowing if a buggy version was used, etc. I agree with all this in principle; in practice I think we just get buggy citations; we would get a far better handle on these issues using a non-citation-based tool (e.g. packrat manifest or equivalent as part of the supplement). By all means we should encourage users to follow these principles, but we should not be surprised when they continue to view citing R as equivalent to citing the operating system, or your BLAS libraries, (or the papers behind those algorithms -- it is turtles all the way down).
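As a rough stand-in for the "packrat manifest or equivalent" idea, the sketch below records the R version and the versions of the currently loaded packages in a CSV file that could go in a supplement. It uses only base R and is not packrat's own format. The helper name write_version_manifest is hypothetical.

```r
# Sketch only: `write_version_manifest` is a hypothetical helper and the
# CSV layout is an assumption, not packrat's lockfile format.
write_version_manifest <- function(pkgs, file = "package-versions.csv") {
  manifest <- data.frame(
    package = pkgs,
    version = vapply(
      pkgs,
      function(p) as.character(utils::packageVersion(p)),
      character(1)
    ),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
  # Put the R version itself on the first row.
  manifest <- rbind(
    data.frame(package = "R", version = as.character(getRversion()),
               stringsAsFactors = FALSE),
    manifest
  )
  utils::write.csv(manifest, file, row.names = FALSE)
  invisible(manifest)
}

# Usage: record everything loaded in the current session.
manifest_file <- tempfile(fileext = ".csv")
manifest <- write_version_manifest(sort(loadedNamespaces()), manifest_file)
print(head(manifest))
```

Unlike a citation, a file like this can list every dependency without any judgment about which ones deserve credit.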

To me, citations are a pre-digital-era invention designed to provide a pre-digital provenance trail of ideas to build a supporting and verifiable trail of evidence. Citations were then co-opted as a measure of importance and thus a measure of credit, and have become normalized as an acceptable way of acknowledging credit, which has distorted the way citations are used. I think we can agree citations are a deeply flawed metric (if you cite my software only to say it is a buggy piece of junk that's just boosting my score), and I think we could build something much better from scratch to get a more realistic and nuanced picture of the importance of things.

Obviously citations are deeply entrenched in the publishing culture and not going away any time soon. Nor is the snobbery that papers count and software does not going to vanish overnight. To the extent that we must play the credit game, my advice to people writing software is to play it under the current rules and use JOSS. To the extent that we want to advocate for a better system of getting credit, I would prefer we didn't try to hack it onto the already-overloaded citation practices.

gmbecker commented 6 years ago

@danielskatz I think(?) this section seems to be in broad agreement with what I was trying to say. Though it does explicitly say it's not going to talk about which software should be cited, which is what I was trying to get at.

I do have to say, in my opinion expecting citations to do the work of provenance is a problematic expansion of scope of purpose. I agree with Carl and others that they're not well suited for that, and would add that it's not really a direct extension of what I understand the purpose of traditional citations to be.

gmbecker commented 6 years ago

@cboettig didn't see your last message before I sent the one above, so I'll send a new message and discuss this further.

I think provenance and credit are both crucially important in improving the quality of results and the social environment and recognition that academically meaningful work comes in many forms. I just don't, at all, think that they are the same. Conflating the two problems by attempting to solve both of them with a single practice is likely to leave both component issues underserved in comparison to properly, and independently, solving them separately. At least that's my opinion.

It also seems, in one sense at least, easier. I suspect (i.e. conjecture without any evidence) that journals would be happier to demand provenance information in the form of supplementary materials (like many/most now do with the data) than they would to fundamentally change what citation "means" and its intended purpose.

I personally dislike JOSS because if you show a "real scientist" (emphasis on the scare quotes) a JOSS paper I think their response is going to be "well that's not a real paper, what is that?" which undermines the concept of software papers representing real, valuable research. It just seems so easy for it to socially backfire. I know my opinion on that is not the popular one, though. Probably has something to do with the fact that I consider myself to do actual research, the description of which ends up being software articles very different from JOSS articles which I think are primarily about credit for writing software.