ourresearch / depsy

Track the impact of research software.
http://depsy.org
MIT License
190 stars, 11 forks

co-authors vs git commits #26

Open mr-c opened 8 years ago

mr-c commented 8 years ago

Hello,

The khmer software paper lists (nearly) all of our github contributors; this seems to erase the weighted impact measurement based upon commits & files modified?

http://depsy.org/package/python/khmer

I suggest that where there is both co-authorship and VCS data, the VCS data is used to split up the co-authorship credit.

ethanwhite commented 8 years ago

:+1:

jasonpriem commented 8 years ago

This is an interesting one. As many people have observed, authorship and credit are super complicated, and deeply embedded in the set of social and political structures that make academia. Making people think harder about how we do that is a big part of Depsy's goal.

The cool thing about software is that we've got these two sources of authorship data (commit logs and authorship lists), and as these two sources collide, they expose the (good and bad) assumptions each system brings with it.

The commit log is measurable, objective, granular. The author list is political, embedded, subjective.

Where we can, we're using the commit log because 1) it's a different and new approach that we hope will spark conversation, 2) it allows the long tail of committers to get more much-deserved credit, and 3) we're people who believe that lots of things benefit from objective measurement.

But where we've got authorship information, we're erring pretty hard toward trusting that over the commit log, simply because when it comes time for the reward system to, well, reward--the authorship lists are the ones that matter. The reward system is political and embedded and subjective in the same ways that the negotiated author list is, and the way the authorship list is created is not ignorant of this.

Placing a person with one commit in the authorship list means something. There's signal there. Maybe that was the most important commit. Maybe that person had the idea for the project. Maybe whatever. Point is, the authors say that person is important, and they are in a position to know better than us, so we believe them.

danielskatz commented 8 years ago

It's too bad there's not a system for people to define contributions in either a numerical or at least relative sense. There's no way depsy can guess and win - and it shouldn't have to. Maybe Depsy should say what it wants (a file in some format at the top level of the repo?) and see if developers can/will supply it.

ethanwhite commented 8 years ago

@jasonpriem - you make a bunch of great points about the complexity involved in this decision.

My main concerns with just using authorship as is done for the Python packages are:

  1. Most researchers have the idea that there is some minimum level of contribution that designates authorship. This is an inherent limitation of a binary designation of contribution. Recently, when @ctb's lab included everyone with at least one commit as authors on a software paper, there was some controversy about this decision (see the comments; plus lots of discussion on social media). No one denied that folks with small numbers of minor commits had made a contribution; they questioned whether it was enough to be made an author. Some folks raised a distinction between an author and a contributor. Commit logs allow a more continuous measure of contributorship, and I think giving proportional credit to anyone who contributes is a good way for efforts like Depsy to encourage good open source citizenship.
  2. Relying on authorship entirely in these grey areas will end up providing more credit to those who are more aggressive about asking to be added as authors. There will likely be a number of biases in who will be willing to ask for this credit. This is also my concern with @danielskatz's recommendation in that many folks (both contributors and maintainers) won't want to engage in discussions/debates over exactly how much credit each contributor should receive.
  3. As implemented, the current approach simply gives equal credit to all authors. This isn't right in most cases, but given cultural differences among fields it is probably the only reasonable solution (since some fields sort author lines alphabetically, others by contribution, others by contribution but with last author being a major position of importance, etc.). Using commit data presumably provides some insight into relative contributions that is closer to reality than assuming equal contributions.
  4. Some projects eschew listing out authors and list an organization as the author (e.g., see https://pypi.python.org/pypi/ipython). This is likely to be more common for bigger more important projects with more contributors and risks none of the credit for these efforts being counted.
  5. My impression of the standing culture for Python code is to only list a single author since the metadata field isn't designed to include multiple authors. This is how we did things for https://pypi.python.org/pypi/retriever until I noticed that only the original lead developer was being picked up by Depsy.

I also have a concern about just using commit logs, which reflects @jasonpriem's point that sometimes people can make very important contributions to a project without generating a lot of commits. E.g., I know folks who do a lot of code review on projects and participate actively in design discussions, but this might not show up in commit logs. These folks will often be included as authors even if they only have a small number of commits (or even none).

So, all that being said, what about some sort of weighted combination of the two approaches when both authorship information and commit logs are available? I don't know exactly what the right answer is (and maybe there isn't one), but what about assigning 50% of the credit evenly across designated authors and 50% of the credit based on relative contributions from commit logs? This probably isn't perfect, but it seems to incorporate the benefits of both approaches and makes weaker assumptions about the intent of the developers in designating authors.
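
The 50/50 split proposed above could be sketched roughly as follows. This is only an illustration of the idea, not Depsy's code: the function name `blended_credit`, the `author_weight` parameter, and the fallback behavior when one data source is missing are all assumptions made for the example.

```python
from collections import Counter


def blended_credit(authors, commit_counts, author_weight=0.5):
    """Split one unit of credit between listed authors and committers.

    Hypothetical helper: `author_weight` of the credit is shared evenly
    among `authors`; the rest is shared in proportion to `commit_counts`.
    """
    if not 0.0 <= author_weight <= 1.0:
        raise ValueError("author_weight must be between 0 and 1")

    authors = list(dict.fromkeys(authors))  # drop duplicates, keep order
    total_commits = sum(commit_counts.values())

    if not authors and total_commits == 0:
        return {}
    # Assumption: if one source is missing, the other gets all the credit.
    if not authors:
        author_weight = 0.0
    elif total_commits == 0:
        author_weight = 1.0

    credit = Counter()
    for name in authors:
        credit[name] += author_weight / len(authors)
    if total_commits:
        for name, commits in commit_counts.items():
            credit[name] += (1.0 - author_weight) * commits / total_commits
    return dict(credit)


if __name__ == "__main__":
    # alice is an author with most commits, bob is an author with none,
    # carol committed but is not on the author list.
    print(blended_credit(["alice", "bob"], {"alice": 90, "carol": 10}))
```

With these made-up numbers, an author with no commits still keeps a quarter of the credit, while a committer who is not on the author list gets a small but nonzero share.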

jasonpriem commented 8 years ago

@danielskatz Yup, agreed...we're never going to know exactly given the low signal provided by the present system. I like the idea of a file people can use to specify exactly....it could leverage one of the many taxonomies or controlled vocabularies out there already for specifying effort. Many people have suggested it'd be awesome to have something like the credits at the end of a film...whether you're Best Boy or Director, you get appropriate credit for your contribution.

It's interesting to consider whether it's Depsy's role to encourage folks to include these kinds of files. So far, we've mostly looked at the project as a way of demonstrating how to leverage whatever data is already there. But your suggestion could be a next step...we'll be listening to hear if there are other calls for something like this, for sure.

jasonpriem commented 8 years ago

@ethanwhite thanks for weighing in, there are some great points here that deserve a longer response...will get back in more detail in next few days...

jasonpriem commented 8 years ago

@ethanwhite Academia (and general usage of the term for that matter) views authorship as a boolean, not a float. You don't have an amount of authorship on a book, or a painting, or a song, or a paper....you either are an author or you ain't. And so you get the social or economic rewards of the "author" role, or you don't. (Blaise Cronin has been doing fascinating work for years on how authorship/acknowledgement works in academia. Esp recently, he's done great stuff looking at changes in authorship, and comparisons with other domains like art)

Upshot is, I think we agree that this yay-or-nay conception of authorship is impoverished: it's exceedingly inviting to political manipulation and bias, it's vague, it's often unfair, and it's largely a legacy of print-based thinking.

But Depsy's approach is to say: we don't make the rules. The people with the money do. This accomplishment of inducing (by some combination of begging, cajoling, cheating, leveraging, working, or whatever) people to call you "Author": the money-distributors and prestige-distributors care about that accomplishment. In fact, in practice they generally care about it more than they care about your actual intellectual contribution to a given product. Because it suggests you will be able to negotiate/earn the label called "Author" in the future as well, and that's the currency.

So, as long as this keeps being the case, I think it's responsible for depsy to keep caring about whatever thing it is that people do to be called "Author," regardless of how they made that happen.

But of course we want to push academia further along, not just empower the current system. Hence the fractional accounting powered by commit records (which is of course not without its own inaccuracies), which it sounds like everyone in this thread likes just fine. Yay for that.

Sounds like maybe we just differ a bit on how prominent that approach should be in the app. And our thinking for now is that it's more powerful, in 2016, to honor the Author role with all its attendant meaning but alongside that start to demonstrate to decision makers and working researchers that there could be a more nuanced, more responsive way, a (somewhat) more objective way to look at authorship as well. Hopefully this gives us a better chance to be part of real conversations, while still making our point loud and clear.

More on your specific points:

  1. Agreed, commit logs ftw. See above.
  2. Agreed, this is a weakness of trad authorship accounting and it deserves fixing. We're hoping Depsy will move things in that direction (see above). Although I'd be remiss to not note that the flexibility of politically-mediated Authorship can be an asset as well...for example, non-programmers may be hugely influential in a project, and the ability to collectively assign Author credit can help them get fair recognition. But that said, I agree: we must have more objective, code-based measures as well. See above.
  3. Agreed. We looked into gleaning signal from author list order, but it was just too discipline-specific. Solvable actually, but outside our scope for now. Lightly informing Author importance with commit records is a good idea, and we tried it out. It becomes hard to decide the weights (how much should "I'm on the Author list" contribute compared to "I made almost no commits" when it's a grad student? A PI who got the funding? A doughty pruner of and responder to bug reports?). So we just decided to keep it clean and simple for now, given that we can't adequately defend any given proportional admixture.
  4. True. In these cases, the listed org-author shares some credit with all the committers. So the committers (who are real people, not just an org) still get credit, and the org gets no more credit than the most prolific committer. IPython is a good example...the "IPython Development Team" only gets 22% of credit, with the rest going to the actual committers.
  5. Darn, this looks like a bug where we were not able to associate the PyPi project with its github repo...so Depsy doesn't actually know anything about the github committer at all. Alas, we have several of these...when the github page is not explicitly listed on PyPi we have to make a guess by searching GitHub for the name/code for that project, and it's not very precise.
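
One way to read point 4 (the org-author is capped at the share of the most prolific committer) is sketched below. This is a guess at the shape of such a rule, not Depsy's actual algorithm, and it makes no attempt to reproduce the 22% figure quoted for IPython; the function name `credit_with_org` and the weighting scheme are assumptions.

```python
def credit_with_org(commit_counts, org_name):
    """Share credit between an organizational author and its committers.

    Assumed rule for illustration: the org is treated as one extra
    contributor whose weight equals that of the top committer, so it can
    never receive more credit than that person.
    """
    weights = dict(commit_counts)
    top = max(weights.values(), default=0)
    weights[org_name] = top
    total = sum(weights.values())
    if total == 0:
        # No commit data at all: the listed org is the only known author.
        return {org_name: 1.0}
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical project with two committers and an org listed as author.
    print(credit_with_org({"ann": 6, "ben": 3}, "Example Dev Team"))
```

Under this assumed rule the org's share shrinks as more people commit, which matches the intent that real committers keep most of the credit.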

I like your idea of doing a bit of both worlds, I think we're on the same page there. We just think the best way to do that is to keep the Author credit totally separate from the commits-based credit system since there is a real impedance mismatch there. But I think it ends up being pretty similar in effect.

Thanks for the thoughts. These are super interesting issues with no One Answer for sure, and you are pointing out a lot of great stuff. And may well be totally right :)

ethanwhite commented 8 years ago

> Darn, this looks like a bug where we were not able to associate the PyPi project with its github repo...so Depsy doesn't actually know anything about the github committer at all. Alas, we have several of these...when the github page is not explicitly listed on PyPi we have to make a guess by searching GitHub for the name/code for that project, and it's not very precise.

Ah, a lot of my thinking was being influenced by looking at our project and khmer, which both suffer from this issue. Now that I'm looking at IPython I see that this is actually working in a way that I think is totally fine 😳. I.e., you've balanced how these two different sources of data contribute to "authorship credit". Sorry for the confusion.

In the last release we both added the GitHub link to the Home Page metadata field (hopefully that's the right one) and expanded the author list, so we should be all fixed the next time the index gets rebuilt.
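
For readers wondering how a tool might pick up that link, here is a minimal sketch that looks for a GitHub repository in a package's metadata. It assumes the metadata arrives as a dict shaped like the `info` object of PyPI's JSON API (with `home_page` and `project_urls` keys); the function name `find_github_repo` and the example URLs are hypothetical, and this is not how Depsy is known to do it.

```python
import re

# Matches https://github.com/<owner>/<repo>, with or without "www.".
GITHUB_REPO = re.compile(
    r"https?://(?:www\.)?github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE
)


def find_github_repo(info):
    """Return "owner/repo" if the package metadata links to GitHub, else None."""
    candidates = [info.get("home_page")]
    candidates.extend((info.get("project_urls") or {}).values())
    for url in candidates:
        if not url:
            continue
        match = GITHUB_REPO.search(url)
        if match:
            owner, repo = match.groups()
            if repo.endswith(".git"):
                repo = repo[: -len(".git")]
            return f"{owner}/{repo}"
    return None


if __name__ == "__main__":
    # Hypothetical metadata with the repo listed in the Home Page field.
    info = {"home_page": "https://github.com/example-org/example-pkg"}
    print(find_github_repo(info))
```

When neither field points at GitHub, the function returns `None`, which is the case where a tool would have to fall back to a less precise name search.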