mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

Simplify license headers in source files #11605

Closed: cbrnr closed this issue 1 month ago

cbrnr commented 1 year ago

Currently, our source files show individual authors as well as the license in a comment near the top, for example:

# Authors: Jaakko Leppakangas <jaeilepp@student.jyu.fi>
#          Robert Luke <mail@robertluke.net>
#
# License: BSD-3-Clause

I think this is not ideal for the following reasons:

  1. It is not trivial to keep the list of authors up to date. Since this information is available from Git anyway, it is also not necessary.
  2. I don't think we have always updated these lists, so currently this information is likely outdated.
  3. There's always the question of what kind of change(s) qualify an author to be added to that list.

I suggest that we replace our headers with the following license text across all source files:

# Authors: MNE-Python development team
# License: BSD-3-Clause

The exact wording can be changed of course (feel free to make suggestions), but the point is to get rid of individual authors.

drammock commented 1 year ago

I would like to ask if anyone can give concrete examples of why and which specific statistics are important to have.

This isn't about any one specific statistic being missing. It is about two intersecting larger problems:

Note that these two concerns can be in tension: a person from a marginalized / historically underrepresented community in CS / open source might benefit from the bias that code contributions are more valuable, if they can demonstrate their coding prowess and thus successfully overcome other people's bias that "people like them can't code".

cbrnr commented 1 year ago

I understand these problems, but what I don't understand is why this is tied to having a name in the header of the source file. It's what MNE happened to choose, but I don't think that this proves contributions/competence/worth etc. when we already have other ways to demonstrate this (GitHub, git, our changelog with contributors). I think having selected names in source headers does more harm than good, so we should try to improve on the current situation sooner rather than later.

drammock commented 1 year ago

I have personally heard from folks that they used the fact that their name was in the source file to prove the extent of their contributions in the context of a job hire or promotion interaction. This is not a hypothetical. Removing their name removes their power to do that.

cbrnr commented 1 year ago

I have personally heard from folks that they used the fact that their name was in the source file to prove the extent of their contributions in the context of a job hire or promotion interaction. This is not a hypothetical. Removing their name removes their power to do that.

OK. I think they could have proved it with some other measure as well. That's exactly my point: just because names in the source seem to have worked doesn't mean that there are no alternatives.

hoechenberger commented 1 year ago

I wonder if specifically for the act of job applications we could create a script that uses git to generate some stats PLUS allows for manual insertion of additional accomplishments / comments that can be added by a core team member. This information is then used to create a digitally signed certificate whose authenticity can be verified online. Just like the certificates of completion you can get via LinkedIn Learning. We could vouch for the contributions and skills of a contributor.
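A rough sketch of what such a stats-gathering script could look like, assuming it is run from a clone of the repository (the contributor email, the extra accomplishments, and the HMAC "signature" are illustrative placeholders; a real certificate would presumably use a proper asymmetric signature scheme, not an HMAC):

    # Sketch only: gather per-author commit counts from git and produce a signed summary.
    import hashlib
    import hmac
    import json
    import subprocess

    def contribution_stats(author_email):
        """Count the commits attributed to one author, as reported by git."""
        n_commits = subprocess.run(
            ["git", "rev-list", "--count", f"--author={author_email}", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return {"author": author_email, "n_commits": int(n_commits)}

    def sign_certificate(stats, accomplishments, secret_key):
        """Attach manually curated notes and an HMAC as a stand-in for a real signature."""
        payload = json.dumps({**stats, "accomplishments": accomplishments}, sort_keys=True)
        signature = hmac.new(secret_key, payload.encode(), hashlib.sha256).hexdigest()
        return {"payload": payload, "signature": signature}

    stats = contribution_stats("jane.doe@example.com")  # hypothetical contributor
    cert = sign_certificate(stats, ["mentored at a sprint"], secret_key=b"not-a-real-key")
    print(json.dumps(cert, indent=2))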

drammock commented 1 year ago

OK. I think they could have proved it with some other measure as well.

Why do you think that? Why not believe them when they say "this is what worked"?

hoechenberger commented 1 year ago

I wonder if specifically for the act of job applications we could create a script that uses git to generate some stats PLUS allows for manual insertion of additional accomplishments / comments that can be added by a core team member. This information is then used to create a digitally signed certificate whose authenticity can be verified online. Just like the certificates of completion you can get via LinkedIn Learning. We could vouch for the contributions and skills of a contributor.

I'd volunteer to draft such a document template

hoechenberger commented 1 year ago

OK. I think they could have proved it with some other measure as well.

Why do you think that? Why not believe them when they say "this is what worked"?

He was just suggesting that because one approach worked, it doesn't mean that another would not have worked. However, if we want to work like scientists (evidence-based), I'd also say: what has worked in the past has been shown to work, so why take the risk of switching to something else that might fail?

cbrnr commented 1 year ago

Why do you think that? Why not believe them when they say "this is what worked"?

I never said I didn't believe them. I just think that very likely other things might have worked as well. As for why we shouldn't just stick to what has worked in the past: I think we have now identified the reasons why the current situation is less than ideal.

hoechenberger commented 1 year ago

I wonder if specifically for the act of job applications we could create a script that uses git to generate some stats PLUS allows for manual insertion of additional accomplishments / comments that can be added by a core team member. This information is then used to create a digitally signed certificate whose authenticity can be verified online. Just like the certificates of completion you can get via LinkedIn Learning. We could vouch for the contributions and skills of a contributor.

I'd volunteer to draft such a document template

The more I think about it, the more I like this idea. Imagine you could produce a document that says you've worked on MNE and it bears the signature of a certain Alexandre Gramfort (if Alex is willing to play along 😅) This would provide a huge credibility boost, which, as was pointed out, might be especially beneficial to contributors from marginalized groups

jasmainak commented 1 year ago

I think it's not easy to fix the problem of credit assignment at the level of MNE. The person evaluating a certain application will use their own metrics. If it's a person in industry, they will see GitHub contributions favorably, but in academia, the majority still don't understand open source. If there are 20-30 authors/contributors, the evaluators have a hard time wrapping their heads around your contribution percentage. I don't think it's possible to put a number on this, and this is the core problem of open source credit assignment.

Job applications and grant or faculty hiring committees are one example, green card applications are another (which in turn has exponential impact on other stuff). The evaluator could be a medical doctor or an army veteran who has no idea who Alex is (famous as he is :))

cbrnr commented 1 year ago

@jasmainak the same argument applies to having a name in a source file header. It's all pretty random/arbitrary, so I am sure that we do not have to keep names for the sake of some committee probably wanting to take a look at some source file. I completely agree with you that credit assignment for open source work is a huge problem in academia (normally this doesn't count at all).

hoechenberger commented 1 year ago

@jasmainak

I think it's not easy to fix the problem of credit assignment at the level of MNE. The person evaluating a certain application will use their own metrics. If it's a person in industry, they will see github contributions favorably, but in academia, the majority still don't understand open source. If there are 20-30 authors/contributors, the evaluators have a hard time wrapping their head around your contribution percentage. I don't think it's possible to put a number on this and this is the core problem of open source credit assignment.

I think we're somewhat going in circles here. We have evidence that names in source files have worked for job applicants in the past. It was also argued that Git history may or may not work, depending on the job you're applying for. Lastly it was brought up that esp. for marginalized groups, even providing proof may not be enough (at least that's how I understood it) unless a respectable, trusted source vouches for these contributors. This is the specific issue I'd wanted to address with my proposal. I didn't mean to invalidate or suggest to get rid of any of the other approaches, but to introduce a new service (a certificate of contribution, if you will) for folks who believe it could help them land their next job / enter the next career stage. These certs would only be issued upon request, and the process to file the request needs to be designed such that it is super low-threshold (filling in a simple form on our website or so).

Job applications and grant or faculty hiring committees are one example, green card applications are another (which in turn has exponential impact on other stuff). The evaluator could be a medical doctor or an army veteran who has no idea Alex is (famous as he is :)

But at least there would be some reference, which can be important for job applications.

If I've worked at a company, I can put my old boss as a reference. If I've contributed to MNE-Python once, of course I could reach out to the core team and ask if somebody would be willing to act as my reference, but there's a huge barrier there. Issuing a certificate that credibly proves I've been involved with the project, and has the name and email address / contact details of a responsible person (core dev team member) on it, creates an aura of trustworthiness and credibility.

jasmainak commented 1 year ago

What I'm trying to say is that it depends on the job application and the level of the candidate. In some situations, it may be enough to show "some" contributions. In other kind of job applications, you have to come across as a major contributor. If you are a Masters student applying for a PhD or to a company, I can see how such a certificate or a recommendation letter may help. But in other situations, it is pretty unhelpful if the person evaluating cannot see that you made important contributions. I'm suggesting mindfulness to the fact that there is a diversity of situations to consider.

Being able to pinpoint yourself as a major contributor to specific submodules or sister packages of MNE will probably be more helpful in more situations than having your name buried in a list of 50 or even 10 names.

@hoechenberger to avoid going in circles :)

We have evidence that names in source files have worked for job applicants in the past. It was also argued that Git history may or may not work, depending on the job you're applying for.

yes, I agree with you here

Lastly it was brought up that esp. for marginalized groups, even providing proof may not be enough (at least that's how I understood it)

I just don't know if a generic certificate would be helpful in addressing this issue. Maybe it would help in some limited circumstances ... I would ask a person who might use it.

jasmainak commented 1 year ago

so I am sure that we do not have to keep names for the sake of some committee probably wanting to take a look at some source file

I can vouch that this actually does happen. I had a certain review committee dismiss some of my contributions to some other open source work based on my name missing from a website (which it wasn't, but whatever) ... I'm not talking in hypotheticals here; I really worry about someone going to the MNE website and either not finding the person's name and/or thinking the person is "just another contributor".

cbrnr commented 1 year ago

Well, then we can make sure to put something on our website so that this does not happen. My point is that names in source files are at least as flaky, given that there aren't even rules about who gets to be on that list.

cbrnr commented 1 year ago

And if this is really critical for someone, I think this person will find ways to show his/her contributions to the committee even when there are no names in source headers. It just involves a little work, but such things are so individual that I think this is best handled on a case-by-case basis instead of relying on our current source name bingo (sorry, but that's what it basically is right now).

hoechenberger commented 1 year ago

I can vouch that this actually does happen. I had a certain review committee dismiss some of my contributions to some other open source work based on my name missing from a website (which it wasn't, but whatever) ...

But the website isn't the source code either ... so in this case a cert like I proposed would've helped, or a page where all contributors are listed in alphabetical order, potentially with a link to that GH page that would list all commits by that user, which was suggested somewhere earlier in this thread (I cannot seem to find the posting right now), no?

drammock commented 1 year ago

a page where all contributors are listed in alphabetical order, potentially with a link to that GH page that would list all commits by that user, which was suggested somewhere earlier in this thread (I cannot seem to find the posting right now)

Perhaps you mean Mathieu's comment and/or my response to it

hoechenberger commented 1 year ago

a page where all contributors are listed in alphabetical order, potentially with a link to that GH page that would list all commits by that user, which was suggested somewhere earlier in this thread (I cannot seem to find the posting right now)

Perhaps you mean Mathieu's comment and/or my response to it

Yea, that's it!

britta-wstnr commented 1 year ago

instead of relying on our current source name bingo (sorry, but that's what it basically is right now).

I think the discussion never was about keeping the source file names forever but more about keeping them until we have a better system in place.

In principle, I like the idea of having this be a part of our homepage and especially extending it by things like the list @drammock created.

One of the things I am not so sure about yet is what @jasmainak mentioned as well: if your name is on a source file that you can easily link in an application, this (for better or for worse) shows you made a substantial contribution to a method. If we generate a report with all PRs, this is harder to see. Granted, there are PR titles etc., but this would mean the search committee has to look through the PR list, know what a "substantial" PR looks like, or believe the candidate when they say "PR x was substantial". Regarding this:

I think this person will find ways to show his/her contributions to the committee even when there are no names in source headers. It just involves a little work

This I would like to contest, as we cannot just assume that everybody has the privilege to be believed after putting in "a little work". The playing field unfortunately is not the same for everyone here.

Now, again, I don't think we have to keep the header as is forever, but I think signaling substantial contributions is one of the major "features" of source file names (biased as it might be right now) that we maybe do not translate well yet.

Another one is visibility: while users can stumble across names in source files and then connect a person's name to a certain method/MNE-Python, a report on a homepage needs to be produced to be visible.

And a last thought for now: if we report those data on our webpage, we probably also need a way to be contacted about it, whether that be for wanting to be erased from the build-report-button* or whether someone thinks the reported info is not correct (e.g. because they were part of one more sprint or something like this).

*while PR lists are public information, other things might not be, e.g. sprint mentoring/attendance and the like.

drammock commented 1 year ago

Just came across this: https://github.com/mntnr/name-your-contributors

posting it as an addition to the possibilities already mentioned in this and this previous comment.

larsoner commented 11 months ago

@agramfort, @drammock and I did a little bit of investigation. When you google authors (we tried @jasmainak and @britta-wstnr for example) you do not get links to source files, even if you go pretty deep in the results. You do (sometimes as the first result!) get links to examples/ and tutorials/. Based on this, the cumulative discussion and ideas above, and some additional discussion today with @britta-wstnr, @wmvanvliet, and @jasmainak (hooray sprint time making chatting much easier!) we came up with the following tentative plan.

  1. Remove names from mne/ source files, replacing them with:

    # Authors: MNE-Python developers <https://mne.tools/stable/credit.html>
    # License: BSD-3-Clause

    Source files are not easily found by Google, so they don't give great credit anyway. Also, sometimes people email the authors for help with code, which is a bug not a feature -- people should use standard communication channels (e.g., Discourse).

  2. Create a GitHub action that:

    1. Populates a doc/credit.rst that is roughly organized with public submodules as headings. Under each heading is a list of all authors of the files in that submodule (ordered by credit, probably by +/- lines over the history). This should be at least as discoverable / reportable for people who need it as having the names in the source files, since it will be on our webpage, and we can link to it directly from the front page somehow.
    2. Populates examples/ and tutorials/ Python files with all contributor names to a given example/tutorial. Since these files are already discoverable by search, we can easily keep it this way by preserving and extending/completing these automatically.

Hopefully this preserves credit "backward compatibility" sufficiently while improving things overall, especially in terms of needing to make decisions about what to include or exclude with each contribution/PR. Someone (me probably!) "just" needs to implement this stuff but really I don't think it will be too onerous having played around a bit in #11975 and with GitHub actions recently. And critically it's extensible and adjustable as needed, for example we can add sections on grant support and Discourse activity that we currently don't credit sufficiently (if at all).
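A rough sketch of how the per-path credit in step 2 might be computed (the submodule path is an example; ordering by added-plus-removed lines follows the "+/- lines over the history" idea above, and this is not the actual implementation from #11975):

    # Sketch only: rank authors of one path by total lines added + removed,
    # using `git log --numstat`. Run from the repository root.
    from collections import Counter
    import subprocess

    def credit_for_path(path):
        """Return authors of `path` ordered by +/- lines over the whole history."""
        out = subprocess.run(
            ["git", "log", "--format=author: %aN", "--numstat", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        credit, author = Counter(), None
        for line in out.splitlines():
            if line.startswith("author: "):
                author = line[len("author: "):]
            elif "\t" in line:
                added, removed, _ = line.split("\t", 2)
                if added != "-":  # "-" marks binary files in --numstat output
                    credit[author] += int(added) + int(removed)
        return credit.most_common()

    for name, lines in credit_for_path("mne/io"):  # example submodule
        print(f"{name}: {lines} lines touched")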

cbrnr commented 2 months ago

Meanwhile, Scikit-Learn has had a similar discussion, and they decided to phase out individual author names in source files: https://blog.scikit-learn.org/updates/authorship-info/

cbrnr commented 2 months ago

Do you think we could find a solution for the upcoming release? My simplified proposal is:

  1. Remove individual authors from source files (but not from examples and tutorials!).
  2. As @larsoner suggested, we can include a URL in the header that links to contributors. I wouldn't bother trying to create something super fancy; let's just move contributors from the bottom of our landing page to https://mne.tools/contributors.html for now. That way, we can always add more things to that page later. I know that's not the same level of detail, but I think we have to find a balance that doesn't put too much work on devs to implement, and I don't think we should wait another release.

larsoner commented 2 months ago

I think everyone is on board with (1) assuming there is some suitable replacement for the source-file credit, but I don't think (2) qualifies as a suitable replacement according to the previous discussions. I have some code somewhere sitting around for https://github.com/mne-tools/mne-python/issues/11605#issuecomment-1745771121 that I can try to polish, or at least open as a WIP PR. That way if someone is really motivated to work on this (sounds like you might be @cbrnr ?) then they could push it over the finish line.

cbrnr commented 2 months ago

I thought that in the light of the latest developments in Scikit-Learn, we might be able to reconsider (2). They don't implement such a highly specific credit system, and I don't know any other project that does. So I'm really proposing to follow what other projects are doing to avoid creating too much work for us devs.

drammock commented 2 months ago

in the light of the latest developments in Scikit-Learn, we might be able to reconsider (2). They don't implement such a highly specific credit system, and I don't know any other project that does.

I think any project that uses https://allcontributors.org/ or https://github.com/mntnr/name-your-contributors is probably doing a better job than either us or sklearn. So while I'm all in favor of looking to other members of our ecosystem for inspiration on ways we could improve, it's not clear to me that emulating sklearn is an improvement.

As I said in https://github.com/mne-tools/mne-python/issues/11605#issuecomment-1522436320:

I wonder if it would help to try to separate the question of "what would be optimal going forward?" from the question of "what would preserve or improve on the imperfect situation we've inherited?"

which @cbrnr agreed with (at the time at least). If we're going to do incremental steps, let's make sure they're steps forward from the (possibly marginalized/disadvantaged/unprivileged) contributors' point of view, not only from the current devs' maintenance-burden point of view. IMO, removing the names from the source files and also moving the contributor avatars from our homepage to some other page seem like two steps in the wrong direction (i.e., both reduce contributor visibility).

cbrnr commented 2 months ago

Then just keep the contributors on the main page? I feel like we already list all contributors in the release notes, and we specifically highlight new contributors, so I think there's no need for more complexity.

hoechenberger commented 2 months ago

Why not go all-in and write some tooling that extracts the contributors to each source file from Git and then adds all of them to the top of the file? We now have experience with CI-generated file changes with autofix.ci.
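A rough sketch of what such tooling could generate for a given file, based on `git shortlog` (the target file is just an example, and whether CI would rewrite files in place or only check them is left open):

    # Sketch only: build the author header that CI could prepend to a source file.
    import subprocess

    def author_header(path):
        """Return '# Authors: ...' comment lines for everyone who committed to `path`."""
        out = subprocess.run(
            ["git", "shortlog", "-sne", "HEAD", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        # Each output line looks like "    42\tJane Doe <jane@example.com>"
        names = [line.split("\t", 1)[1] for line in out.splitlines() if "\t" in line]
        header = [f"# Authors: {names[0]}"] if names else ["# Authors: MNE-Python developers"]
        header += [f"#          {name}" for name in names[1:]]
        header += ["#", "# License: BSD-3-Clause"]
        return "\n".join(header)

    print(author_header("mne/epochs.py"))  # example target file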

I still think there are much better ways to create contributor visibility, but we've embarrassingly been unable to reach a consensus in over 9 months.

FWIW MNE-BIDS just removed individual authors.

cbrnr commented 2 months ago

I don't like the idea of more tooling. Maybe this has been mentioned before, but why is

https://github.com/mne-tools/mne-python/graphs/contributors

not sufficient? It shows all contributors with commit-level granularity.

cbrnr commented 2 months ago

What some maybe fail to realize is that the current status quo is deeply unfair and inconsistent. So it doesn't really help to qualify all suggestions as steps in the wrong direction, because doing nothing is actually an even larger step in the wrong direction IMO.

hoechenberger commented 2 months ago

the current status quo is deeply unfair and inconsistent. So it doesn't really help to qualify all suggestions as steps in the wrong direction

I agree

sappelhoff commented 2 months ago

FWIW MNE-BIDS just removed individual authors.

to explain, I have removed them (with approval) for these reasons (which have all been mentioned above):

  1. inconsistent status quo (some authors missing on some files, not all files had an authors list, things got outdated ...)
  2. author list in files was sort of misleading (say, one person wrote 90%, the other 10%, yet both are listed as equal)
  3. a statement like "this is from US, all the contributors as a team" sends a nice and concise message (rather than, we work as a team, but this is only by people A, B, C), IMHO
  4. original goal of having individual names is better covered by other means (see below)

I find it very important that FOSS contributors receive credit for their contributions, and I believe this can be accurately tracked by existing tools (which have also all been mentioned above):

  1. git blame (and other git commands to inspect the git data)
  2. the changelog we publish on the website
  3. GitHub "tools" to inspect the git data, such as the contributors graph, or an author-specific commit history ... or even linking to PRs
  4. the CITATION.cff file
  5. the Zenodo archive (based on data from the CITATION.cff file)
  6. a CODEOWNERS file (we currently don't have that, but might add it in the future)
  7. the paper in JOSS (but this is actually problematic, because not all contributors are authors on the paper, only those who had contributed up to the time of publication)

☝️ picking one of these methods is bound to give a contributor the granularity at which they want to advertise their contributions. I also believe that at least a few of these methods are suitable for non-technically versed people who want to judge the amount of a person's contributions.

For the future, I wish that we could become better (at MNE-BIDS) at tracking different kinds of contributions (contributor roles) that are potentially not picked up by the git history (e.g., informal conversations at meetings, help in the user forum, offline design work that is then committed by someone else, ...). A promising direction in that regard is the potential inclusion of "contributor roles" within the CITATION.cff standard. Note that we already add authors to the changelog if they haven't committed but contributed in another way:

Manually add authors who have done a lot of work (reviewing, advising, ...) but don't show up in the shortlog because they did not submit PRs

I furthermore think MNE-BIDS would profit from having a dedicated "contributors" page, like the one MNE-Python has at the bottom of its landing page.

Please note that all that I mention above is with regards to MNE-BIDS, which is a much smaller code base with much fewer contributors, so we can't copy all arguments one to one to MNE-Python.

drammock commented 2 months ago

A real-time discussion with @britta-wstnr and @larsoner today yielded the following proposal:

  1. create a page on our website that, when called with a contributor's GH handle as a URL parameter, will dynamically create a summary of their contributions to MNE. Candidates for inclusion are:
    • number of merged PRs (better than number of commits because it avoids the imbalance created by switching to squash-merging around 2016 or so)
    • number of PR reviews
    • some summary of forum participation (exact metrics TBD)
    • leadership activity (should be API queryable from GH teams; or at least we can make it so that this is possible going forward)
    • participation in sprints or GSoC (easy enough to maintain a YAML or JSON of this data)
    • teaching of MNE workshops (again, manually-maintained JSON or YAML, but burden should hopefully be quite low)
    • obtaining grants or other funding
    • possibly other metrics, either now or later
  2. Each metric gets a sentence or two, contextualizing what is shown (e.g., whether the sprint had competitive admission like our training sprints, or was invite-only). We can discuss whether displaying ranking for each metric is helpful/informative context or not.
  3. The whole page gets some contextualizing info too (e.g., specifically mentions what we don't (or can't easily) track; mentions known biases in the metrics, etc).
  4. Info is pulled via a monthly cron job and dumped somewhere, and the dynamic page draws from that data source (the API queries will be too slow to do on every page load)
  5. target is to create and merge this shortly after the next release, so that we have plenty of time to fine-tune it before it becomes part of the stable docs site.
  6. until it's launched, we don't remove names from source files.

@larsoner and I are willing to lead the effort on building this, though @hoechenberger has indicated willingness to work on a "certificate of contribution" so hopefully this is close enough to that idea that he'd be willing to help out with this approach too.
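For the first metric, a rough sketch of the kind of query the monthly cron job might run (this uses the public GitHub search API; unauthenticated requests are heavily rate-limited, so in practice an auth token would be passed in a header):

    # Sketch only: count merged PRs per contributor via the GitHub search API.
    import json
    import urllib.request

    def merged_pr_count(username, repo="mne-tools/mne-python"):
        """Return the number of merged PRs authored by `username` in `repo`."""
        query = f"repo:{repo}+is:pr+is:merged+author:{username}"
        url = f"https://api.github.com/search/issues?q={query}&per_page=1"
        with urllib.request.urlopen(url) as resp:  # add an Authorization header in practice
            return json.load(resp)["total_count"]

    for user in ["larsoner", "drammock"]:  # example handles from this thread
        print(user, merged_pr_count(user))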

hoechenberger commented 2 months ago

@drammock I'm happy to work on a certificate template and also to help develop and deploy the API; I've gained some experience in both backend and frontend development, including cryptography, over the past couple of years.

let me know if we should connect for a call to discuss and assign tasks

Great to hear we're finally moving forward here!

cbrnr commented 2 months ago

Sounds like a plan! I'm fine with reporting merged PRs, but just to clarify:

number of merged PRs (better than number of commits because it avoids the imbalance created by switching to squash-merging around 2016 or so)

This means that after we switched to squash-merging, every merged PR equals a single commit, right? So technically, the imbalance would not affect any new contributors, just very old contributions, right?

drammock commented 2 months ago

after we switched to squash-merging, every merged PR equals a single commit, right? So technically, the imbalance would not affect any new contributors, just very old contributions, right?

Exactly right. Counting commits biases in favor of contributions prior to the squash-merge transition, and unevenly so (the bias is stronger if a person commits frequently, regardless of how complex or extensive the work was).