Simplify license headers in source files

cbrnr commented 1 year ago

Currently, our source files show individual authors as well as the license in a comment near the top, for example:

# Authors: Jaakko Leppakangas <jaeilepp@student.jyu.fi>
#          Robert Luke <mail@robertluke.net>
#
# License: BSD-3-Clause

I think this is not ideal for the following reasons:

It is not trivial to keep the list of authors up to date. Since this information is available from Git anyway, it is also not necessary.
I don't think we have always updated these lists, so currently this information is likely outdated.
There's always the question what kind of change(s) qualifies an author to get added to that list.

I suggest that we replace our headers with the following license text across all source files:

# Authors: MNE-Python development team
# License: BSD-3-Clause

The exact wording can be changed of course (feel free to make suggestions), but the point is to get rid of individual authors.

sappelhoff commented 1 year ago

This was also brought up by @drammock in https://github.com/mne-tools/mne-python/pull/9569#issuecomment-879969307 some time ago.

I generally agree. Do we need a copyright notice in each file? Isn't the license in the root of the repository enough, to cover it all?

cbrnr commented 1 year ago

Removing is even better. I don't think every file needs to have the license when there is a license in the root.

larsoner commented 1 year ago

I don't think every file needs to have the license when there is a license in the root.

I was originally on board with this until I read this stackexchange thread where people sometimes (usually?) generally recommend that you do include the license so that a copied-away file still makes it clear what the license is. So I'm +0 on removing the BSD 3-clause line, I'm fine with keeping it or removing it.

Last time the "authors" stuff came up, IIRC (I can't find the thread, maybe it was in person at a dev meeting!) most people thought "yes these should go away" until @agramfort pointed out that it's a nice way for newer contributors to get some credit for what they do. I think we conveged on getting rid of the authors in mne/ but keeping those lines in examples/ and tutorials/ because -- even though they, too, end up out of date -- it's nice for newer and/or less experienced people to have their name visible on something that people are more likely to read (i.e., an example or tutorial) than our whats_new.rst.

cbrnr commented 1 year ago

Good points @larsoner. Then I'm also +0 on removing the license line.

I'm also OK with removing authors only in mne/. If we keep them in examples/ and tutorials/, we should define what kind of changes warrant adding a new author. Are we OK with a typo fix? Or does it have to be more substantial? This discussion could be avoided if we remove authors in all sources, which I'd still prefer. We now highlight new authors in our changelog, which I believe we didn't do when @agramfort made this comment.

larsoner commented 1 year ago

We now highlight new authors in our changelog, which I believe we didn't do when @agramfort made this comment.

Looking back, looks like it was actually @rob-luke who advocated for keeping the examples/tutorial authors here in the same thread linked to before in July 2021 (I had just missed it). And back then we already were highlighting new contributors in the release April 2021 (and even if we didn't, I think the changelog is a lot less visible to end-users in general than examples/tutorials).

we should define what kind of changes warrant adding a new author. Are we OK with a typo fix? Or does it have to be more substantial?

I think people will generally be self-policing here and add their name if they think they did something worthwhile. I think we can take it case-by-case if something concerning does come up rather than trying to codify a policy here.

drammock commented 1 year ago

I think the changelog is a lot less visible to end-users in general than examples/tutorials

this is the point (originally made by @rob-luke as @larsoner mentioned) that convinced me to keep authors on tutorials/examples, even though I still dislike the fact that they can get stale/out of date/inaccurate.

I like the idea of removing authors from the source .py files within mne/ though. Anyone who is reading the source code I think it's fair to assume they at least know about changelogs, and many will know about git blame too. Plus GitHub shows avatars of all authors if you view the source on GitHub, so that's another way authors are visible for source files.

I'm +0.5 to keep the license lines.

agramfort commented 1 year ago

my take on this and why I think it's valuable to keep names:

having your name on a file gives a sense of code ownership and responsibility.
names on top of files survives github and where the code is stored.
I've been contacted by people to ask questions in sklearn code as they found my name and email on the files.
for some contributors it's important to them to have their name. There can be a sense of pride to have your name there and can maybe encourage them to contribute more.

my 2c

Message ID: @.***>

drammock commented 1 year ago

I've been contacted by people to ask questions in sklearn code as they found my name and email on the files.

I have too, but this is one reason I think it's bad to have the names there, because:

in general I think we want to discourage users from sending emails directly to specific devs --- questions/support requests should go through our preferred channels (discourse / GitHub) so that currently active maintainers will see the message.
Assuming names are in chronological order, if folks email the first name on the list they are in fact more likely to get someone who hasn't touched that code in a while.

having your name on a file gives a sense of code ownership and responsibility

Is this your personal feeling, or have contributors told you that they feel this way? (how many?) If we want to cultivate that feeling of responsibility, we could enforce / strongly encourage using the GitHub codeowner feature

for some contributors it's important to them to have their name. There can be a sense of pride to have your name there and can maybe encourage them to contribute more.

is the sense of pride from seeing your name in the file itself any greater than seeing your name on the associated commit(s)? I feel like we don't really know whether people are adding their names because they feel pride, or because they saw other names and assumed they were supposed to do it.

jasmainak commented 1 year ago

names on top of files survives github and where the code is stored.

I find this to be the most important reason to keep names ... when files are re-organized "git blame" stops working. E.g.: https://github.com/mne-tools/mne-python/blob/main/mne/report/report.py ... I don't see my name in the git commits

even though I might be able to answer questions related to the MNE-report better than the average contributor. For functionality that is old, this may be even more critical as it might be hard to dig up who originally wrote that part of the code.

drammock commented 1 year ago

ok, I withdraw my objection to names in the source files then. It's weird that you aren't listed by GitHub as a contrib @jasmainak; you show up in the blame of that file:

Screenshot 2023-03-28 at 15-12-04 mne-python_mne_report_report py at main · mne-tools_mne-python

drammock commented 1 year ago

So summarizing what I think is an emerging consensus (?):

It is not trivial to keep the list of authors up to date. Since this information is available from Git anyway, it is also not necessary.

There are social reasons (pride, fame/publicity, guilt/ownership) to keep them there.

I don't think we have always updated these lists, so currently this information is likely outdated.

The harm caused by outdated information is minimal / consists of inactive devs getting unwanted emails, and/or users having their emails go unanswered. I still think this could be avoided; maybe we include names but not emails?

There's always the question what kind of change(s) qualifies an author to get added to that list.

@larsoner suggests that the honor system / letting people decide for themselves is good enough here. I won't quibble with that.

So in the end maybe there is nothing to be done here @cbrnr ? Unless you want to push for removing just the emails and keeping the names... but IDK if that will be any less contentious.

cbrnr commented 1 year ago

I still think names in the source code only make sense if they are kept up to date. I rarely bother to put my name there, but now this makes it look as though I did not contribute. I can imagine that this is the case for others as well.

@drammock made some good points: people should not be looking at the source to find developers and ask questions. They should use our official channels. If these are not obvious, we can make them more obvisous in our documentation.

The sense of ownership and responsibility can be instilled by other means, and I think we should try to find something that works automatically. The spirit of open source is not that someone "owns" a piece of code, but we as a community have a shared responsibility for everything we create in MNE-Python. Putting some names in the sources seems to be biased and unfair unless we keep everything up to date all the time (which is not feasible).

Even if some previous contributor knows most about a specific topic I trust current developers to either answer the question or forward it to someone who they think will be able to help.

So I still think that removing names (but keeping the license) in mne/ (but keeping them in examples/ and tutorials/) is the better option.

agramfort commented 1 year ago

@cbrnr i would say add your name to the files you feel a clear ownership for. I don't want your name in the files then don't.

cbrnr commented 1 year ago

i would say add your name to the files you feel a clear ownership for. I don't want your name in the files then don't.

It's not about me. It's that I don't like it in general for the reasons I mentioned.

So this is settled? No chance that something can be done here?

cbrnr commented 1 year ago

Regarding the comment by @jasmainak, you show up in the blame of the file you linked to, so git blame does not stop working when files are re-organized. GitHub shows the most recent contributer per line. It's the list that's buggy, we should probably report that to GitHub. Because that file was renamed, you need to go to the previous file, which is linked in the list of commits.

But this file demonstrates the issue I have: @hoechenberger and @larsoner have contributed a lot, but they are not mentioned at the top of the file. Whether or not that's intentional is also beside the point, because if there is a list this indicates to others that these are the developers who contributed most and are responsible for that file. Which is simply not true.

cbrnr commented 1 year ago

Here's a shell command that prints out all contributors of a file:

git --no-pager log --follow --pretty=format:"%aN <%ae>" mne/report/report.py | sort | uniq

The important bit is the --follow option, which makes it work across renames.

If you want to preserve the order of contributions:

git --no-pager log --follow --pretty=format:"%aN <%ae>" mne/report/report.py | nl | sort -uk2 | sort -nk1 | cut -f2-

Dimitri Papadopoulos Orfanos <3234522+DimitriPapadopoulos@users.noreply.github.com>
Eric Larson <larson.eric.d@gmail.com>
Daniel McCloy <dan@mccloy.info>
Mathieu Scheltienne <mathieu.scheltienne@fcbg.ch>
Clemens Brunner <clemens.brunner@gmail.com>
Mathieu Scheltienne <mathieu.scheltienne@gmail.com>
Richard Höchenberger <richard.hoechenberger@gmail.com>
Alex Rockhill <aprockhill@mailbox.org>
Jeff Stout <jstout211@yahoo.com>
Marijn van Vliet <w.m.vanvliet@gmail.com>
Alexandre Gramfort <alexandre.gramfort@m4x.org>
Stefan Appelhoff <stefan.appelhoff@mailbox.org>
Rob Luke <748691+rob-luke@users.noreply.github.com>
Valerii Chirkov <42982039+vagechirkov@users.noreply.github.com>
Martin Schulz <dev@earthman-music.de>
Guillaume Favelier <guillaume.favelier@gmail.com>
Dmitrii Altukhov <dm.altukhov@ya.ru>
Alex Rockhill <aprockhill206@gmail.com>
Daniel McCloy <dan.mccloy@gmail.com>
Joan Massich <mailsik@gmail.com>
Jaakko Leppakangas <jaeilepp@gmail.com>
Mainak Jas <mainakjas@gmail.com>
Denis A. Engemann <denis.engemann@gmail.com>
Teon Brooks <teon@nyu.edu>
Teon Brooks <teon.brooks@gmail.com>
Jean-Remi King <jeanremi.king+github@gmail.com>
Mainak Jas <mainak.jas@telecom-paristech.fr>
Denis A. Engemann <d.engemann@fz-juelich.de>

hoechenberger commented 1 year ago

But this file demonstrates the issue I have: @hoechenberger and @larsoner have contributed a lot, but they are not mentioned at the top of the file. Whether or not that's intentional is also beside the point,

I'm okay with having the names of the original authors of a module / function listed on top. I don't care if I'm listed even if I contributed a lot, for as long as the commit history is preserved. Should we at some point be forced to kill the history, we should add all contributors to the top.

Makes me wonder about the planned transition to Black though, where we'll touch many, many lines of the code. I'm wondering if we could associate this gigantic commit with a "virtual" contributor / bot (instead of a real developer account). But this just as an aside.

cbrnr commented 1 year ago

I'm okay with having the names of the original authors of a module / function listed on top. I don't care if I'm listed even if I contributed a lot, for as long as the commit history is preserved. Should we at some point be forced to kill the history, we should add all contributors to the top.

But some (new) contributors might care, but not dare to ask to include their name because they might think that only original contributors or "worthy" contributions should go in this list.

That's exactly my suggestion, either we copy the full list to the header, or none. Everything in between is not fair.

Coming back to a point I thought we'd converged on (also in a previous discussion): I would leave authors in tutorials and examples, but not in sources. This was the consensus in the linked discussion in my opinion.

Re Black, it is possible to ignore revisions - is that not sufficient? GitHub does support it (https://docs.github.com/en/repositories/working-with-files/using-files/viewing-a-file#ignore-commits-in-the-blame-view).

britta-wstnr commented 1 year ago

I agree with @agramfort here: it can be valuable for some, especially junior, developers to have their name in the source files. In my opinion, this should be marked by "substantial contribution".

The reason why I think this is valuable is the following: having a name on a source file is an easy way to showcase that one has substantially contributed to a specific method etc. Recruitment committees of various disciplines will probably not use a command line tool or other git/GitHub functionality to figure out contributions. However, adding a link to a file where your name is up to is easy to do. Furthermore, people who look at the source code locally will make a connection with the names - thus increasing visibility for the contributors.

Now I agree with @cbrnr that the practice how this is done is disadvantaging contributors who are less inclined to just put themselves on the files. I wonder if what we actually need is guidelines what makes a contribution substantial enough to have your name on the file (I also see how this would always be an arbitrary cut-off).

I am against having all names up, as this would add a lot of text for potentially less substantial commits (e.g., having fixed a typo in the docstring etc.) and would decrease visibility.

Another possibility could be to outsource the contributor list to another file we keep on our homepage - that could fix it for my first point, but again not for the visibility.

hoechenberger commented 1 year ago

But some (new) contributors might care, but not dare to ask to include their name

Ok right, very valid point. I've been in this situation before and I never knew when it's justified to add my name.

drammock commented 1 year ago

git --no-pager log --follow --pretty=format:"%aN <%ae>" mne/report/report.py | nl | sort -uk2 | sort -nk1 | cut -f2-

If you can add this to a commit hook that automatically runs only on the changed files and then auto updates the author list (and deduplicates with our mailmap!) then I think that would be quite a good solution. Names who only made small typo fixes will still be in there but I'd rather err in that direction than to not include names who made large contributions

britta-wstnr commented 1 year ago

Mh @drammock I'd argue that then the name list become meaningless, no? For all points mentioned here: ownership, visibility, pride, ...

cbrnr commented 1 year ago

I'd argue that then the name list become meaningless, no? For all points mentioned here: ownership, visibility, pride, ...

But everyone who contributed has the same right to get ownership, visibility, etc.?

Since I wanted to avoid defining what an "important" contribution is, the best solution IMO is to remove author names from source files and try to add other means to provide these perks. For everyone.

britta-wstnr commented 1 year ago

Well, my point is that with all names on there, it would not work that way. Would someone feel ownership/responsibility if there are 15+ other names? Probably no. Would someone get visibility if there are 15+ other names? Probably also no.

But I do think that if someone made substantial changes to a part of the code base, they should be able to get recognition for that beyond having their name in a long list of names or not mentioned at all. At this point I also want to throw in that this might be especially important for marginalized groups in open source, who might not be just "believed" to have "really contributed".

sappelhoff commented 1 year ago

Having observed this a bit, I want to revise my opinion. I think we should:

use CODEOWNERS for "ownership" --> this allows people to get notifications when files are touched
use our changelog for "credit"/"attribution" --> after all it's more relevant to be able to point to a specific PR that you have implemented, rather than say "I worked on this file in general"
- I think what would be really awesome would be a way to filter the changelog by contributor ... something like: list me all contributions by @example-person across all MNE versions
- ideally, such a query would be associated with a specific URL that could be shared. Then I could send someone a URL and they could see all my contributions with links to the specific PRs.
git blame for specifics (the "black" commit can be filtered via a .git-blame-ignore-revs file, which does also work on GitHub, next to the CLI)
keep author names in tutorials and examples, AND write some loose guidelines (preferably examples) when a contribution is considered big enough to add a name there ... OR add a "author contributions" section to each tutorial/example ... something like the CREDIT taxonomy
remove names and emails from files, and only keep the license abbreviation there --> credit/ownership/blame is IMHO properly solved by the above mentioned items

cbrnr commented 1 year ago

I agree with @sappelhoff 100%!

hoechenberger commented 1 year ago

@britta-wstnr

But I do think that if someone made substantial changes to a part of the code base, they should be able to get recognition for that beyond having their name in a long list of names or not mentioned at all.

While I do agree, I also see the difficulty here that @cbrnr mentioned: how do we quantify that? How big or important is "substantial"? Changed lines of code (how many?)? Performance improvements (how much faster?)? Continuous maintenance of a module (for how long?)? Do the barriers we set here depend on seniority of the contributor (beginners -> lower, senior devs -> higher barrier?)?

There must be a clear path that contributors can follow to get their name added at the top-level. Otherwise it'll be a recipe for conflict and disappointment.

britta-wstnr commented 1 year ago

@hoechenberger as mentioned in one of my earlier comments, I agree with that as well. However, that also is a problem for tutorials and examples as @sappelhoff acknowledges and proposes to write guidelines for - i.e., we have to solve that anyway.

drammock commented 1 year ago

@britta-wstnr

But I do think that if someone made substantial changes to a part of the code base, they should be able to get recognition for that beyond having their name in a long list of names or not mentioned at all. At this point I also want to throw in that this might be especially important for marginalized groups in open source, who might not be just "believed" to have "really contributed".

I am very sympathetic to and supportive of these points. But until we have a proposal for how "substantial" is defined and adjudicated, I feel I cannot support inclusion/exclusion of any names --- either show all names, or show none.

I think several of us agree that letting the contributor decide whether or not to include their name isn't a very good solution: it disadvantages anyone feeling shy/humble/insecure and advantages anyone feeling arrogant/confident/entitled. But I also really don't want the mechanism to be "whoever is reviewing the PR decides" --- clearly there are differences of opinion among members of our steering council about how this should be handled, and whether a contributor gets credit in this way should not be determined by luck of the draw in who is available for PR reviews.

@sappelhoff

ideally, such a query would be associated with a specific URL that could be shared. Then I could send someone a URL and they could see all my contributions with links to the specific PRs.

This is already available: https://github.com/mne-tools/mne-python/commits?author=sappelhoff

keep author names in tutorials and examples, AND write some loose guidelines

See above point; "loose guideline" does not, IMO, eliminate the bias toward advantaging some contribs and disadvantaging others, and it adds overhead for devs to adjudicate (or at least "keep an eye out" for whether names were (not) added (in)appropriately).

cbrnr commented 1 year ago

Examples and tutorials are typically touched by far fewer individuals, so it should not be a problem to include all names. Even when someone "just" fixes typos (BTW I think that fixing typos is extremely valuable). So our guideline should be to include everyone who modified an example or tutorial.

cbrnr commented 1 year ago

There's also https://github.com/all-contributors/all-contributors if that's something you would want to implement. I'm not sure I like it that much, but it's one way to increase our focus on contributors.

britta-wstnr commented 1 year ago

To make my position clearer:

I do not think that fixing typos is "not valuable". But I think we need to find some way to give people credit who spend a lot of their time on open source work. And in a way that has little overhead for them and is easy to understand by assessment committees etc.

I fully agree that the way we do it now is not fair.

However, just erasing the names does not fix that. Adding all names does not fix that either. --Let me be a bit more specific: if we add all names, it will again be the ones that feel less confident who will suffer. They will not feel like they can put their work in a CV (or elsewhere), because they have nothing to back it up with. After all, the names now just mean you touched a file at some point.

So in my opinion, we need to fix the fairness of the name lists (or come up with an entirely different system). Because dropping them/putting all names is not fair either.

hoechenberger commented 1 year ago

I suppose the easiest objective measure is the number of lines a contributor touched... Can be absolute or relative (percent of a module)

jasmainak commented 1 year ago

My understanding of the names on the top was that it is not just for visibility, but also responsibility. Those who put their names at the top of the file take some level of responsibility for the code having substantially contributed to it and understood how it works. For instance, I would be uncomfortable putting my name in bem.py even if I may have refactored some functions ... I may not be able to help someone figure out how the code relates to the equations in Munck's paper. Presumably, by the time someone is making such substantial contributions, they would not be shy adding their name. For small and new contributions, the changelog is always there.

cbrnr commented 1 year ago

@jasmainak that's what codeowners is for.

jasmainak commented 1 year ago

Codeowners is for the entire repo or file-level? Would you know who to ask if something is not clear in some part of a particular file?

cbrnr commented 1 year ago

It's file-based. You specify patterns to assign code to a person.

hoechenberger commented 1 year ago

Would you know who to ask if something is not clear in some part of a particular file?

Honestly in cases like this I'd just open an issue and maybe tag the dev(s) who touched that code last, as reported by git blame

hoechenberger commented 1 year ago

I feel we're digressing. I believe we didn't really mean to discuss things like, which commits belong to whom and who is responsible for which subsystem / module – all stuff that's handled for us via git and GitHub; but we wanted to find ways to acknowledge contributors. Let's try to focus on this aspect! :)

Edit: Let me correct: I feel that I'm digressing 😅

jasmainak commented 1 year ago

git and github provide powerful tools for tracking commits, modules etc. But at the end of the day, these tools cannot track everything. I have known of cases where someone contributed during a pair programming session and their contribution had to be acknowledged in other ways ... yes, you can use co-authored commits but in some cases, these people may not even have a github account since they are senior researchers. As Britta also pointed out above, committees may not be so git-savvy and may only look at the header of the files superficially. I think we need to strike a balance between being easy for developers vs easy for users/non-developers in understanding who is contributing and/or responsible for the code.

cbrnr commented 1 year ago

Can we move this discussion forward? It seems to me a good compromise between completely removing all authors in source files vs. keeping a list of (selected) authors is to:

keep authors in examples and tutorials and remove them in other sources.

What is everyone's opinion on this suggestion (react with :+1: and :-1: or reply if you have more to add).

drammock commented 1 year ago

There was discussion of this a couple weeks ago at a dev meeting. Unfortunately nobody took notes, so I'll do my best to reconstruct the conclusions reached:

the current way things are done is both inaccurate and unfair
simply removing the names (from either source files or examples/tutorials) risks doing some harm to folks whose names are there (i.e., if they rely on their name being there as public demonstration of their contributor efforts).
until we have a proposal that does not risk such harm, no action should be taken.

cc @britta-wstnr @agramfort @larsoner who I think were the ones present for that discussion; please correct me if I'm mistaken.

cbrnr commented 1 year ago

I already showed a technical way to extract all authors for a file using git. So this information is there if anyone needs it. Can we make a list of persons who require this information? If there are not too many, maybe individual custom solutions are feasible (like generating a report of contributions). Are there other suggestions?

britta-wstnr commented 1 year ago

I already showed a technical way to extract all authors for a file using git.

We have discussed above why this is not necessarily a solution everybody can or will take (e.g. committees).

cbrnr commented 1 year ago

We could create that solution (a report) for them.

mscheltienne commented 1 year ago

A summary/report of the contribution made by someone is available on GitHub as pointed by @drammock

This is already available: https://github.com/mne-tools/mne-python/commits?author=sappelhoff

But maybe it could be an addition to the website with a search-like tool. It would make this summary more accessible and shareable.

drammock commented 1 year ago

maybe it could be an addition to the website with a search-like tool. It would make this summary more accessible and shareable.

hosting such an interface on our website would also allow us to curate / customize what metric(s) we choose to display. Some possibities (besides number of commits):

number of PR reviews
number of issues reported
sprint or GSoC mentorship
IDK if Discourse has an API but if they do we could potentially pull in forum stats too
Getting funding to support the project

Separately, I wonder if it would help to try to separate the question of "what would be optimal going forward?" from the question of "what would preserve or improve on the imperfect situation we've inherited?" My hope in suggesting this is that we could reach some consensus about what is future-optimal, and then add to it an appropriate "patch" to try to minimize whatever harm it would do if we ignored the imperfect current/past state of affairs. Thoughts?

hoechenberger commented 1 year ago

IDK if Discourse has an API

They do, https://docs.discourse.org/

cbrnr commented 1 year ago

I like the suggestion to separate the optimal solution from incremental improvements. Otherwise, we might not implement anything at all for a long time. Again, I would like to ask if anyone can give concrete examples of why and which specific statistics are important to have. This would greatly help us reduce efforts in the spirit of YAGNI.

A simple and quick (but maybe still good) improvement could be to archive all author headers in a common file, so that this information is still visible for anyone who needs it. This could be a simple text/Markdown file demonstrating contributions from particular authors (those who happened to be in the file headers) to particular files. Then we could remove authors from sources (but keep and encourage them in examples/tutorials).

sappelhoff commented 1 year ago

A simple and quick (but maybe still good) improvement could be to archive all author headers in a common file, so that this information is still visible for anyone who needs it.

don't we have something like that here? https://github.com/mne-tools/mne-python/blob/909e45821bd004b6072a6426e4e8ea88e51b72ad/doc/changes/names.inc#L1

mne-tools / mne-python

Simplify license headers in source files #11605