Some additional thoughts (leverage what you will)
I have a whitepaper I'm working on that involves a lot of npm analysis.
There have been 28 million npm package releases (this is all packages times all versions). Of all those releases, 16 million (that is not a typo, 16 with six zeroes) have one maintainer.
It's very easy to argue that more maintainers is better, but if we want to start pushing a narrative that one maintainer is bad, there's no way to find the millions of developers needed to fill in the gaps.
I would score this as more than one gets you bonus points, but only one doesn't detract.
I disagree, we should base the definition off of what we believe healthy should be, not the reality of what exists today. Greater than one maintainer is just table stakes for a healthy project/community.
I think your comment on what we believe is the key to this.
Most of what the community thinks about these data points is based on conjecture. It's easy to imagine the number of maintainers is important, as are the number of GitHub stars, forks, or commits. But how important are they? I don't think we really know.
The next conclusion humans come to is "let's just pick some numbers then adjust as we figure out what's correct", but that leads to bias in the research and reporting. To be honest, it's unlikely we can look at any of this data without bias at this point; most of us are in too deep. An unbiased third party is almost certainly needed.
I would love to see the OpenSSF conduct some research on the various signals that exist and how important they are or are not. The scorecard project is doing some of this, but I'm not familiar enough with them to know if the work would stand up to scientific rigor.
For anyone interested, here's what the graph looks like for the number of npm maintainers for released packages, limited to 1-100 maintainers (there are some projects with more than 100, but it just makes the graph hard to read).
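In case anyone wants to reproduce a rough version of this, here is a minimal sketch of the idea (not the actual pipeline behind the graph, which worked from a full registry dump). It assumes you already have a list of package names, and relies on the public npm registry packument at `https://registry.npmjs.org/<name>`, which includes a `maintainers` array; the function names and the sample list are purely illustrative.

```python
# Rough sketch: maintainer-count distribution across a set of npm packages.
# Assumes `package_names` is already populated (e.g. from a registry dump).
from collections import Counter

import requests

REGISTRY = "https://registry.npmjs.org"


def maintainer_count(package_name):
    """Return the number of maintainers listed in the package's packument, or None."""
    resp = requests.get(f"{REGISTRY}/{package_name}", timeout=30)
    if resp.status_code != 200:
        return None
    return len(resp.json().get("maintainers", []))


def distribution(package_names):
    """Histogram: maintainer count -> number of packages with that count."""
    counts = Counter()
    for name in package_names:
        n = maintainer_count(name)
        if n is not None:
            counts[n] += 1
    return counts


if __name__ == "__main__":
    sample = ["lodash", "express", "left-pad"]  # illustrative only
    print(distribution(sample))
```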
"More than 1 maintainer" is better even without deep scientific analysis. That one maintainer may die or lose interest.
However, a vast number of OSS projects are single maintainer, and it's not really something the maintainer can fully control - you have to convince others to join. So "1 maintainer project" is a higher risk, but it's not automatically an indicator that the project is poorly run.
I agree that better scientific rigor would be a really good idea. That is hard & time-consuming. We should strive for that long-term, but in the short term at least have multi-stakeholder view with multiple communities until we get better data.
There are indeed many npm packages with a single maintainer, but as far as I know that's generally considered to be a problem, and experience shows that there is definitely a risk associated with depending on such packages.
A quick addition: the OpenSSF Best Practices badge requires multiple developers at the gold level, but not the passing or silver levels. So many projects don't have multiple developers & it's not really under the developers' full control, so we focused instead on things that developers can control. Also, I think there's an argument that if a project takes steps to do other things better (add automated tests, tools, etc.), they're more likely to attract added developers.
I am curious about the maintainer metrics w.r.t. top/critical packages, like what the maintainer count looks like for the top 10k/50k packages (the long tail is usually unused, hobby projects). Also, I agree that single-maintainer projects are super hard to secure, so we can't recommend them in any reasonable way OR they shouldn't be used in any critical applications (single maintainer = no code review, could lose interest / become malicious at any point, slow response / bottleneck, etc.)
I love this discussion. I also want to add that having at least one name or handle in the CODEOWNERS.md file may not be enough; healthy can also mean "attentive". That is different from "active", as active often implies multiple commits. If a project has done the thing it needs done, and determines there is no need for additional features, we won't see a lot of commits from them. Maybe a refactor here or there and some dependency updates, but it's the fact that they are attentive to the project that gives us a view on their health potential.
I think it's also worth looking at some of this with respect to the end user. Suggesting best practices is always worthwhile, but it's also about getting information into the hands of downstream users to help them better protect their own users.
I might not care if a blog I go to uses a random library with a single maintainer.
I probably care if my online banking application uses a random library with a single maintainer.
If a lot of regulated industries realized how much of their consumer-facing code (e.g. JavaScript in the browser) depended on single-maintainer projects, they would be quite scared for all the reasons everyone listed above.
It should be noted a lot of these single maintainer libraries are often not used directly by end user facing projects. They are often included somewhere in the supply chain of larger multi-maintainer frameworks sponsored by large organizations.
@inferno-chromium Here's the graph of the top 5% of packages (20,000 downloads or more), and only showing the latest released version instead of all possible versions
It still skews VERY heavily to one maintainer
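For completeness, here is a hedged sketch of how a downloads cut like "20,000 or more" could be applied before counting maintainers; it uses npm's public downloads endpoint (`https://api.npmjs.org/downloads/point/<period>/<package>`), and the period and threshold are assumptions that may not match the exact cut behind the graph.

```python
# Sketch: keep only packages above a download threshold before tallying maintainers.
import requests

DOWNLOADS_API = "https://api.npmjs.org/downloads/point/last-month"
THRESHOLD = 20_000  # assumed threshold/window; the graph's cut may differ


def monthly_downloads(package_name):
    resp = requests.get(f"{DOWNLOADS_API}/{package_name}", timeout=30)
    if resp.status_code != 200:
        return 0
    return resp.json().get("downloads", 0)


def popular_packages(package_names):
    """Keep only packages at or above the download threshold."""
    return [name for name in package_names if monthly_downloads(name) >= THRESHOLD]
```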
Thank you @joshbressers. This is definitely scarier than what I thought. For the many packages where one maintainer will continue to be the case, maybe there is a case for maintainer reputation/credibility.
I really want us to be mindful of our wording here.
This isn't "scary", this is how open source works, and it runs the world. Open source certainly works.
Many of us have incorrect assumptions about open source and when those assumptions are broken, the human brain goes into crisis mode. Some of us will double down on our faulty assumptions, some of us sob quietly. Some of us try to find someone to blame (you can blame me, it's OK).
The job of the TAC should be to step back and try to be mindful and objective.
It's pretty clear that single maintainer open source is the norm, not the exception, even for top projects. It's also pretty obvious that more maintainers solve some continuity problems. We should pick some minimum expected maintainers number for projects to graduate out of the sandbox, but I do not think we should claim the number of maintainers denotes "health".
I strongly believe too many of our security metrics are "things smart people think are a good idea" instead of objective data.
For what it's worth the wording "healthy number" comes from the CNCF project lifecycle but I see @joshbressers's point about not needing to use a term that might be seen as offensive or disconnected from reality by some.
I would, however, like to point out that the issue at hand isn't how we would judge the many open source projects that are out there. The issue at hand is whether, for a project to enter incubation within the OpenSSF, we should require a minimum number of maintainers. For that matter, I don't foresee any of the projects living within the OpenSSF having only one maintainer.
@annabellegoth2boss proposes in PR#103 to change that requirement to the following for entering incubation: "Projects must have a minimum of three maintainers with a minimum of two different company affiliations."
And to the following for graduating: "Projects must have maintainers with a minimum of three different company affiliations."
As a member of the CNCF TOC I can assuredly say that while we have this requirement, a lot of our projects that reach graduation also demonstrate the higher maturity and governance needed to support it. That being said, we still have room to grow here: a number of maintainers is great, but can the project sustain itself if one of them leaves?
Part of my confusion with this in the charter is: if the OpenSSF defines "healthy", does it intend to only apply this to projects it brings in? Or will it also apply it to how it evaluates projects in the dashboard? In my mind these are two very separate and distinct items. What is healthy for a foundation may look entirely different than what is healthy for a simple function library.
This is what I want to disentangle and make sure we are clear on.
I really appreciate the numbers @joshbressers has brought in! I agree that single-person OSS projects are very common, even for projects that many depend on.
However, I believe that npm is an especially extreme case. The npm ecosystem strongly encourages extremely small projects, much smaller than other ecosystems. As a result, there are more npm modules (because it may take ~20 modules to do the work of one in other ecosystems) and there's a stronger tendency to have single-person projects (because each one has much less code, reducing the need for more people). Harvard's Census II had to deal with this problem; it notes that "... small packages are extremely common in the JavaScript npm package system. For example, in npm, 47% of the packages have 0 or 1 functions, and the average npm package has 112 lines of code. In contrast, the average Python module in the PyPI repository has 2,232 lines of code. These two factors caused the dependency calculations to crowd out non-JavaScript packages". If we accept using LOC as a very rough estimator, that means the average Python package has 20x the functionality of the average npm package (yes, LOC is a terrible estimator for this, but the difference is so stark that I think the effect is real).
I'd be curious to know what the numbers are for other ecosystems. Anyone willing to take a stab?
If someone can point me in the direction of any PyPI data, I can totally do this (I really like data)
libraries.io has data. It has problems, & I don't know if it can answer this question.
@westonsteimel pointed me to https://github.com/sethmlarson/pypi-data for the data
It's basically the same as npm.
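If someone wants to run the same kind of tally against that PyPI dump, a sketch along these lines might work; note that the table and column names below are placeholders rather than the dataset's actual schema, so check the pypi-data README and adjust before running anything.

```python
# Sketch: maintainer-count distribution from a local PyPI metadata dump (SQLite).
# The schema below is hypothetical -- verify against the dataset before use.
import sqlite3
from collections import Counter


def pypi_maintainer_distribution(db_path):
    """Histogram of maintainer counts per package, assuming one row per
    (package, maintainer) pair in a hypothetical `maintainers` table."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT package_name, COUNT(*) FROM maintainers GROUP BY package_name"
        ).fetchall()
    finally:
        conn.close()
    return Counter(count for _name, count in rows)
```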
We should probably stop trying to assume anything about any of this data moving forward. It would be lovely if the OpenSSF could create some programs to continuously track a bunch of metrics at scale. It would let us understand the current environment and also see if anything is changing.
Another angle I was thinking about that might potentially be interesting to explore is to attempt to correlate the maintainer data from the packaging ecosystem with that of the code repo. You could try to figure out answers to questions like: are there people with the ability to publish packages who are no longer active with the project? Or, conversely, are there active folks who don't have the ability to publish packages? Etc.
I agree with this
I tried to get some of that info from GitHub and I couldn't figure it out (looking at GitHub data was where I landed before finding that PyPI data). We may need some help from the upstreams to construct such things, but I also imagine that under the OpenSSF umbrella that can be accommodated.
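For anyone who wants to poke at this, here is a rough sketch of the first step: pulling the npm maintainer list and the GitHub contributor list for a package so the two can be compared. It uses the packument's `maintainers` and `repository` fields plus GitHub's public contributors endpoint; the hard part, which this deliberately skips, is that npm and GitHub usernames are separate namespaces, so matching identities reliably needs more than string comparison.

```python
# Sketch: gather the raw lists needed to compare npm publish rights vs. repo activity.
import re

import requests


def npm_maintainers_and_repo(package_name):
    """Return (npm maintainer usernames, 'owner/repo' on GitHub if declared)."""
    data = requests.get(f"https://registry.npmjs.org/{package_name}", timeout=30).json()
    maintainers = [m["name"] for m in data.get("maintainers", [])]
    repo_field = data.get("repository") or {}
    repo_url = repo_field.get("url", "") if isinstance(repo_field, dict) else str(repo_field)
    match = re.search(r"github\.com[/:]([^/]+/[^/.]+)", repo_url)
    return maintainers, (match.group(1) if match else None)


def github_contributors(owner_repo):
    """List GitHub contributor logins (unauthenticated, so heavily rate-limited)."""
    resp = requests.get(f"https://api.github.com/repos/{owner_repo}/contributors", timeout=30)
    return [c["login"] for c in resp.json()] if resp.status_code == 200 else []


if __name__ == "__main__":
    maintainers, repo = npm_maintainers_and_repo("express")  # illustrative only
    contributors = github_contributors(repo) if repo else []
    print("npm maintainers:", maintainers)
    print("GitHub contributors (top):", contributors[:10])
```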
This entire conversation has been fascinating to read.
Trying to get another person to commit to being a co-maintainer can be difficult. It would be unfortunate for that to kill an OSSF project. A potential compromise would be requiring at least one other code reviewer to review PRs by the primary maintainer; they wouldn't really be a co-maintainer, more of a quality checker. The upside is there's someone to sign off on changes. The downside is it can significantly slow development, especially in the beginning.
Thoughts?
Here's my viewpoint, perhaps some others will agree.
From a user's point of view, an OSS project with multiple maintainers has on average a lower risk than a project with only one. If nothing else, if one becomes incapacitated, the project can continue more easily. We can argue "how many" but 2 is much better than 1. Even better would be a project that actually has a second reviewer for every commit (in practice that requires an even bigger team).
However, many OSS projects are single-maintainer. A maintainer of a project can't suddenly cause another maintainer to appear; in some ways that is outside the maintainer's control. But a maintainer can prepare a project so that it's easier to join, maintain, etc. So following best practices to encourage adding future maintainers is something desirable.
To the extent we can get real data & analysis to gain insight into this, great! There will always be things that we don't know from such processes, so we have to use expert analysis & opinion when no better information is available. I really appreciate Josh's diving into the data here.
My current thoughts on this come in two parts.
First, this data shows the open source ecosystem we have today. I think there was an assumption from us all that it didn't quite look like this. This is what we have, and it's not changing without some sort of forcing function.
Secondly, there is the open source ecosystem we would like to see in the future. Every project having multiple reviewers and maintainers would be amazing. If this is something we want to see, we need to find a way to make it happen. It will be a lot of hard work and require vast resources I suspect. This would also take resources from other endeavors.
Is this a problem we should solve? If yes, how would we rank this on the list for resource allocation?
I've not seen this particular issue on any of the lists of things to solve to date (I mean, we didn't really think it was a problem to solve, so I get that)
Again, I don't think because many open source projects only have one maintainer we should allow this to be the case for an OpenSSF project. Being part of OpenSSF provides a project with a level of support (services, resources, marketing and so on) which justifies setting a higher bar.
Now, @joshbressers brings a good point. Besides documenting this aspect of many open source projects out there so that people can choose to use them or not, knowingly, is there more we should do as part of the OpenSSF mission?
I will add, in response to @JLLeitschuh, that to me reviewing PRs is one of the primary responsibilities of maintainers. I know that not everyone agrees with that point of view, but I think it's essential given that only they can review and merge.
There have been 28 million npm package releases (this is all packages times all versions). Of all those releases, 16 million (that is not a typo, 16 with six zeroes) have one maintainer.
How many of those have the same maintainer? E.g. what number of packages effectively have only a fraction of a maintainer? When I looked, there were many maintainers with multiple packages to their name.
The data here is really interesting, but not really a surprise.
I also don't think it's really required for us to answer the original question of how we define a healthy number for projects seeking to join the OSSF. I can't see us accepting a project with one maintainer; we even already ask for multiple vendors, IIRC. It's also never come up, so it's hard to really decide without a clear example.
For general "health", I'm still of the opinion that multiple maintainers are required for a healthy project. I could be convinced that one primary maintainer is fine as long as there are multiple individuals that have enough permissions to keep the project going in the event the primary one is no longer available to work on the project. I've seen too many examples of projects hosted out of personal github accounts getting stuck when the owner of the account goes unresponsive to ever be comfortable calling that a healthy long term situation.
That one maintainer may die or lose interest.
There are many projects with multiple maintainers where the loss of a specific maintainer would cause serious problems for the project, so focusing on the solo-maintainer aspect as a high risk kinda ignores the complexities of project operations. It's also worth thinking of "maintainer" as potentially a corporate entity and not just a single-human entity. There are many projects where the maintainer is, effectively, a corporation, and if that corporation changes direction or fails (happens often!), then the project is effectively abandoned too.
"Commit bit" is access to the code but not necessarily the knowledge of that code, nor knowledge or access to perform non-code activities with the project, such as publishing releases, project-specific comms accounts like a project twitter account, etc.
I think of maintainers as having two main things with respect to risk and continuity: one, access privileges to make changes and perform any project duties (to the code, issue tracker, website, documentation system, release publishing, etc.); and two, the knowledge enabling one to use that access successfully.
You can reduce the risk of a "one maintainer" project under the OSSF if there's a safety net for those two things. Perhaps a condition of incubation (for projects of any size?) is providing fallback "access" (to make changes to any/all project resources). Second, writing down the knowledge typically unique to maintainers, like how to contribute, how to publish releases, or other kinds of non-development activities. As a metaphor, think of it like a small business continuity plan where we remembered to give you the keys to the building "just in case" but forgot to provide access to the bank account which allows you to pay employees and vendors -- or even knowledge of which bank is used or how payroll was operated.
@david-a-wheeler This is most interesting, thanks
The way the data is presented is confusing to me. I am assuming the >1 metric also includes all the other groupings below it in the chart, which I think isn't the most useful way to present this data. It seems like the magic number is somewhere above 10 contributors, but that could also just be due to a huge drop off in number of projects.
I found the freshness metric to be the most interesting. That feels like it could be a useful metric for quality.
If you haven't done so, this should probably be shown to the Best Practices WG, as I think it shows the importance of using code as a dependency vs. code as a copy.
This question is answered in the Project Lifecycle with the following:
- For Sandbox: "Projects must have a minimum of two maintainers with different organization affiliations."
- For Incubation: "Projects must have a minimum of three maintainers with a minimum of two different organization affiliations."
- For Graduated: "Projects must have maintainers with a minimum of three different organizational affiliations."
Are projects the same as repositories? Can a project span multiple repositories? Can there be a single maintainer of a repository as long as the repository is a part of a larger project?
Projects can certainly have several repositories. This is very common in my experience. As far as the project lifecycle is concerned, given that the criteria are defined at the project level, I would say that it is possible to have a single maintainer on a given repository, although that's clearly never desirable.
An important question that does not seem to be mentioned above is, what percentage of single-maintainer projects/packages survive after that one maintainer is no longer able to support it? Does someone else pick up the project/package and provide the basic maintenance needed to keep it working? Do a variety of users that depend on the abandoned project/package make the changes they need and submit them to the main repo? Or is the package abandoned and a replacement created as quickly as possible (by another single maintainer)?
One would argue that if a project/package has a PASSING or SILVER OpenSSF Best Practices Badge (and if the project/package was not too large) then any competent developer who knows the general domain well should be able to safely do basic maintenance on that project/package to address their needs. So that project/package would be, in a sense, community maintained (which I used to call "self-sustaining software").
Is there any research/data on the above question?
I would argue it doesn't matter that much because:
1) Someone who wants to maintain it can fork it.
2) Someone who wants to maintain it can apply to the namespace owner to take ownership of it; I did this in Fedora with the php openid stuff, for example (I simply needed it to work and got tired of hand patching it on each install).
3) I've seen VERY healthy single-maintainer projects, and I've seen not-as-healthy projects with a lot of people; the number of people is correlated to health, but nowhere near as closely as things like release velocity or "how often do they accept outside code" and so on.
4) The reality is the vast majority of open source is 1 maintainer or less (e.g. one maintainer doing multiple projects).
1) someone who wants to maintain it can fork it
This is the use case where someone makes modifications for themselves to be able to keep using an abandoned package in their own projects. But this does not impact the official version of the package fetched and installed by a system like npm, nor its ultimate sustainability. (Multiple incompatible forks of the same package is usually not a good thing.)
2) someone who wants to maintain it can apply to the namespace owner to take ownership of it, I did this in Fedora with the php openid stuff for example (I simply needed it to work and got tired of hand patching it on each install)
This seems to be the ideal situation. My question is, how often does this happen?
3) I've seen VERY healthy single maintainer projects
But how does a single-maintainer project do peer review? External contributions will be reviewed by the single maintainer, but what about the changes made by the single maintainer? Who reviews those? (I guess that is why code_review_standards is a GOLD-level item?)
This is the problem I am having with one of my own packages (i.e. finding someone to do detailed, deep semantic reviews for the changes I am making).
"Open Source is Bigger Than You Can Imagine" by Josh Bressers has data about the number of maintainers per project. In short, most NPM project repos have 1 maintainer. See the post for details.
Projects can certainly have several repositories.
Yes, presuming "project" is just the normal English meaning (which is quite general). However, it's not unusual to use the word "project" in a narrower sense where a project's current state is necessarily represented by a single repo, and thus multiple repos imply multiple projects. Humans are not very good at being precise :-).
Kurt said:
I would argue it doesn't matter that much because: 1) someone who wants to maintain it can fork it 2) someone who wants to maintain it can apply to the namespace owner to take ownership of it...
It matters because of dependencies. If FOO stops being maintained, typically there's little indication of that, nor is it obvious that FOO2 is the "correct" fork that replaces it. So projects will look and see that "we're using the latest version of FOO" not realizing that FOO is long abandoned. You can apply to the namespace owner, but the owner can't easily know if you're trustworthy, and the owner might not even be alive.
"Open Source is Bigger Than You Can Imagine" by Josh Bressers https://anchore.com/blog/open-source-is-bigger-than-you-imagine/ has data about the number of maintainers per project. In short, most NPM project repos have 1 maintainer. See the post for details.
Minor correction: <1. A lot of people maintain more than one package. Saying a package has one maintainer, I think, falsely gives people the impression that someone is working on it full time. During the XML hash-DoS stuff, one core library had one person who basically replied "I have a free weekend in 2 months, maybe", to which we replied "Cool. We'll do all the work now and whatever help you need to ship an updated version today if possible."
someone who wants to maintain it can fork it
It matters because of dependencies. If FOO stops being maintained, typically there's little indication of that, nor is it obvious that FOO2 is the "correct" fork that replaces it.
A situation like this happened years ago when we contracted with Kitware to extend CMake and the Google Ninja build tool to support Fortran (which needs to create intermediate binary module files generated while compiling `*.f90` files that are used by downstream `*.f90` files, which complicates the build process and the parallel dependency orderings between `*.f90` files). At the time, the Ninja maintainers were hesitant to accept the changes to support this, so Kitware had to maintain a fork of Ninja with the Fortran functionality for about 4 or 5 years. Having this functionality in a Kitware fork of Ninja turned out to be a pretty big impediment for some projects because they could not get Ninja from the main Google-controlled repository. (Thankfully, the maintainers of Ninja eventually accepted the patches from Kitware because they were also needed to support building C++20 projects using C++20 modules.)
So yea, there really needs to be one official version (or sequence of versions) of every package in order to sustain a large package ecosystem. Forks are a nightmare in that case.
That's a bit of a different scenario, though - the time when forking is ideal is when a package is dead because it has no active maintainers (whether that's due to one, or more than one, people bailing on the role), and when the project is owned by a foundation, it's easier to handle that scenario, because the foundation can take over and grant rights to new maintainers.
The riskiest thing for the ecosystem is single-maintainer projects that aren't in a foundation - having "single maintainer" be a barrier to foundation entry only increases net risk.
What is healthy in regards to this? Will this concept of "healthy number" also be reused as a criterion for the dashboard or criticality score? If not, why not?
_Originally posted by @TheFoxAtWork in https://github.com/ossf/tac/pull/95#discussion_r866106321_