nayafia opened this issue 8 years ago
I think a good indicator is the % of commits by 1 or 2 people, as well. In our case, we have an active repo, but it's mainly the same few people doing the work, without a larger community helping out in the code.
On the systems side (packaged software installed using rpm/deb), I feel like this is quite doable: the package metadata already records dependencies.
For the JS ecosystem taken in isolation, something using npm as the source of the dependency hierarchy would probably work OK.
In general, I think using the downstream importance implied by the package hierarchy is key. This would surface cases like OpenSSL, which is depended on by practically everything and so would have a huge numerator. And while OpenSSL's dedicated support of one paid(ish) developer is high by the standards of most small projects, it's really low compared to its ecosystem importance, so hopefully it would come out as an interesting case.
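To make that concrete, here's a minimal sketch of the ratio being described; the function name and example numbers are mine, purely illustrative:

```python
# A minimal sketch of the "importance vs. support" ratio described above.
# Both inputs are assumptions: in practice dependents_count would come from
# a package index's dependency graph, and active_maintainers from commit
# history over, say, the last 12 months.

def keystone_score(dependents_count: int, active_maintainers: int) -> float:
    """Higher scores suggest a project many others rely on but few people tend."""
    return dependents_count / max(active_maintainers, 1)

# Hypothetical numbers for illustration only:
print(keystone_score(dependents_count=50_000, active_maintainers=1))  # 50000.0
print(keystone_score(dependents_count=300, active_maintainers=12))    # 25.0
```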
As with all BS metrics, we'd have to actually calculate it to find out whether it works.
To identify dependencies, check out Libraries.io; it harvests data from package managers to map library dependencies.
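For example, a lookup against the Libraries.io API might look roughly like this (assuming you have an API key, and assuming the project endpoint exposes a `dependents_count` field; check the current API docs before relying on this):

```python
# A rough sketch of pulling dependent counts from the Libraries.io API.
import requests

API_KEY = "your-libraries-io-api-key"  # placeholder

def dependents_count(platform: str, name: str) -> int:
    url = f"https://libraries.io/api/{platform}/{name}"
    resp = requests.get(url, params={"api_key": API_KEY}, timeout=10)
    resp.raise_for_status()
    # Field name assumed from the project endpoint's JSON response.
    return resp.json().get("dependents_count", 0)

# e.g. dependents_count("rubygems", "bundler")
```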
Perhaps a few example projects which are considered 'good' and a few which are considered 'endangered' could help drive the initial discussion.
Also take into account that the base information needed to construct a good metric may not be readily available (yet).
It seems there could be a multi-dimensional metric that captures both importance (in terms of dependents) and risk/endangered status. The following relates to the risk side, along the lines of @ericholscher's comment. The "bus factor" does not necessarily indicate that a project lacks developers; rather, it indicates where knowledge of a project is highly concentrated, exposing it to risk. For some projects, the loss of a single developer may have serious impact.
There is an automated tool (https://github.com/SOM-Research/busfactor) that can assess a project's bus factor based on an analysis of the project's Git (not necessarily GitHub) repo. There is a link to a demo at the end of this post: http://modeling-languages.com/whats-bus-factor-software-project/ And here is a conference paper describing the project: https://www.researchgate.net/publication/272824568_Assessing_the_Bus_Factor_of_Git_Repositories The paper also refers to an implementation of the concept for SVN repos, where the risk factor was called the "truck factor." (I have no connection with the authors.)
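For a much cruder approximation than the tool above (which does a finer file-ownership analysis), you can ask how many top committers it takes to cover half of all commits; something along these lines:

```python
# A crude bus-factor approximation from commit counts alone: how many of the
# most prolific committers does it take to account for half of all commits?
import subprocess
from collections import Counter

def bus_factor(repo_path: str, coverage: float = 0.5) -> int:
    authors = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%an"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    counts = Counter(authors)
    total = sum(counts.values())
    covered, factor = 0, 0
    for _, n in counts.most_common():
        covered += n
        factor += 1
        if covered >= coverage * total:
            break
    return factor  # 1 or 2 here suggests knowledge is highly concentrated
```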
GitHub is not open source, so just as many people still don't use Gmail, some projects do not use GitHub. If we can only measure popularity via (closed-source) GitHub, the open source model is somewhat broken or hijacked, and we risk erasing those independent projects and replacing them with corporate ones.
OpenHub has a lot of data on open-source projects, and they already do risk profiles. If you asked nicely they might construct an endangered project metric.
Thanks everyone for the great suggestions! @madnificent after thinking about some examples I think @ericholscher's point is probably most important: low (like 1-2) # contributors with a high # dependents = endangered, to me. Sounds like the bus factor supports this.
Figuring out dependencies is still not clear to me though. Using libraries.io, bundler has 24.7k dependent repos but pip has 96 and setuptools has 159? How does that make sense?
Side note: I think I prefer the analogy of "keystone species" to "endangered" because endangered can imply that a project is dying/on its way out, which is not the connotation I'm going for. A keystone species is a species that an ecosystem relies disproportionately on, relative to its population size, which makes it especially worth protecting (if it disappears, a lot of other things disappear, too). So a project with a ton of dependents but maintained by 1-2 people = keystone
Edit: updated original post to reflect this
> Using libraries.io, bundler has 24.7k dependent repos but pip has 96 and setuptools has 159? How does that make sense?
setuptools and pip usually don't appear in lists of dependencies. Instead, Libraries.io would need to search the files of every Python project for import statements pointing to setuptools' modules, and for command lines that look like they're calling pip.
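A scan of that sort might look roughly like the following sketch (illustrative only; it misses dynamic imports, `setup.cfg` usage, and pip invocations from shell scripts):

```python
# Grep a project's Python files for imports of a package that never shows
# up in dependency lists, as described above.
import re
from pathlib import Path

IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+setuptools\b", re.MULTILINE)

def uses_setuptools(project_dir: str) -> bool:
    for path in Path(project_dir).rglob("*.py"):
        try:
            if IMPORT_RE.search(path.read_text(errors="ignore")):
                return True
        except OSError:
            continue  # skip unreadable files
    return False
```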
@Changaco hmm. that makes it hard to use libraries.io's dependencies metrics as an apples-to-apples comparison of importance then, right?
Team lead of Open Hub here; take our metrics! All the data on Open Hub is available through our API. We have numbers on committers, commits, summaries of the last 30 days and the last 12 months, trends (increasing or decreasing activity), etc.
I'm intrigued by this "endangered project metric" idea as well. Also, we're working on new statistics about vulnerability data.
Regarding download and dependent-count metrics, there's also a big difference to take into account between projects that are utilities other projects depend on and projects that are end-user tools (dev tooling, monitoring, daemons, package managers). Dependencies get included in way more projects than end-user tools, so considering download metrics alone would not paint an accurate picture.
FWIW, the packages on npm's front page are packages people install manually (`npm install this-package`) rather than dependencies pulled in behind the scenes.
A couple of other metrics to consider:

- Google Trends can be an option for understanding a project's popularity.
- IRC channel activity will give you an interesting metric for how much support is given to users.

Potential issues with the other metrics proposed:

- Some projects are not hosted on GitHub but are very important.
- Some projects don't have just one repo but several, a few on GitHub and maybe the main one on a private Git server.
- Data on GitHub can be easily faked.

Some human decision is ultimately needed.
> hmm. that makes it hard to use libraries.io's dependencies metrics as an apples-to-apples comparison of importance then, right?
Dependency metrics should be pretty reliable for everything besides build tools, and you can rate those manually because there aren't many of them. For Python's setuptools, I think you can assume that all projects on PyPI depend on it. Other PyPA projects like pip aren't used by every project, but they're just as important.
> OpenSSL has no replacement in the wings.
There were at least two alternatives even before Heartbleed, and new ones since then.
This may come across as a bit nitpicky, but that's not how it's intended; rather, I think clarity in the language used to discuss these topics will help people's understanding of them.
> projects with heavy dependencies relative to number of maintainers [emphasis added]
Shouldn't the term used here be "dependents" rather than "dependencies"? That is, here you mean "a lot of projects depend on [this project] relative to the number of people working on [this project]", right? For instance, Packagist uses this term, and shows stats for packages there; here's the dependents page for phpunit.
I'm not sure that we can reliably rank projects purely by quantitative metrics. Looking at programs like Google Summer of Code, Stripe Open Source Grant, and many others, they're not based on metrics (though metrics don't hurt).
I think it's best to treat the metrics mentioned here as thresholds: it's safe to assume that a project with no stars/downloads/etc. is not an important project yet. But 200 stars vs. 2,000 stars vs. 20,000 stars may very well indicate equal or even inverse importance. The example I always use: ssh-chat has 2x the stars of urllib3, but probably 1/100,000th the users.
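As a sketch, that suggests using metrics only as a pass/fail filter before human review, never as a ranking; all cutoffs and numbers below are arbitrary placeholders:

```python
# Metrics as thresholds, not rankings: filter out clearly-unimportant
# projects, then hand everything that clears the bar to humans.

def passes_threshold(stars: int, downloads: int, dependents: int) -> bool:
    """True if a project clears the bar for human review; never used to rank."""
    return stars >= 50 or downloads >= 10_000 or dependents >= 100

# Hypothetical numbers: a popular toy project vs. a critical library.
print(passes_threshold(stars=4_000, downloads=5_000, dependents=3))          # True
print(passes_threshold(stars=1_500, downloads=90_000_000, dependents=60_000))  # True
# Both pass; the point is that neither stars nor downloads should rank them.
```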
I imagine that the best way to move forward is on a proposal-based system, akin to enhancement/improvement proposals commonly done for open source projects (such as Python's PEP). Maybe we can call it a Project Assistance Proposal (PAP). To qualify, the project would need to exceed some minimum metric threshold, then put out a call for some specific genre of assistance (financial or otherwise) and outline a plan for utilizing the assistance. Basically a grant proposal.
I want to toss in a counter-intuitive notion.
Point #1 might seem problematic. We're trying to identify important projects that are at risk of stagnation. I think in practice there are a few kinds of stagnated projects: feature-complete (TeX), forked (Python's PIL), or no longer useful. Devs using an unmaintained but high-value project usually get commit access or fork it.
Finding a project that's valuable but isn't already being supported, then, would mean looking for projects with large numbers of commits (or forks) made in off hours. Of course, determining what off hours are for any particular committer may be difficult, but maybe someone has a clever idea; a rough sketch follows.
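One rough approach: git stores each author's own UTC offset, and `git log --date=format:` renders timestamps in that recorded timezone, which is at least a proxy for the committer's local clock. A sketch, with the 9-to-18 working window being an assumption:

```python
# Fraction of commits made outside working hours, using each author's
# recorded timezone as a stand-in for their local clock.
import subprocess

def off_hours_ratio(repo_path: str, start: int = 9, end: int = 18) -> float:
    """Fraction of commits made outside start-end (author-local hours)."""
    hours = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ad", "--date=format:%H"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if not hours:
        return 0.0
    off = sum(1 for h in hours if not (start <= int(h) < end))
    return off / len(hours)
```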
I believe @shazow is spot on.
@tnorthcutt great point! updated.
@wesc that's a very interesting metric (commits during working hours), though as you said harder to measure.
@shazow I think you're right, it seems like there's no obvious way to do this in a purely quantitative manner. And given the volume of projects right now (there are a lot, but we're not talking tens of thousands here), it seems reasonable to evaluate them qualitatively. I think having some sort of criteria helps me think through which factors to consider on a high level, and I feel like we've got some decent consensus on that, which is good.
@PeterDP I've gone through the Open Hub API Terms of Use and one part troubles me:
> You agree not to: [....] Combine or aggregate analysis, ratings, rankings or synthetic metrics created and reported by the Sites with data from your Application or from other sources to create composite metrics. Synthesized or collected Content obtained through the API Feature must stand on its own. [....]
That's exactly what's discussed here, isn't it?
I was playing around with the APIs that were mentioned here and some other public data sources over the weekend and found it extremely hard to rank projects in a way that makes sense.
How could you possibly compare an open source operating system like Debian to a CSS framework? Or a task scheduler like Celery to an icon font collection like Font Awesome?
Font Awesome has almost 10 times more stars on GitHub than Celery. How on earth would you tell a dumb ranking algorithm which is more important?
We need brains for that, at least for the big and middle-sized stones.
Need a hot-or-not game for OSS projects that devs can play while their npm modules download and their native code compiles.
Related to #9, how do we rank which projects are "important"?
@karissa and I discussed this last month, and I did some research on the Linux Foundation CII's census rankings. They use a point system to rank projects. I'd like to come up with a similar system rather than relying on user votes; a sketch of what that could look like follows the criteria list below.
My main interest is identifying projects that are currently "keystone species": projects with many dependents relative to number of maintainers, which makes them especially worth protecting. I think theoretical projects with high potential should be ranked separately.
**Criteria for identifying keystone species**

- **Popularity**
- **Language**
- **Dependencies**: I think this is the hardest to measure. How could we get this data? Other ideas?
- **Activity**
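Here's a sketch of what a CII-census-style point system over these criteria could look like; every weight and threshold below is a made-up placeholder, not the CII formula:

```python
# An illustrative point system over the criteria above. All weights and
# cutoffs are assumptions for the sake of discussion.

def keystone_points(dependents: int, contributors: int,
                    commits_last_year: int, downloads: int) -> int:
    points = 0
    if dependents > 1_000:
        points += 3       # many downstream projects (Dependencies)
    if contributors <= 2:
        points += 3       # knowledge concentrated in 1-2 people
    if commits_last_year < 12:
        points += 2       # low activity despite importance (Activity)
    if downloads > 100_000:
        points += 1       # widely used (Popularity)
    return points         # higher = more "keystone", more worth protecting
```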
Feedback welcomed on the above. Anything I'm missing?
And, anyone want to take a stab at criteria to evaluate theoretical projects for potential impact?