How to rank projects? - Githubissues

nayafia commented 8 years ago

Related to #9, how do we rank which projects are "important"?

@karissa and I discussed this last month, and I did some research on Linux CII's census rankings. They use a point system to rank projects. I'd like to come up with a similar system vs. relying on user votes.

My main interest is identifying projects that are currently "keystone species": projects with many dependents relative to number of maintainers, which makes them especially worth protecting. I think theoretical projects with high potential should be ranked separately.

Criteria for identifying keystone species

Popularity

- Is the project on GitHub?
- Number of GitHub stars
- Number of GitHub forks

Language

- Older languages are more risky

Dependencies

Think this is the hardest to measure. How could we get these data? Other ideas?

- Total number of downloads
- Number of downloads in last 12 months
- Number of times required in other repositories
- Number of mentions in other projects' readmes

Activity

- Number of pull requests submitted in last 12 months
- Number of issues opened in last 12 months
- Number of commits in last 12 months
- Number of contributors in last 12 months

Feedback welcomed on the above. Anything I'm missing?

And, anyone want to take a stab at criteria to evaluate theoretical projects for potential impact?

ericholscher commented 8 years ago

I think a good indicator is % of commits by 1 or 2 people, as well. In our case, we have an active repo, but it's mainly just the same people doing work, without a larger community helping out in the code.
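
A minimal sketch of that indicator, assuming you already have one author name per commit (for example from `git log --format=%an`). The function name and the sample data are hypothetical, for illustration only.

```python
from collections import Counter


def top_contributor_share(commit_authors, top_n=2):
    """Return the fraction of commits made by the top_n most active authors."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common() sorts authors by commit count, highest first.
    top = sum(n for _, n in counts.most_common(top_n))
    return top / total


# Hypothetical commit history: one entry per commit.
authors = ["alice"] * 70 + ["bob"] * 20 + ["carol"] * 6 + ["dave"] * 4
share = top_contributor_share(authors, top_n=2)
print(f"{share:.0%} of commits come from the top 2 contributors")
```

A share close to 100% on a repo with lots of commits would be the "active but concentrated" case described above.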

pramsey commented 8 years ago

On the systems side (packaged software installed using rpm/deb) I feel like this is quite doable:

For each package, measure "popularity" (btw, using GitHub alone is going to miss a lot of systems-level projects that are really important but are developed using svn or other infra).
Then traverse the package hierarchy for something like CentOS base + EPEL or Ubuntu + extras, and calculate for each package the total popularity of the package plus all of its dependent packages. This gives you an "ecosystem importance" for the project; that's the numerator.
Finally, work out the activity (# committers and commits in past 12 months) and use that as your denominator.
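
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the package graph, popularity numbers, and activity counts are made up, `dependents` maps each package to the packages that directly depend on it, and "activity" is reduced to a single number per package.

```python
def ecosystem_importance(package, popularity, dependents):
    """Sum the popularity of `package` and of every package that depends on it,
    directly or transitively."""
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # count each package once, even with diamond dependencies
        seen.add(current)
        stack.extend(dependents.get(current, ()))
    return sum(popularity.get(name, 0) for name in seen)


def importance_per_activity(package, popularity, dependents, activity):
    """Numerator: ecosystem importance. Denominator: recent activity (at least 1)."""
    denominator = max(activity.get(package, 0), 1)
    return ecosystem_importance(package, popularity, dependents) / denominator


# Hypothetical data, not real measurements.
dependents = {"openssl": ["curl", "nginx"], "curl": ["git"], "nginx": [], "git": []}
popularity = {"openssl": 10, "curl": 50, "nginx": 40, "git": 100}
activity = {"openssl": 2, "git": 50}  # e.g. committers in the past 12 months

for name in ("openssl", "git"):
    print(name, importance_per_activity(name, popularity, dependents, activity))
```

With these made-up numbers, the low-level library with few committers scores far higher than the popular end-user tool, which is the kind of case the ratio is meant to surface.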

For JS ecosystem taken in isolation, something using NPM as the source of hierarchy would probably work OK.

In general, I think using the downstream importance provided by the package hierarchy is key. This would surface cases like OpenSSL, which is depended on by practically everything, so would have a huge numerator. And while OpenSSL's dedicated support of one paid(ish) developer is high by the standards of most small projects, compared to its ecosystem importance it's really low, so hopefully it would come out as an interesting case.

As with all BS metrics, actually calculating it would be required to determine if it doesn't work.

AmrEldib commented 8 years ago

To identify dependencies, check out Libraries.io. They harvest data from package managers to identify libraries' dependencies.

madnificent commented 8 years ago

Perhaps a few example projects which are considered 'good' and a few which are considered 'endangered' could help drive the initial discussion.

Also take into account that the base information needed to construct a good metric may not be readily available (yet).

nickpowersys commented 8 years ago

It seems there could be a multi-dimensional metric that indicates the importance in terms of dependencies and the risk/endangered status. The following relates to the risk side, along the lines of the comment by ericholscher. The "bus factor" does not necessarily indicate that a project lacks developers, but indicates where knowledge of a project is highly concentrated, exposing it to risk. For some projects, the loss of just a single developer may have serious impact.

There is an automated tool (https://github.com/SOM-Research/busfactor) that can assess a project's bus factor based on an analysis of the project's Git (not necessarily Github) repo. There is a link to a demo at the end of the following: http://modeling-languages.com/whats-bus-factor-software-project/ Here is a conference paper description of the project: https://www.researchgate.net/publication/272824568_Assessing_the_Bus_Factor_of_Git_Repositories They also refer in the paper to an implementation of the concept for SVN repos, where the risk factor was called the "truck factor." (I have no connection with the authors.)
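
For a rough feel of the idea, here is a simplified proxy. It is not the algorithm used by the tool linked above; it just assumes the bus factor can be approximated as the smallest number of authors whose commits together exceed half of all commits. The function name and the 0.5 threshold are assumptions for illustration.

```python
from collections import Counter


def bus_factor_proxy(commit_authors, threshold=0.5):
    """Smallest number of authors whose combined commits exceed `threshold`
    of all commits. A crude stand-in for knowledge concentration."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk authors from most to least active until the threshold is passed.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total > threshold:
            return rank
    return len(counts)


# Hypothetical histories: one entry per commit.
concentrated = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
spread_out = ["a"] * 30 + ["b"] * 30 + ["c"] * 20 + ["d"] * 20
print(bus_factor_proxy(concentrated), bus_factor_proxy(spread_out))
```

Commit counts are only a stand-in for knowledge of the code, so a per-file analysis like the one described in the paper would give a more faithful picture.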

techtonik commented 8 years ago

GitHub is not open source, so just as many people still don't use Gmail, some projects do not use GitHub. If we can only measure popularity with (closed source) GitHub, this means the open source model is somewhat broken or hijacked. And we risk erasing those independent projects and replacing them with corporate ones.

Mathnerd314 commented 8 years ago

OpenHub has a lot of data on open-source projects, and they already do risk profiles. If you asked nicely they might construct an endangered project metric.

nayafia commented 8 years ago

Thanks everyone for the great suggestions! @madnificent after thinking about some examples I think @ericholscher's point is probably most important: low (like 1-2) # contributors with a high # dependents = endangered, to me. Sounds like the bus factor supports this.

Figuring out dependencies is still not clear to me though. Using libraries.io, bundler has 24.7k dependent repos but pip has 96 and setuptools has 159? how does that make sense?

nayafia commented 8 years ago

Side note: I think I prefer the analogy of "keystone species" to "endangered" because endangered can imply that a project is dying/on its way out, which is not the connotation I'm going for. A keystone species is a species that an ecosystem relies disproportionately on, relative to its population size, which makes it especially worth protecting (if it disappears, a lot of other things disappear, too). So a project with a ton of dependents but maintained by 1-2 people = keystone

Edit: updated original post to reflect this

Changaco commented 8 years ago

Using libraries.io, bundler has 24.7k dependent repos but pip has 96 and setuptools has 159? how does that make sense?

setuptools and pip usually don't appear in the list of dependencies. Instead libraries.io would need to search the files in every python project for import statements pointing to setuptools' modules, and command lines that look like they're calling pip.
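
As an illustration of the import-scanning half of that idea, here is a sketch using Python's standard `ast` module. It is not something libraries.io is known to do; the function name is hypothetical, and it only handles absolute imports in source that parses cleanly.

```python
import ast


def imports_module(source, module):
    """Return True if the Python source imports `module` or one of its submodules."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]  # absolute "from X import ..." only
        else:
            continue
        if any(n == module or n.startswith(module + ".") for n in names):
            return True
    return False


# A typical setup.py shape, written inline for the example.
setup_py = "from setuptools import setup\nsetup(name='example')\n"
print(imports_module(setup_py, "setuptools"))
```

Detecting command lines that call pip would need a separate text search over scripts and CI configs, which this sketch does not attempt.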

nayafia commented 8 years ago

@Changaco hmm. that makes it hard to use libraries.io's dependencies metrics as an apples-to-apples comparison of importance then, right?

PDegenPortnoy commented 8 years ago

Team Lead of Open Hub here; take our metrics! All the data on the Open Hub is available through our API. We have numbers on number of committers, number of commits, summary of last 30 days and last 12 months, trends (increasing or decreasing activity), etc.

I'm intrigued by this "endangered project metric" idea as well. Also, we're working on new statistics about vulnerability data.

SBoudrias commented 8 years ago

About download/# of dependents metrics, there's also a big difference to take into account between projects that are utilities other projects depend on and projects that are end-user tools (dev tooling, monitoring, daemons, package managers). Dependencies get included in way more projects than end-user tools, so considering only download metrics would not give an accurate picture.

FWIW, npm front page packages are packages people install manually (npm install this-package, as opposed to packages pulled in behind the scenes as a dependency).

robkinyon commented 8 years ago

A couple other metrics to consider:

Importance to the community
- Number of projects that depend on it (primarily for infrastructure).
- You have to look at the project's released items in their releasing infrastructure. Rubygems or CPAN, not Github.
- Number of tutorials and blog posts written in the past N years (primarily for things like ORMs)
- This may be a second-order metric, after a first pass cut is performed.
Activity in related IRC channel(s)
- Maybe also consider Twitter?
Potential replacements
- OpenSSL has no replacement in the wings.
- Rails, on the other hand, has multiple projects that could grow to replace it.
- Of course, nothing matches feature-by-feature, so that might be a consideration.

Potential issues with other metrics proposed:

Most projects aren't downloaded from GitHub, but are downloaded from rubygems, cpan, pypi, etc.

nemesifier commented 8 years ago

Some projects are not hosted on GitHub but are very important. Some projects have more than one repo, with a few on GitHub and maybe the main one on a private git server. Data on GitHub can be easily faked.

Google trends can be an option to understand the popularity of a project.

IRC channel activity will give you an interesting metric regarding how much support is given to users.

Some human decision is ultimately needed.

Changaco commented 8 years ago

hmm. that makes it hard to use libraries.io's dependencies metrics as an apples-to-apples comparison of importance then, right?

Dependencies metrics should be pretty reliable for everything besides build tools, and you can rate those manually because there aren't so many of them. For python's setuptools I think you can assume that all projects on PyPI depend on it. Other PyPA projects like pip aren't used by every project but they're just as important.

Changaco commented 8 years ago

OpenSSL has no replacement in the wings.

There were at least two alternatives even before Heartbleed, and new ones since then.

tnorthcutt commented 8 years ago

This may come across as a bit nitpicky, but that's not how it's intended; rather I think clarity in the language used to discuss these topics will help people's understanding of them.

projects with heavy dependencies relative to number of maintainers [emphasis added]

Shouldn't the term used here be "dependents" rather than "dependencies"? That is, here you mean "a lot of projects depend on [this project] relative to the number of people working on [this project]", right? For instance, Packagist uses this term, and shows stats for packages there; here's the dependents page for phpunit.

shazow commented 8 years ago

I'm not sure that we can reliably rank projects purely by quantitative metrics. Looking at programs like Google Summer of Code, Stripe Open Source Grant, and many others, they're not based on metrics (though metrics don't hurt).

I think it's best to think of the metrics mentioned here as a threshold: It's safe to assume that a project with no stars/downloads/etc. is not an important project yet. But projects with 200 stars, 2000 stars, and 20000 stars may very well be of equal or even inverse importance. The example I always use: ssh-chat has 2x the stars of urllib3, but probably 1/100000 the users.

I imagine that the best way to move forward is on a proposal-based system, akin to enhancement/improvement proposals commonly done for open source projects (such as Python's PEP). Maybe we can call it a Project Assistance Proposal (PAP). To qualify, the project would need to exceed some minimum metric threshold, then put out a call for some specific genre of assistance (financial or otherwise) and outline a plan for utilizing the assistance. Basically a grant proposal.
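
A tiny sketch of the "metrics as a threshold" step, to show that it is a gate and not a ranking. The metric names and cutoff values are invented for illustration, and letting a project qualify by clearing any one threshold is an assumption (stars and actual usage can diverge, as in the ssh-chat vs. urllib3 example).

```python
# Hypothetical minimums; real values would need discussion.
MIN_THRESHOLDS = {"stars": 100, "downloads_last_12_months": 10_000}


def qualifies(metrics, thresholds=MIN_THRESHOLDS):
    """Return True if the project clears at least one minimum threshold.
    Passing the gate only makes a project eligible to submit a proposal."""
    return any(metrics.get(key, 0) >= minimum for key, minimum in thresholds.items())


# Made-up projects and numbers.
projects = {
    "widely-used-library": {"stars": 40, "downloads_last_12_months": 5_000_000},
    "starred-toy": {"stars": 2_000, "downloads_last_12_months": 300},
    "brand-new": {"stars": 3, "downloads_last_12_months": 12},
}
eligible = [name for name, metrics in projects.items() if qualifies(metrics)]
print(eligible)
```

Everything past the gate would be decided by people reading the proposals, not by the numbers.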

wesc commented 8 years ago

I want to toss in a counter-intuitive notion.

1. If a project is valuable, it gets commits or forks.
2. If a company is supporting a project, then the commits occur during working hours.

Point #1 might seem problematic. We're trying to identify important projects that are at risk of stagnation. I think in practice there are a few kinds of stagnated projects: feature complete (tex), forked (Python PIL), or not useful. Devs using an unmaintained but high value project usually get commit access or fork it.

Hitting a project that's valuable but isn't already being supported, then, would imply we look for things that have large numbers of commits (or forks) in off hours. Of course determining what off hours are for any particular committer may be difficult, but maybe someone has a clever idea.
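
One possible starting point, sketched under simplifying assumptions: Git stores each commit's author timestamp with the author's UTC offset, so the local hour is available without guessing the timezone. Treating 09:00 to 17:00 on weekdays as "working hours" for everyone is an assumption, and the timestamps below are made up.

```python
from datetime import datetime


def off_hours_share(commit_times, work_start=9, work_end=17):
    """Fraction of commits made on weekends or outside work_start..work_end,
    judged in each commit's own local time (timezone-aware datetimes)."""
    if not commit_times:
        return 0.0

    def is_off_hours(ts):
        weekend = ts.weekday() >= 5  # Saturday is 5, Sunday is 6
        return weekend or not (work_start <= ts.hour < work_end)

    return sum(is_off_hours(ts) for ts in commit_times) / len(commit_times)


# Hypothetical author dates, e.g. as printed by `git log --format=%aI`.
commits = [datetime.fromisoformat(s) for s in (
    "2016-03-07T10:15:00+01:00",  # Monday morning
    "2016-03-07T22:40:00+01:00",  # Monday late evening
    "2016-03-12T14:00:00-08:00",  # Saturday afternoon
    "2016-03-09T16:59:00+00:00",  # Wednesday, just inside working hours
)]
print(off_hours_share(commits))
```

This ignores people whose working hours differ from the assumed window, which is exactly the difficulty noted above.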

nemesifier commented 8 years ago

I believe @shazow is spot on.

nayafia commented 8 years ago

@tnorthcutt great point! updated.

@wesc that's a very interesting metric (commits during working hours), though as you said harder to measure.

@shazow I think you're right, it seems like there's no obvious way to do this in a purely quantitative manner. And given the volume of projects right now (there are a lot, but we're not talking 10s of 1000s here), it seems reasonable to evaluate them qualitatively. I think having some sort of criteria helps me think through which factors to consider on a high level, and I feel like we've got some decent consensus on that, which is good.

jayfk commented 8 years ago

@PeterDP I've gone through the Open Hub API Terms of Use and one part troubles me:

You agree not to: [....] Combine or aggregate analysis, ratings, rankings or synthetic metrics created and reported by the Sites with data from your Application or from other sources to create composite metrics. Synthesized or collected Content obtained through the API Feature must stand on its own. [....]

That's exactly what's discussed here, isn't it?

jayfk commented 8 years ago

I was playing around with the APIs that were mentioned here and some other public data sources over the weekend and found it extremely hard to rank projects in a way that makes sense.

How could you possibly compare an open source operating system like debian to a css framework? Or a task scheduler like celery to a collection of icon fonts like font-awesome?

Font-awesome has almost 10 times more stars on github than celery. How on earth would you tell a dumb ranking algorithm what is more important?

We need brains for that, at least for the big and middle-sized stones.

pramsey commented 8 years ago

Need a hot-or-not game for OSS projects that devs can play while their NPM modules download and their native code compiles

jayfk / fundingoss.com

How to rank projects? #19
