golang / gddo

Go Doc Dot Org
https://godoc.org
BSD 3-Clause "New" or "Revised" License

Proposal for Package Ranking #320

Open carlisia opened 9 years ago

carlisia commented 9 years ago

Overview

Rank by Objective Quality Standards

A mechanism to transparently and objectively rank projects by quality level.

We're proposing to avoid the use of a user-based rating or feedback system. It would require the creation and maintenance of accounts and would be subject to abuse. Instead, we propose using a published set of project quality standards that are not subject to simple manipulation:

This could use (via API or similar) the system implemented by GoReportCard, for example.

The results of the ranking score would be used as the primary sort mechanism when browsing packages by search or by category. The information would also be linked/displayed at the top of each documentation page on Godoc.org.

References: Examples of quality assessment tools: https://medium.com/@jgautheron/quality-pipeline-for-go-projects-497e34d6567

Include Ranking in Display

Once the ranking system is in place, include a summary of the score (and link to view details/explanation) on both the documentation pages and in a second column on search results.


Contributors to this proposal

@carlisia @gdey @rafaeljusto

adg commented 9 years ago

I'm in favor of adding more signal to godoc.org's ranking algorithm. As such, I'm generally in favor of this proposal.

Let me comment on a few of the proposed metrics, and how they might be computed:

Percent of code coverage achieved by unit tests

This requires integration with a continuous integration system. As such, it's probably the hardest one to gather data for. Defer til last. (Also, this is easily gamed; just add a function 100k lines long and write one test to invoke it.)

Presence of package and type documentation for public types

Sounds good, if we're not doing this already (can't recall). I implemented this in a separate project a long time ago.

Number of forks and downloads on repository. Measure of recent activity on the project (commits in the last X days)

How do we measure repository downloads? GitHub stars seem like good signals. Forks are ambiguous (@garyburd raises some interesting questions).

Number of includes in other public projects.

We already have this data. Seems like a no-brainer.

Number of pageviews for that project on GoDoc.org

We're not currently gathering this data, but it's probably worth doing.

When we decide to embark on implementing any of these specific metrics, please create a separate issue for that particular metric so that we can nail down the design before implementation.

jbuberel commented 9 years ago

@garyburd That's a good point about low activity on an established, high-quality project. Do you agree with @adg's suggestion that GitHub stars are a reasonable proxy for "this is a good project"?

One of the odd things about Github stars is that they're cumulative, with no decay function. You can star a project at a point when it was well maintained. A year goes by, it's been abandoned, but your star is still there. Hmmm. I'm having a hard time thinking of a better replacement.

@adg's spot on about the percent test coverage metric. That would require real compute time (and real money). That being said, it would be one of the better signals of project quality.

Regarding download counts: I do not think there is any real way of doing this. Even the GitHub repos API does not report this. You do get stargazers_count though.

gdey commented 9 years ago

@garyburd

What are your goals for search ranking and how does ranking by these metrics support the goals? One goal might be to help developers find the best package in some domain. Ranking by number of imports supports this goal because each import is an indication that a developer found the package useful.

As you pointed out there are two goals for developers looking for packages.

  1. Make it easier for developers to find packages in a domain.
  2. Make it easier for developers to evaluate the quality of packages in a domain.

Towards this end, rankings based on imports and tests help. As far as tests go, I think the bar needs to be fairly liberal: is there at least one test, and does it cover at least 5% of the code? I don't think we should expect the percentage to be more than a single digit. I think the more important metric is usage in other projects.

For quality, I think the percentage of documentation for the package and its public members is important, and the presence of examples, especially testable examples, is even more important. So is a ranking based on the output of go fmt, go vet, and other code linting tools.

But there is another audience as well: the package authors themselves. For them, the goal is to provide feedback on how they can improve the quality of their projects. A ranking system that is objective, and that shows you what you need to do to improve your package's standing, will, I believe, raise the baseline quality of all packages.

Ranking by recent activity does not necessarily support this goal. Activity on a high quality mature package can be low. Activity on a buggy package can be high.

A part of the original proposal document that went towards building these requests was left out by mistake.

Archive Expired Packages

An archive section for libraries that have been inactive for a given period of time and are not used by a minimum number of other active projects. This would help keep the listing to only active projects. Inactive is defined as "Most recent commit > 365 days ago" and "number of imports < 10", and can be adjusted.

The 365 and 10 are just placeholders and can be changed if needed. The idea here is that we also need to group or filter packages so that there isn't an overwhelming amount of choice.

I would like to reference #90 as another issue that is trying to solve the problem of there being too many packages to see the forest for the trees.

adg commented 9 years ago

Note that this is related to #52

adg commented 9 years ago

And #172

jbuberel commented 9 years ago

This is a terrific idea, @garyburd:

I suggest writing a command line app to generate a list of the packages to archive for a given import count and last commit time and checking the results to see if the filter is doing the right thing.

Implement the proposed filtering criteria, then test against the search results for a common term with a large result set, such as "web" or "sql" or "middleware". Generate side-by-side diff-able output with and without the filter.

Once we're confident that the filtering is "fair", then move onto changes in the ranking. Again, with the intent of being able to compare current vs proposed ranking so we can vet the diffs.

jbuberel commented 9 years ago

@garyburd Very much agreed.

rafaeljusto commented 9 years ago

@garyburd I started building the command line: https://github.com/rafaeljusto/gddoexp

dmitshur commented 9 years ago

The expired package idea looks promising. My intuition is that a conservative filter of 0 imports from outside of the repo and no commits in two years will filter out a lot of junk.

I think that's a great idea. But, an observation, that will have false positives for commands or libraries meant to be used at go generate time, since they're typically imported from other packages in // +build ignore files. For similar reasons, it won't work for libraries that are meant to be used with OSes or architectures that godoc does not support/know about.

rafaeljusto commented 9 years ago

I ran the tool that checks the expired packages on a database dump from 2015-10-01. It analyzed 132277 packages in 36h45m due to Github rate limit policies. The results:

% Description
3.88 not a Github project
2.75 should be archived
0.63 unexpected status code from Github

The unexpected status code from Github is probably a rate limit issue (403 Forbidden) that could be solved by adjusting the token bucket values or analyzing the HTTP response headers from Github. The tool's algorithm currently makes two checks to identify if a package should be archived:

We also got 6 connection timeouts.

jbuberel commented 9 years ago

Indeed, the list of "should be archived" is crucial here. We need to sanity check that to ensure that no legitimately keep-worthy projects would get the archive treatment :-) Can you @rafaeljusto pastebin or gist it for us?

rafaeljusto commented 9 years ago

I checked some of them to see if they were modified in the last 2 years. But I didn't check if they were referenced by other packages, I'm trusting in the gddo database information.

Here is the list of packages to archive: https://gist.github.com/rafaeljusto/0ef14863b39c23517e0a

rafaeljusto commented 9 years ago

Sure! I've created another program that reports packages with score 0 (zero) from an input list. So, from the list of packages that should be archived we have:

% Description
67.30 has no score
32.70 has score

The list of packages with a score that should be archived is here: https://gist.github.com/rafaeljusto/d2795a100f4661b9b126

rafaeljusto commented 9 years ago

Working on it. =)

rafaeljusto commented 9 years ago

I've created a new filter that checks for forks with a maximum of 2 commits in the week after the fork date. I called them "fast forks".

On the list of scored packages that should be archived, when applying this filter we got:

% Description
51.05 fast fork
48.95 not fast fork

The list of packages after applying these 2 filters can be found below: https://gist.github.com/rafaeljusto/3131e9e43c905d2e0808

jbuberel commented 9 years ago

I just spot-checked about 50 items from the new list, and I didn't see any false-positives (project that would have been archived but should not have been). So far, LGTM.

jbuberel commented 9 years ago

Sounds like we're in general agreement that the new fast-fork filter is working well as an identifier of packages that should be considered "archived" and therefore not displayed in search results.

Using that filtered set as the base, it makes sense to begin experimenting with the rankings (the primary goal of this proposal). Given that gddo already takes import counts into account, how about an experiment to apply the stored page view counts for a small set of common search terms ("sql" and "middleware"), with outputs that allow us to diff/compare the ordering of:

rafaeljusto commented 9 years ago

I ran the tool again, this time replacing the 2-year condition with the fast fork condition. It analyzed 132277 packages in 4h0m42s (we got many cache hits). The results:

% Description
3.88 not a Github project
14.14 should be archived
1.47 unexpected status code from Github

We also got 9 connection timeouts.

The new list of packages that could be archived is below: https://gist.github.com/rafaeljusto/db8318b69efb10f622aa

The share of packages to archive increased by 11.39 percentage points compared with the first result (from 2.75% to 14.14%). I think we could apply both rules: we archive if the package is a fast fork or has gone more than two years with no changes, given that no other packages reference it.

There are other cases where we got a 404 from the Github API that we could also archive, but these are only a few cases. I still need to work on the tool to avoid the rate limit and decrease the "unexpected status code from Github" percentage.

PS: I will be offline for a week (hello vacations!)

carlisia commented 9 years ago

This writeup is relevant for this discussion: https://github.com/mikeal/go-stats/blob/master/README.md

jamra commented 8 years ago

@carlisia I don't know why package count should be compared with other communities. It shouldn't matter. Some Go projects on github divide their package into many small packages. That is the Go way to do it and helps make things go gettable.

In terms of ranking packages on godoc, you bring up a good point: We can use an "imported by" metric to rank packages. That could remove some of the noise added by some of these sub packages.