ossf / census

📜Automated review of open source software projects

Adjust the Popularity measure in the risk index #21

Open skhakimov opened 9 years ago

skhakimov commented 9 years ago

Currently a package receives a point if it is in the top 90% of packages analyzed, which makes this a relative measure. Consider making it absolute by adjusting the measure to the top 5% of ALL Debian packages based on [1]. With more than 140K packages being tracked by the popularity contest, it is more sensible to reduce this measure to a much smaller percentage. Even 1% (~1400 packages) can be a reasonable threshold. Thanks.

[1] http://popcon.debian.org/

david-a-wheeler commented 9 years ago

I like this idea of emphasizing the top 5% of ALL Debian packages. Is 5% the right value, or should it be 1% or 2%? Perhaps we could give it 2 points if it's in the top 1%, and 1 point if it's otherwise in the top 5% of all packages; that would provide a little gradation. Issue #5 notes that we could add other popularity information sources; if that's done, we might need to revisit.

david-a-wheeler commented 9 years ago

Sam and I looked at the Debian popularity values in more detail. We think that giving additional scores at the 5% and 1% level would be justifiable (2 points if within the top 1% of popularity, 1 point if within the top 5% of popularity but not the top 1%). Here's why.

Looking at the popularity graph, the "knee" in the curve, which we'll define as the place where the absolute value of the slope of the curve is one, is at about package 5000 (out of 146754 packages). That means the curve switches to a slope of less than one at about 3.4% into the set. Since this is only a sample set, it makes sense to use a slightly broader definition, so I suggest that we cut off popularity at about 5% (since that would clearly include the 3.4% transition location), which would cut it off at package number 7338.
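A rough sketch of the slope-based knee definition above, for anyone who wants to reproduce it. The thread does not say how the axes were scaled, so this assumes both axes are normalized to the range 0 to 1 before the slope is measured; the function name `knee_by_unit_slope` is made up for illustration, and real popcon data may need smoothing because of ties between neighbouring packages.

```python
def knee_by_unit_slope(installs):
    """Return the 1-based rank where the normalized curve first flattens.

    `installs` is a list of install counts sorted in descending order.
    Assumption: rank and install count are both rescaled to [0, 1], and
    the knee is the first point whose forward-difference slope has an
    absolute value below one. Returns None if no such point exists.
    """
    n = len(installs)
    if n < 2:
        raise ValueError("need at least two data points")

    span = installs[0] - installs[-1]
    if span <= 0:
        return None  # flat data: slope is undefined after normalization

    dx = 1.0 / (n - 1)  # normalized step between consecutive ranks
    for i in range(n - 1):
        dy = (installs[i] - installs[i + 1]) / span  # normalized drop
        if dy / dx < 1.0:
            return i + 1  # convert 0-based index to 1-based rank
    return None


if __name__ == "__main__":
    # Small synthetic curve, not real popcon data.
    sample = [100, 50, 30, 20, 15, 12, 10, 9, 8, 7, 6]
    print(knee_by_unit_slope(sample))
```

With a different axis scaling the knee would land somewhere else, which is one reason to round the cutoff to a broader value such as 5%.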

We then re-examined these top 5% values, and there's another transition within that set at about 1% of the total number of packages. That is, the top 1% of all packages are ESPECIALLY popular.

Obviously the number of packages and their popularity changes over time; we want to use reasonable cutoffs that are a little less sensitive to exact values. Cutoffs of 5% and 1% are fairly common, and seem justified by the data set.
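A minimal sketch of the two-tier scoring proposed in this comment, assuming packages are ranked by popcon install count with rank 1 as the most popular. The function name `popularity_points` and the choice to round the cutoffs up are assumptions for illustration, not the project's actual code.

```python
def popularity_points(rank, total_packages):
    """Score a package by its popularity rank (1 = most installed).

    Returns 2 if the package is in the top 1%, 1 if it is in the top 5%
    but not the top 1%, and 0 otherwise. Cutoffs are rounded up, which
    is an assumed convention.
    """
    if not 1 <= rank <= total_packages:
        raise ValueError("rank must be between 1 and total_packages")

    # Integer ceiling of total * pct / 100, avoiding float rounding.
    top_1_cutoff = (total_packages * 1 + 99) // 100
    top_5_cutoff = (total_packages * 5 + 99) // 100

    if rank <= top_1_cutoff:
        return 2
    if rank <= top_5_cutoff:
        return 1
    return 0


if __name__ == "__main__":
    total = 146754  # package count quoted in this thread
    for rank in (1, 1468, 1469, 7338, 7339):
        print(rank, popularity_points(rank, total))
```

With the 146754 packages quoted here, the 5% cutoff comes out at rank 7338, matching the figure above. Because the cutoffs are recomputed from the total, the scoring stays stable as the package count drifts.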

david-a-wheeler commented 9 years ago

BTW, there seems to be no universal definition of a "knee" in a graph. More complex systems for defining and finding knees in curves (compared to what we used) can be found here:

These involve finding the maximum of the curvature, which for a continuous function is: curve(x) = y'' / ((1 + (y')^2)^1.5) where y' is the first derivative and y'' is the second derivative of y=f(x). Mathematical detail at https://en.wikipedia.org/wiki/Curvature and http://mathworld.wolfram.com/Curvature.html (among other places).

I don't think we need to dig into these more complex systems for our purposes.
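For reference, the curvature formula quoted above can be estimated numerically along the following lines. This is only a sketch: it assumes evenly spaced x values, uses central finite differences for y' and y'', and the function names are invented for illustration. Noisy data would need smoothing first.

```python
def curvature(xs, ys):
    """Estimate y'' / (1 + (y')^2)^1.5 at each interior point.

    Assumes `xs` is evenly spaced. Derivatives are approximated with
    central finite differences, so the two end points are skipped.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need matching lists with at least three points")

    h = xs[1] - xs[0]
    result = []
    for i in range(1, len(xs) - 1):
        d1 = (ys[i + 1] - ys[i - 1]) / (2 * h)            # first derivative
        d2 = (ys[i + 1] - 2 * ys[i] + ys[i - 1]) / (h * h)  # second derivative
        result.append(d2 / (1 + d1 * d1) ** 1.5)
    return result


def max_curvature_index(xs, ys):
    """Return the index into xs where the absolute curvature is largest."""
    ks = curvature(xs, ys)
    best = max(range(len(ks)), key=lambda i: abs(ks[i]))
    return best + 1  # shift by one because the first point was skipped


if __name__ == "__main__":
    # Parabola y = x^2: curvature peaks at the vertex x = 0.
    xs = [i * 0.5 - 2.0 for i in range(9)]
    ys = [x * x for x in xs]
    i = max_curvature_index(xs, ys)
    print(xs[i], curvature(xs, ys)[i - 1])
```

As with the slope-based definition, the result depends on how the axes are scaled, so the knee it reports is not unique to the data.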

skhakimov commented 9 years ago

Popularity chart by installations for all Debian packages: [image: popularity_chart]

Popularity chart of the top 5% of Debian packages: [image: popularity5_chart]

Data was obtained from: http://popcon.debian.org/by_inst