andrew opened this issue 6 years ago
The initial pass at implementing SourceRank 2.0 will be focused on realigning how scores are calculated and shown using existing metrics and data that we have, rather than collecting/adding new metrics.
The big pieces are:
There are a few places where the sourcerank 1.0 details/score are exposed in the API that need to be kept around for backwards compatibility:

- `rank` field is present wherever project records are serialized for the API
- `rank` sort option
- `rank` columns in the open data release files: `projects` and `projects_with_repository_fields`
For the first pass, sourcerank 2.0 is going to focus on just packages published to package managers, we'll save the repository sourcerank update for later as it doesn't quite match so well with the focus on projects.
So the following repo sourcerank details in the API will remain the same:

- `rank` field is present wherever repository records are serialized for the API
- `rank` sort option

Starting to implement bits over here: https://github.com/librariesio/libraries.io/pull/2056
One other thing that springs to mind: the first pass of the implementation will focus at the project level and only really consider the latest release.
Then we'll move on to tackle https://github.com/librariesio/libraries.io/issues/475 which will store more details at a per-version level, which will then allow us to calculate the SourceRank for each version.
Making good progress on filling in the details of the calculator. The current approach for scoring is to take the average of the different category scores, and for each category to take the average of the different scores that go into it, with a maximum score of 100.
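As a rough sketch of that averaging scheme (the method names and example category values below are illustrative assumptions, not the actual calculator code):

```ruby
# Minimal sketch: overall score is the average of category averages,
# capped at 100. Names and inputs are assumptions for illustration.
def category_score(scores)
  scores.sum / scores.length.to_f
end

def overall_score(categories)
  averages = categories.values.map { |scores| category_score(scores) }
  [(averages.sum / averages.length).round, 100].min
end

overall_score(popularity: [0, 100], quality: [100, 100]) # => 75
```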
Example of the current breakdown of implemented scores:
Things to think about soon:
Also because scores now take into account other projects within an ecosystem, we'll likely want to recalculate lots of scores at the same time in an efficient way, for example:
Added an actual method to output the breakdown of the score:
```json
{
"popularity": {
"dependent_projects": 0,
"dependent_repositories": 0
},
"community": {
"contribution_docs": {
"code_of_conduct": false,
"contributing": false,
"changelog": false
}
},
"quality": {
"basic_info": {
"description": true,
"homepage": false,
"repository_url": false,
"keywords": true,
"readme": false,
"license": true
},
"status": 100
}
}
```
Thinking about dependency-related scores, here's my current thinking:
Eventually we should also look at the size and complexity of the package, as those rules could encourage vendoring of dependencies to avoid them reducing the score.
First pass at an implementation for the calculator is complete in https://github.com/librariesio/libraries.io/pull/2056, going to kick the tires with some data next
Current output for a locally synced copy of Split:
```json
{
"popularity": {
"dependent_projects": 0,
"dependent_repositories": 0,
"stars": 0,
"forks": 0,
"watchers": 0
},
"community": {
"contribution_docs": {
"code_of_conduct": true,
"contributing": true,
"changelog": true
},
"recent_releases": 0,
"brand_new": 100,
"contributors": 100,
"maintainers": 50
},
"quality": {
"basic_info": {
"description": true,
"homepage": true,
"repository_url": true,
"keywords": true,
"readme": true,
"license": true
},
"status": 100,
"multiple_versions": 100,
"semver": 100,
"stable_release": 100
},
"dependencies": {
"outdated_dependencies": 100,
"dependencies_count": 0,
"direct_dependencies": {
"sinatra": 42.99146412037037,
"simple-random": 45.08333333333333,
"redis": 42.58333333333333
}
}
}
```
I just found an issue with SourceRank's "Follows SemVer" scoring that is not mentioned here. I'm hoping it can be resolved in the next version of SourceRank, and that this is the right place to bring this up:
The relevant standard that Python packages must adhere to for versioning is defined in PEP 440: https://packaging.python.org/tutorials/distributing-packages/#standards-compliance-for-interoperability
As that page says, “the recommended versioning scheme is based on Semantic Versioning, but adopts a different approach to handling pre-releases and build metadata”.
Here is an example Python package where all the release versions are both PEP 440- and SemVer-compliant.
But the pre-release versions are PEP 440-compliant only. Otherwise interoperability with Python package managers and other tooling would break.
The current SourceRank calculation gives this package (and many other Python packages like it) 0 points for "Follows SemVer" due to having published pre-releases. (And only counting what was published recently wouldn't fix this.)
Can you please update the SourceRank calculation to take this into account? Perhaps just ignoring the SemVer requirement for pre-release versions of Python packages, a requirement that is impossible for them to satisfy without breaking interop within the Python ecosystem, would be a relatively simple fix?
Thanks for your consideration and for the great work on libraries.io!
@jab ah I didn't know that, yeah I think that makes sense, rubygems has a similar invalid semver prerelease format.
I was thinking about adding in some ecosystem-specific rules/changes; this is a good one to experiment with.
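To illustrate the problem, here's a rough sketch (the regex below is a deliberately simplified stand-in for a SemVer check, not the actual libraries.io implementation):

```ruby
# Simplified SemVer pattern: MAJOR.MINOR.PATCH with optional
# "-prerelease" and "+build" parts. Real-world checks are stricter.
SEMVER = /\A\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?\z/

["1.0.0", "1.0.0-rc.1", "1.0.0rc1"].each do |version|
  puts "#{version}: #{version.match?(SEMVER) ? 'valid' : 'invalid'} SemVer"
end
# 1.0.0: valid SemVer
# 1.0.0-rc.1: valid SemVer
# 1.0.0rc1: invalid SemVer  <- PEP 440-style pre-release fails the check
```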
Sourcerank 2.0 things I've been thinking about over the long weekend:

- a new `rank` column on the projects table, most likely called `sourcerank_2`, with a default of 0
- a `sourcerank_2_last_calculated` column that defaults to null for projects that have never had a score calculated
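A minimal sketch of what that migration could look like (only the column names come from the notes above; the Rails version tag and everything else is assumed):

```ruby
# Hypothetical migration sketch; only the column names come from the
# notes above, the rest is an assumption.
class AddSourcerank2ToProjects < ActiveRecord::Migration[5.2]
  def change
    add_column :projects, :sourcerank_2, :integer, default: 0
    add_column :projects, :sourcerank_2_last_calculated, :datetime, default: nil
  end
end
```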
Packagist/composer also defines a similar scheme: https://getcomposer.org/doc/04-schema.md#version
It seems like "follows [ecosystem-specific version schema]" would be a better criterion than "follows SemVer", but I can see that this could add a lot of additional complexity when you deal with a lot of languages.
I've added `runtime_dependencies_count` fields to both projects and versions, and am running some background tasks to backfill those counts on all package managers where we have both the concept of Versions and Dependencies.
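A backfill along those lines might look roughly like this (the model names and the `"runtime"` kind are assumptions based on the discussion, not the exact task that was run):

```ruby
# Hypothetical backfill sketch: recount runtime dependencies for every
# version in batches; Version/Dependency model names are assumed here.
Version.find_each do |version|
  count = version.dependencies.where(kind: "runtime").count
  version.update_column(:runtime_dependencies_count, count)
end
```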
Also highlights the need for https://github.com/librariesio/libraries.io/issues/543 sooner rather than later for all the package managers that use Tags rather than Versions.
Most of the `runtime_dependencies_count` tasks finished overnight; just Python and Node.js are still running.
Versions updated: 11,658,167. Projects updated: 716,828.
And just for fun, the average number of runtime dependencies across:

- all versions: 3.03
- all projects: 1.69
Project data broken down by ecosystem:
"NuGet"=>1.71, "Haxelib"=>0.65, "Packagist"=>1.98, "Homebrew"=>1.16, "CPAN"=>4.23, "Atom"=>1.14, "Dub"=>0.72, "Elm"=>2.44, "Puppet"=>1.37, "Pub"=>2.08, "Rubygems"=>1.45, "Cargo"=>0.51, "Maven"=>0.95, "Hex"=>1.31, "NPM"=>2.52, "CRAN=>0.0
*CRAN doesn't use `runtime` as a dependency kind, so we might need to tweak that slightly; `imports` seems like the most appropriate.
Python is still running: `"Pypi"=>0.08`
Some other things to note down:

- pass in `max_dependent_projects`, `max_dependent_repositories` etc. when initializing the `SourceRankCalculator` object to enable faster score generation for multiple projects from the same ecosystem
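That precomputation could look something like this sketch (the keyword arguments and column names are assumptions, not the real class):

```ruby
# Hypothetical sketch: accept precomputed ecosystem maximums so a batch
# run doesn't repeat one max() query per project. Names are assumed.
class SourceRankCalculator
  def initialize(project, max_dependent_projects: nil)
    @project = project
    @max_dependent_projects = max_dependent_projects ||
      Project.where(platform: project.platform).maximum(:dependents_count)
  end

  def dependent_projects_score
    return 0 if @max_dependent_projects.to_i.zero?
    @project.dependents_count * 100.0 / @max_dependent_projects
  end
end
```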
Next up I'm going to add the `sourcerank_2` and `sourcerank_2_last_calculated` columns to projects, then experiment with generating some actual rank figures for one ecosystem.
Another thing that springs to mind:
Here's what the breakdown looks like now for rails (on my laptop, missing some data):
```json
{
"overall_score": 81,
"popularity": {
"score": 60.0,
"dependent_projects": 0.0,
"dependent_repositories": 0.0,
"stars": 100.0,
"forks": 100.0,
"watchers": 100.0
},
"community": {
"score": 93.33333333333333,
"contribution_docs": {
"code_of_conduct": true,
"contributing": true,
"changelog": false
},
"recent_releases": 100,
"brand_new": 100,
"contributors": 100,
"maintainers": 100
},
"quality": {
"score": 80.0,
"basic_info": {
"description": true,
"homepage": true,
"repository_url": true,
"keywords": true,
"readme": true,
"license": true
},
"status": 100,
"multiple_versions": 100,
"semver": 0,
"stable_release": 100
},
"dependencies": {
"score": 89.66666666666667,
"outdated_dependencies": 100,
"dependencies_count": 89,
"direct_dependencies": {
"sprockets-rails": 72,
"railties": 83,
"bundler": 77,
"activesupport": 84,
"activerecord": 75,
"activemodel": 85,
"activejob": 83,
"actionview": 82,
"actionpack": 82,
"actionmailer": 81,
"actioncable": 81
}
}
}
```
More fiddling with local data, tables to compare sourcerank 1 and 2 scores this time.
Top 25 local ruby projects ordered by sourcerank 1:
Name | SourceRank 1 | SourceRank 2 |
---|---|---|
activesupport | 18 | 86/100 |
rspec | 18 | 77/100 |
sinatra | 17 | 80/100 |
bundler | 17 | 76/100 |
rubocop | 17 | 81/100 |
actionpack | 17 | 82/100 |
rdoc | 17 | 81/100 |
activemodel | 17 | 85/100 |
kramdown | 16 | 78/100 |
activerecord | 16 | 75/100 |
split | 16 | 74/100 |
railties | 15 | 83/100 |
rails | 15 | 81/100 |
actioncable | 15 | 81/100 |
test-unit | 15 | 79/100 |
activejob | 15 | 83/100 |
actionmailer | 15 | 81/100 |
concurrent-ruby | 15 | 75/100 |
pry | 15 | 70/100 |
simplecov | 15 | 78/100 |
cucumber | 15 | 76/100 |
actionview | 15 | 82/100 |
guard-rspec | 14 | 72/100 |
json_pure | 14 | 70/100 |
ffi | 14 | 75/100 |
Top 25 local ruby projects by both sourcerank 1 and 2
SourceRank 1 | SourceRank 2 |
---|---|
activesupport (18) | activesupport (86) |
rspec (18) | activemodel (85) |
sinatra (17) | railties (83) |
bundler (17) | activejob (83) |
rubocop (17) | actionpack (82) |
actionpack (17) | actionview (82) |
rdoc (17) | rubocop (81) |
activemodel (17) | rdoc (81) |
kramdown (16) | actionmailer (81) |
activerecord (16) | rails (81) |
split (16) | actioncable (81) |
railties (15) | sinatra (80) |
rails (15) | test-unit (79) |
actioncable (15) | kramdown (78) |
test-unit (15) | simplecov (78) |
activejob (15) | rspec (77) |
actionmailer (15) | redcarpet (77) |
concurrent-ruby (15) | sprockets (77) |
pry (15) | arel (77) |
simplecov (15) | bundler (76) |
cucumber (15) | cucumber (76) |
actionview (15) | webmock (76) |
guard-rspec (14) | thor (76) |
json_pure (14) | i18n (76) |
ffi (14) | mini_mime (76) |
Same as the first table but with the top 50 and a GitHub stars column included:
Name | SourceRank 1 | SourceRank 2 | Stars |
---|---|---|---|
activesupport | 18 | 86/100 | 39180 |
rspec | 18 | 77/100 | 2237 |
activemodel | 17 | 85/100 | 39180 |
actionpack | 17 | 82/100 | 39180 |
rdoc | 17 | 81/100 | 459 |
rubocop | 17 | 81/100 | 8875 |
sinatra | 17 | 80/100 | 9907 |
bundler | 17 | 76/100 | 4152 |
kramdown | 16 | 78/100 | 1199 |
activerecord | 16 | 75/100 | 39180 |
split | 16 | 74/100 | 2105 |
railties | 15 | 83/100 | 39180 |
activejob | 15 | 83/100 | 39180 |
actionview | 15 | 82/100 | 39180 |
actioncable | 15 | 81/100 | 39180 |
rails | 15 | 81/100 | 39180 |
actionmailer | 15 | 81/100 | 39180 |
test-unit | 15 | 79/100 | 170 |
simplecov | 15 | 78/100 | 3325 |
cucumber | 15 | 76/100 | 4974 |
concurrent-ruby | 15 | 75/100 | 4020 |
pry | 15 | 70/100 | 5100 |
redcarpet | 14 | 77/100 | 4180 |
thor | 14 | 76/100 | 4012 |
webmock | 14 | 76/100 | 2795 |
ffi | 14 | 75/100 | 1462 |
guard-rspec | 14 | 72/100 | 1122 |
racc | 14 | 71/100 | 347 |
mocha | 14 | 71/100 | 921 |
json_pure | 14 | 70/100 | 495 |
i18n | 13 | 76/100 | 696 |
capybara | 13 | 75/100 | 8293 |
rake-compiler | 13 | 73/100 | 444 |
rdiscount | 13 | 70/100 | 764 |
aruba | 13 | 69/100 | 803 |
hoe-bundler | 13 | 66/100 | 6 |
RedCloth | 13 | 66/100 | 430 |
coderay | 13 | 65/100 | 704 |
sprockets | 12 | 77/100 | 555 |
liquid | 12 | 75/100 | 6251 |
gherkin | 12 | 75/100 | 249 |
uglifier | 12 | 74/100 | 512 |
slop | 12 | 74/100 | 886 |
globalid | 12 | 74/100 | 591 |
erubi | 12 | 74/100 | 198 |
backports | 12 | 73/100 | 283 |
rails-html-sanitizer | 12 | 73/100 | 143 |
mime-types | 12 | 72/100 | 248 |
mustermann | 12 | 69/100 | 564 |
hoe-git | 12 | 68/100 | 24 |
All of these are missing a lot of the "popularity" indicators as I just synced a few hundred rubygems locally without all the correct dependent counts.
And at the bottom end of the chart:
Name | SourceRank 1 | SourceRank 2 |
---|---|---|
gem_plugin | 2 | 43/100 |
shellany | 3 | 45/100 |
text-hyphen | 3 | 48/100 |
method_source | 3 | 48/100 |
fastthread | 3 | 51/100 |
text-format | 4 | 36/100 |
coveralls | 4 | 42/100 |
shotgun | 4 | 43/100 |
cucumber-wire | 4 | 43/100 |
spoon | 4 | 43/100 |
rack-mount | 4 | 44/100 |
therubyracer | 4 | 44/100 |
actionwebservice | 4 | 44/100 |
rbench | 4 | 44/100 |
codeclimate-test-reporter | 4 | 44/100 |
mini_portile | 4 | 46/100 |
abstract | 4 | 47/100 |
mongrel | 4 | 48/100 |
win32console | 4 | 50/100 |
markaby | 4 | 51/100 |
cgi_multipart_eof_fix | 4 | 52/100 |
activestorage | 5 | 47/100 |
colorize | 5 | 48/100 |
pry-doc | 5 | 48/100 |
rest-client | 5 | 48/100 |
http-cookie | 5 | 48/100 |
tool | 5 | 48/100 |
polyglot | 5 | 49/100 |
jsminc | 5 | 49/100 |
multi_test | 5 | 50/100 |
activeresource | 5 | 50/100 |
guard-compat | 5 | 52/100 |
mini_portile2 | 5 | 52/100 |
redis | 5 | 52/100 |
temple | 5 | 52/100 |
less | 5 | 53/100 |
eventmachine | 5 | 54/100 |
coffee-script-source | 5 | 54/100 |
tilt | 5 | 54/100 |
erubis | 5 | 54/100 |
diff-lcs | 5 | 54/100 |
rubyforge | 5 | 57/100 |
yard | 5 | 60/100 |
fakeredis | 6 | 49/100 |
launchy | 6 | 50/100 |
slim | 6 | 52/100 |
simple-random | 6 | 53/100 |
term-ansicolor | 6 | 53/100 |
oedipus_lex | 6 | 53/100 |
haml | 6 | 53/100 |
Taking text-format, a low scoring project, as an example of things to possibly change:
```json
{
"overall_score": 36,
"popularity": {
"score": 0.0,
"dependent_projects": 0.0,
"dependent_repositories": 0,
"stars": 0,
"forks": 0,
"watchers": 0
},
"community": {
"score": 30.0,
"contribution_docs": {
"code_of_conduct": false,
"contributing": false,
"changelog": false
},
"recent_releases": 0,
"brand_new": 100,
"contributors": 0,
"maintainers": 50
},
"quality": {
"score": 66.66666666666666,
"basic_info": {
"description": true,
"homepage": true,
"repository_url": false,
"keywords": false,
"readme": false,
"license": false
},
"status": 100,
"multiple_versions": 0,
"semver": 100,
"stable_release": 100
},
"dependencies": {
"score": 49.0,
"outdated_dependencies": 0,
"dependencies_count": 99,
"direct_dependencies": {
"text-hyphen": 48
}
}
}
```
- `readme` and `contribution_docs` probably shouldn't be false if no repo is present; they should just be `nil` and skipped
- `contributors` should be `nil` and skipped if a repo isn't present
- `stars`, `forks` and `watchers` should be `nil` and skipped if a repo isn't present

Looking inside the source of the gem, there is:
We can definitely add support for detecting changelog and readme to the version-level metadata and feed that back in here once complete.
Updated the calculator to not punish projects that aren't on GitHub; here's the new breakdown for text-format, with an increased score from 36 to 42:
```json
{
"overall_score": 42,
"popularity": {
"score": 0.0,
"dependent_projects": 0.0,
"dependent_repositories": 0,
"stars": null,
"forks": null,
"watchers": null
},
"community": {
"score": 50.0,
"contribution_docs": {
"code_of_conduct": null,
"contributing": null,
"changelog": null
},
"recent_releases": 0,
"brand_new": 100,
"contributors": null,
"maintainers": 50
},
"quality": {
"score": 68.0,
"basic_info": {
"description": true,
"homepage": true,
"repository_url": false,
"keywords": false,
"readme": null,
"license": false
},
"status": 100,
"multiple_versions": 0,
"semver": 100,
"stable_release": 100
},
"dependencies": {
"score": 49.0,
"outdated_dependencies": 0,
"dependencies_count": 99,
"direct_dependencies": {
"text-hyphen": 48
}
}
}
```
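The key change is that unknown values are skipped when averaging rather than counted as zero; a minimal sketch of that idea (not the actual calculator code):

```ruby
# Sketch: nil means "not measurable for this project", so drop it from
# the average instead of letting it pull the score towards zero.
def category_score(scores)
  present = scores.compact
  return nil if present.empty? # whole category skipped if nothing is measurable
  present.sum / present.length.to_f
end

category_score([100, nil, 0]) # => 50.0 (nil skipped, not treated as 0)
category_score([nil, nil])    # => nil  (category excluded entirely)
```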
Other related areas to think about when it comes to different levels of support for package manager features we have:

- if we don't have `dependencies` for that package manager, the dependency score should be skipped
- if we don't have `maintainers` for that package manager, the maintainers score should be skipped
- `outdated_dependencies_score` should be a percentage based on the number of dependencies it has that are outdated

Probably also want to skip the `dependent_*` popularity scores if we don't have support for them in that ecosystem; just wondering how many ecosystems that will affect:
Name | Dependent Projects | Dependent Repos |
---|---|---|
Alcatraz | false | false |
Atom | true | true |
Bower | false | true |
CPAN | true | true |
CRAN | true | true |
Cargo | true | true |
Carthage | false | true |
Clojars | false | true |
CocoaPods | false | true |
Dub | true | true |
Elm | true | true |
Emacs | false | false |
Go | false | true |
Hackage | false | true |
Haxelib | true | true |
Hex | true | true |
Homebrew | true | false |
Inqlude | false | false |
Julia | false | true |
Maven | true | true |
Meteor | false | true |
npm | true | true |
Nimble | false | false |
NuGet | true | true |
Packagist | true | true |
PlatformIO | false | false |
Pub | true | true |
Puppet | true | false |
PureScript | false | false |
PyPI | true | true |
Racket | false | false |
Rubygems | true | true |
Shards | false | true |
Sublime | false | false |
SwiftPM | false | true |
WordPress | false | false |
Three of the double-false package managers in the table are editor plugins and don't really do dependencies: Alcatraz, Emacs, Sublime.
The others are either smaller, lack our support for versions, or don't have a concept of dependencies: Inqlude, Nimble, PlatformIO, PureScript, Racket, WordPress.
Atom is also a little weird here because it depends on npm modules and uses package.json, so it doesn't really have either of its own, but is flagged as having both; basically all the editor plugins don't really work for `dependent_*` scores.
Three (very early) concepts for the popover that explains what the SourceRank 2.0 rating is
A few known issues we need to tackle:
A couple other screenshots of bits I was experimenting with on Friday in this branch: https://github.com/librariesio/libraries.io/tree/sourcerank-view
Making progress on Sourcerank 2.0 (now known as Project Score, because trademarks :sweat_smile:) again. I've merged and deployed https://github.com/librariesio/libraries.io/pull/2056 and have calculated the scores for all the rust packages on Cargo; will report back on the score breakdowns shortly.
Similar graphs for Hex, the elixir package manager:
Sourcerank 1.0 distribution:
Sourcerank 2.0 distribution:
Projects with low scores are receiving quite a large boost from having zero or very few high scoring dependencies, which made me think that maybe we should skip dependency scores for projects with no dependencies.
But @kszu made a good point in slack: rake is very highly used and has no dependencies, which is seen as a plus; skipping the dependencies score for it would lower its score.
We do skip the whole dependency block on a per-ecosystem basis if there's no support for measuring dependencies, but if we do support it, it feels like we should keep the same set of rules for each package within a given ecosystem.
Successfully generated the new scores for all rubygems, distribution curve is looking pretty good:
I've now implemented a `ProjectScoreCalculationBatch` class which calculates scores for a number of projects in a single ecosystem and returns the dependent projects of the ones whose scores changed.
This is backed by a set of queues stored in Redis (one for each platform) containing the ids of projects that need recalculation; the queues slowly empty after each run, which avoids recalculating things over and over.
Overall, calculating the score for a number of projects in an ecosystem is much faster than sourcerank 1.0, mostly because the `ProjectScoreCalculationBatch` preloads much of the information required from the database.
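Roughly, the flow is a per-platform Redis list of project ids that each run drains and then tops back up with affected dependents. A hypothetical sketch (names and internals here are assumptions; see the PR linked above for the real code):

```ruby
# Hypothetical sketch of the per-platform queue flow described above.
require "redis"

class ProjectScoreCalculationBatch
  def self.redis
    @redis ||= Redis.new
  end

  def self.queue_key(platform)
    "project_score_queue:#{platform.downcase}"
  end

  def self.enqueue(platform, project_ids)
    redis.rpush(queue_key(platform), project_ids) if project_ids.any?
  end

  def self.run(platform, limit: 1_000)
    key = queue_key(platform)
    ids = redis.lrange(key, 0, limit - 1) # take a chunk off the queue
    redis.ltrim(key, limit, -1)           # (not atomic; fine for a sketch)

    # recalculate each score and collect dependents whose parents changed
    changed_dependent_ids = ids.flat_map { |id| recalculate(id) }
    enqueue(platform, changed_dependent_ids)
  end

  # stub: would run the calculator and return dependent project ids
  # whenever the score changed
  def self.recalculate(project_id)
    puts "recalculating score for project #{project_id}"
    []
  end
end
```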
Projects from Rubygems, Cargo and Hex are automatically being queued for recalculation after being saved; more platforms will be enabled once the initial scores have been calculated.
It's now enabled on: alcatraz, atom, cargo, carthage, dub, elm, emacs, haxelib, hex, homebrew, inqlude, julia, nimble, pub, purescript, racket, rubygems, sublime, swiftpm
`ProjectScoreCalculationBatch.run_all` is being run by a cron job every 10 minutes; if all goes well overnight, it will do the initial score calculations for some more of the larger platforms.
Next steps:
Project scores are now being calculated for all platforms. The backlog is pretty long and will likely take 24 hours to work through all the calculations.
Adding a basic score overview page:
Going to work on improving the explanations around each element of the score breakdown page, and on including the raw data that goes into the breakdown object (number of stars, contributor count etc). That way the breakdown can be stored in the database without the calculator needing to load data on demand, and we can show historic changes for each element.
Hello,
I have a question about sourcerank: what is the point of using GitHub stars?
I don't really understand what sourcerank is aiming to do: half of its elements are about the project's health and how close it is to standards (basic info, readme, license...), whereas the other half is about the project's popularity (contributors, stars and dependents).
If sourcerank aims to show how healthy a package is, the "popularity" information would be useless, and the health requirements should be even stricter. On the other hand, if the goal is to compare it with its forks, clones or similar projects, then the impact of the "popularity" information should be greater.
For instance, imagine two similar projects doing the same thing, both perfectly well configured. They will mostly have the very same sourcerank. Even if one has about 3000 stars and the other only 400, they will have the very same "github stars" impact (log(400) = 2.6, rounded to 3; log(3000) = 3.4, rounded to 3). Maybe the use of logarithms is too strong.
Some projects are one-man projects, updated once in a while, and the score does not reflect the project's activity. GitHub's pulse system is super efficient: maybe sourcerank could also be based on the activity of the project (number of issues/PRs/releases per week/month/year/whatever), what GitHub calls "code frequency". Maybe it could also register the number of forks, or whether the project has a readthedocs page...
Is it planned for the 2.0 version to fix those issues? I've seen great improvements above, and was also wondering whether the way this data is calculated has drastically changed or not.
I really enjoy libraries.io, and thank you all for your amazing work!
Regards,
One more problem (I think it has already been posted here) is that many projects have a 0 SemVer score because they didn't follow SemVer in their early releases. How do you plan to fix that?
Also, there could be a problem with outdated dependencies. Maybe you could only count this if more than one dependency is outdated, or weight it depending on the release type (major, minor, patch).
There are also problems with the "not brand new" check: sometimes its score is 0 because the project has changed its name or repository.
When do you plan to release SourceRank 2.0 to main website?
SourceRank 2.0
Below are my thoughts on the next big set of changes to "SourceRank", the metric that Libraries.io calculates for each project to produce a number that can be used for sorting lists and weighting search results, as well as for encouraging good practices that improve the quality and discoverability of open source projects.
Goals:
History:
SourceRank was inspired by Google PageRank as a better alternative score to GitHub stars.
The main element of the score is the number of open source software projects that depend upon a package.
If a lot of projects depend upon a package, that implies some other things about that package:
Problems with 1.0:
Sourcerank 1.0 doesn't have a ceiling on the score; the project with the highest score is mocha, with a sourcerank of 32. When a user is shown an arbitrary number it's very difficult to know whether that number is good or bad. Ideally the number should either be out of a total, i.e. 7/10, a percentage like 82%, or a grade of some kind like B+.
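A capped 0-100 score maps naturally onto either presentation; a toy illustration (the grade boundaries below are invented for the example, not a proposal):

```ruby
# Toy illustration: turning a capped 0-100 score into a letter grade.
# The boundaries are made up for the example.
def grade(score)
  case score
  when 90..100 then "A"
  when 80...90 then "B"
  when 70...80 then "C"
  when 60...70 then "D"
  else "F"
  end
end

grade(82) # => "B" (or shown as "82%" / "82/100")
```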
Some of the elements of sourcerank cannot be fixed because they judge actions from years ago at the same level as recent actions. For example, "Follows SemVer?" will punish a project for having an invalid semver number from many years ago, even if the project has followed semver perfectly for the past couple of years. Recent behaviour should have more impact than past behaviour.
If a project solves a very specific niche problem, or if a project tends to be used within closed source applications a lot more than in open source projects (LDAP connectors, payment gateway clients etc), then its usage data within open source will be small and sourcerank 1.0 will rank it lower.
Related to small usage factors: a project within a smaller ecosystem will currently get a low score even if it is the most used project within that whole ecosystem when compared to a much larger ecosystem; elm vs javascript, for example.
Popularity should be based on the ecosystem which the package exists within rather than within the whole Libraries.io universe.
The quality of, or issues with, a project's dependencies are not taken into account. If adding a package brings with it some bad quality dependencies, that should affect the score; similarly, the total number of direct and transitive dependencies should be taken into account.
Projects that host their development repository on GitHub currently get a much better score than ones hosted on GitLab, Bitbucket or elsewhere. Whilst being able to see and contribute to a project's development is important, where that happens should not influence the score based on the ease of access to that data.
Sourcerank is currently only calculated at the project level, but in practice the sourcerank varies by version of a project as well, for a number of quality factors.
Things we want to avoid:
- Past performance is not indicative of future results: when dealing with volunteers and open source projects whose license says "THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND", we shouldn't reward extra unpaid work
- Metrics that can easily be gamed by doing things that are bad for the community, i.e. hammering the download endpoint of a package manager to boost the numbers
- Relying too heavily on metrics that can't be collected for some package managers; download counts and dependent repository counts are two examples
Proposed changes:
Potential 2.0 Factors/Groupings:
- Usage/Popularity
- Quality
- Community/maintenance
Reference Links:
- SourceRank 1.0 docs: https://docs.libraries.io/overview#sourcerank
- SourceRank 1.0 implementation:
- Metrics repo: https://github.com/librariesio/metrics
- Npms score breakdown: https://api.npms.io/v2/package/redis