rubygems / rubygems.org

The Ruby community's gem hosting service.
https://rubygems.org
MIT License

Verify links listed in gem metadata #3398

Closed indirect closed 10 months ago

indirect commented 1 year ago

Verified links would get a cool checkmark next to them on the gem show page sidebar, and maybe one day we can hide unverified links behind a click-through thing to make it clear that the owner of the website has not opted in to being listed on rubygems.org.

Each time a gem is pushed, and periodically after that (once a week, maybe?), we should fetch the HTML for the URL given and check for proof that the link is "verified". What does verified mean, you ask? I'm glad you asked.

We're going to borrow from the link-verification scheme that systems like Mastodon use, and look for links with a rel attribute. Here are some example links that would be valid:

<html>
<head>
  <link rel="rubygem" href="https://rubygems.org/gem/mygem">
</head>
<body>
  <a rel="rubygem" href="https://rubygems.org/gem/mygem">mygem</a>
</body>
</html>

For URLs at github.com only, we would also accept links like:

<a role="link" rel="noopener noreferrer nofollow" href="https://rubygems.org/gem/mygem">

That would allow us to verify repos with the repo settings website value set to the rubygems.org gem URL.
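A minimal sketch of the check described above, assuming the page HTML has already been fetched. The names `verified_link?` and `tag_attributes` are hypothetical and not part of rubygems.org. The regex-based tag scan only keeps the example dependency-free, and a real implementation would more likely use a proper HTML parser. Treating the GitHub case as "role is link and rel contains all three tokens" is also an assumption about how strict that match should be.

```ruby
# Hypothetical helpers, not the rubygems.org implementation.
# Matches opening <link ...> and <a ...> tags and captures their raw attributes.
TAG_PATTERN = /<(link|a)\b([^>]*)>/i
# Matches name="value" or name='value' attribute pairs.
ATTR_PATTERN = /([a-zA-Z-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/
# rel tokens GitHub puts on the repo "website" link, per the example above.
GITHUB_REL_TOKENS = %w[noopener noreferrer nofollow].freeze

# Turns a raw attribute string into a { "name" => "value" } hash.
def tag_attributes(raw)
  raw.scan(ATTR_PATTERN).to_h { |name, dq, sq| [name.downcase, dq || sq] }
end

# True when the HTML contains a link or anchor that points at gem_url and
# either carries rel="rubygem" or, for github.com pages only (github: true),
# looks like the repo settings website link.
def verified_link?(html, gem_url, github: false)
  html.scan(TAG_PATTERN).any? do |_tag, raw_attrs|
    attrs = tag_attributes(raw_attrs)
    next false unless attrs["href"] == gem_url

    rels = attrs.fetch("rel", "").split
    next true if rels.include?("rubygem")

    github && attrs["role"] == "link" && (GITHUB_REL_TOKENS - rels).empty?
  end
end
```

With the example markup above, `verified_link?(html, "https://rubygems.org/gem/mygem")` would be true for either the `link` or the `a` element, while the GitHub-style anchor only counts when `github: true` is passed.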

segiddins commented 1 year ago

I think the a elements would have hrefs rather than srcs?

indirect commented 1 year ago

🤦🏻 you are correct, why did I write src, these are not images.

simi commented 1 year ago

Sorry for the late response; I missed the chance to review this at a high level. :cry: Is there a reason not to make proposals like this an RFC? I scan the RFC repo often for similar cases.

I started by looking at the implementation, and I have a few questions that are not about the implementation itself (which is why I'm sharing them here and not in the implementation PR).

Trying to understand metadata and linkset relation

TL;DR unify gemspec metadata URIs and linksets table, they are out of sync

I tried to understand how links are collected and stored, but I couldn't follow the logic yet (I'll try again later). According to the gemspec specification policy, the following links are supported:

# https://github.com/rubygems/rubygems/blob/2600ec81933c9ea59c5aa63abb051655208f669c/lib/rubygems/specification_policy.rb#L12-L21
  METADATA_LINK_KEYS = %w[
    bug_tracker_uri
    changelog_uri
    documentation_uri
    homepage_uri
    mailing_list_uri
    source_code_uri
    wiki_uri
    funding_uri
  ].freeze # :nodoc:

Looking at the linksets table (used in the initial implementation PR), not all of those URIs are stored there.

https://github.com/rubygems/rubygems.org/blob/661fee4d29929b93fb0c0f34e469361a019662cf/app/models/linkset.rb#L4

https://github.com/rubygems/rubygems.org/blob/661fee4d29929b93fb0c0f34e469361a019662cf/db/schema.rb#L185-L196

The linksets home column is populated from spec.homepage; for the rest, I couldn't find where metadata is stored into the linksets table. It does seem to be out of sync, though: not all metadata URIs are stored in the linkset. It also gets a little confusing because the homepage can be specified in two ways in the specification (spec.homepage and spec.metadata['homepage_uri']).

I did a quick lookup of how the linksets table is populated, and it seems homepage is the only widely adopted URI (since it is currently required); see the stats below for details. This could be partly caused by missing examples in the generated gemspec. It would make sense to add all metadata links to the gem template.

# https://github.com/rubygems/rubygems/blob/2600ec81933c9ea59c5aa63abb051655208f669c/bundler/lib/bundler/templates/newgem/newgem.gemspec.tt#L24-L26
  spec.metadata["homepage_uri"] = spec.homepage
  spec.metadata["source_code_uri"] = "TODO: Put your gem's public repo URL here."
  spec.metadata["changelog_uri"] = "TODO: Put your gem's CHANGELOG.md URL here."

linkset columns adoption stats

columns               | COUNT
----------------------|--------
total_gems            | 188834
gems_with_home        | 167263
gems_with_wiki        | 3437
gems_with_unique_wiki | 3112
gems_with_docs        | 8719
gems_with_unique_docs | 6853
gems_with_mail        | 970
gems_with_unique_mail | 931
gems_with_code        | 17306
gems_with_unique_code | 10497
gems_with_bugs        | 8259
gems_with_unique_bugs | 7979

```sql
SELECT
  COUNT(id) AS total_gems,
  COUNT(id) FILTER (WHERE home IS NOT NULL AND home <> '') AS gems_with_home,
  COUNT(id) FILTER (WHERE wiki IS NOT NULL AND wiki <> '') AS gems_with_wiki,
  COUNT(id) FILTER (WHERE wiki IS NOT NULL AND wiki <> '' AND wiki <> home) AS gems_with_unique_wiki,
  COUNT(id) FILTER (WHERE docs IS NOT NULL AND docs <> '') AS gems_with_docs,
  COUNT(id) FILTER (WHERE docs IS NOT NULL AND docs <> '' AND docs <> home) AS gems_with_unique_docs,
  COUNT(id) FILTER (WHERE mail IS NOT NULL AND mail <> '') AS gems_with_mail,
  COUNT(id) FILTER (WHERE mail IS NOT NULL AND mail <> '' AND mail <> home) AS gems_with_unique_mail,
  COUNT(id) FILTER (WHERE code IS NOT NULL AND code <> '') AS gems_with_code,
  COUNT(id) FILTER (WHERE code IS NOT NULL AND code <> '' AND code <> home) AS gems_with_unique_code,
  COUNT(id) FILTER (WHERE bugs IS NOT NULL AND bugs <> '') AS gems_with_bugs,
  COUNT(id) FILTER (WHERE bugs IS NOT NULL AND bugs <> '' AND bugs <> home) AS gems_with_unique_bugs
FROM linksets;
```

:thinking: I think it would be possible to start by unifying the linkset and the metadata URIs supported by RubyGems. Even the codebase is not sure where to find some of this metadata and tries both metadata and linkset as a fallback.
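To make the mismatch concrete, here is a rough sketch of what a unified lookup could look like. The key-to-column mapping is my assumption, inferred from the column names in the stats query (home, wiki, docs, mail, code, bugs); `LINKSET_COLUMN_FOR`, `present`, and `unified_links` are hypothetical names, not existing rubygems.org code. Note that changelog_uri and funding_uri have no linkset column in this mapping.

```ruby
# Metadata link keys as listed in the specification policy.
METADATA_LINK_KEYS = %w[
  bug_tracker_uri changelog_uri documentation_uri homepage_uri
  mailing_list_uri source_code_uri wiki_uri funding_uri
].freeze

# Assumed mapping from metadata keys to linkset columns (hypothetical).
LINKSET_COLUMN_FOR = {
  "homepage_uri"      => "home",
  "wiki_uri"          => "wiki",
  "documentation_uri" => "docs",
  "mailing_list_uri"  => "mail",
  "source_code_uri"   => "code",
  "bug_tracker_uri"   => "bugs"
}.freeze

# Returns nil for nil or blank strings, otherwise the value itself.
def present(value)
  value.to_s.strip.empty? ? nil : value
end

# Builds one hash keyed by metadata key, preferring gemspec metadata and
# falling back to the matching linkset column when one exists.
def unified_links(metadata, linkset)
  METADATA_LINK_KEYS.each_with_object({}) do |key, links|
    column = LINKSET_COLUMN_FOR[key]
    value = present(metadata[key]) || (column && present(linkset[column]))
    links[key] = value if value
  end
end
```

For example, `unified_links({ "source_code_uri" => "https://github.com/a/b" }, { "home" => "https://example.com" })` would return both a homepage_uri (from the linkset) and a source_code_uri (from metadata).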

Verifiability of links

Considering the stats shared in the previous block, let's focus on homepage only for now (to keep it simple). Most of the top domains are hosted services with no way to change the header value to match the current verification method. Currently it seems only github.com and rubygems.org are going to be supported.

homepage domains stats

count  | domain
-------|---------------------------
129350 | github.com
14415  | (empty)
8745   | rubygems.org
7156   | (NULL)
941    | gitlab.com
615    | bitbucket.org
520    | rubyforge.org
387    | slide.rabbit-shocker.org
373    | elastic.co
230    | devcamp.com
219    | apimatic.io
194    | ww.github.com
188    | rubylearning.org
167    | example.com
152    | aka.ms
135    | code.google.com
132    | rubyworks.github.com
126    | google.com
90     | wiki.github.com
76     | twitter.com
76     | sixarm.com
74     | rubyonrails.org
73     | www1.eafit.edu.co
69     | merbivore.com
69     | thoughtbot.com
66     | pragmaticstudio.com
65     | decko.org
64     | github.org
58     | developer.mastercard.com
58     | integrityapp.com
57     | scrivito.com
54     | spreecommerce.com
53     | docs.diligentsoftware.org
51     | pluginaweek.org

```sql
SELECT COUNT(id),
       substring(home FROM '(?:.*://)?(?:www\.)?([^/?]*)') AS home
FROM linksets
GROUP BY 2
HAVING COUNT(id) > 50
ORDER BY 1 DESC NULLS FIRST;
```
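For anyone who wants to reproduce the grouping outside the database, the same domain extraction can be sketched in Ruby. It reuses the regular expression from the query; `home_domain` and `domain_counts` are hypothetical helper names, and the input is assumed to be a plain array of homepage strings (possibly nil or empty).

```ruby
# Same pattern as the SQL substring(): optional scheme, optional "www.",
# then everything up to the first "/" or "?".
HOME_DOMAIN = %r{(?:.*://)?(?:www\.)?([^/?]*)}

# Returns the domain part of a homepage URL, "" for an empty string,
# and nil for nil (mirroring the (empty) and (NULL) rows).
def home_domain(url)
  return nil if url.nil?

  url[HOME_DOMAIN, 1]
end

# Counts homepages per domain, keeps domains used more than `more_than`
# times, and sorts by count in descending order.
def domain_counts(urls, more_than: 50)
  urls.map { |url| home_domain(url) }
      .tally
      .select { |_domain, count| count > more_than }
      .sort_by { |_domain, count| -count }
      .to_h
end
```

Calling `domain_counts(urls, more_than: 0)` on a small sample is an easy way to sanity-check the pattern before running it against the full table.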

It seems gitlab.com and bitbucket.org are good candidates to include in the initial implementation as well. In the mail column, groups.google.com is the most used service. How would we verify that one? The same question comes up for the docs column, which often uses rubydoc.info.

What does verified mean, you ask?

I'm still not sure how useful the whole verification process is. What does verified mean for a user visiting a rubygems.org page? Does it mean the link is safe to follow? You can simply verify a custom page and serve malicious content there. Doesn't a verification mark create a false impression that the link is safe and verified by rubygems.org?

:thinking:

What about inferring metadata URIs for known services automatically? Let's say source_code_uri points to GitHub; if bug_tracker_uri is not populated, we can check with an API call whether issues are enabled on the GitHub repo and automatically infer the URL. The same works for wiki, funding, and changelog (scan for known changelog files like CHANGELOG.md, HISTORY.md, ...). That way we can really make those links safe. All that's needed is to somehow connect GitHub and RubyGems (using the same method as described in the original description). We could also suggest it to users (maybe with an email after the initial push), with a short explanation of how to link the two together.
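A rough sketch of the inference idea, limited to issues and wiki. It assumes `repo_info` is the parsed JSON of GitHub's repository endpoint (`GET /repos/{owner}/{repo}`), which includes `has_issues` and `has_wiki` flags; the network call itself is left out. `github_repo_slug` and `infer_links` are hypothetical names, and the `/issues` and `/wiki` paths are the usual GitHub URLs for those features.

```ruby
require "uri"

# Extracts "owner/repo" from a github.com source_code_uri, or nil when the
# URI is not a GitHub repository URL.
def github_repo_slug(source_code_uri)
  uri = URI.parse(source_code_uri)
  return nil unless uri.host == "github.com"

  owner, repo = uri.path.split("/").reject(&:empty?)
  return nil unless owner && repo

  "#{owner}/#{repo.delete_suffix('.git')}"
rescue URI::InvalidURIError
  nil
end

# Fills in bug_tracker_uri and wiki_uri from the repository flags when the
# gemspec metadata does not set them. Explicit metadata values always win.
def infer_links(metadata, repo_info)
  slug = github_repo_slug(metadata["source_code_uri"].to_s)
  return metadata unless slug

  base = "https://github.com/#{slug}"
  inferred = {}
  inferred["bug_tracker_uri"] = "#{base}/issues" if repo_info["has_issues"]
  inferred["wiki_uri"] = "#{base}/wiki" if repo_info["has_wiki"]
  inferred.merge(metadata)
end
```

Keeping the explicit gemspec values on top of the inferred ones means a gem author can always override what the service guesses.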

It would be possible to start just with GitHub URLs and add gitlab.com and bitbucket.org integration later. That would cover most of the cases.

Also, what's the proper homepage for a common project with no real custom homepage? Mostly github.com is used (~130k), followed by rubygems.org (~9k). If I understand it correctly, homepage should be rubygems.org and source_code_uri should point to github.com. Maybe we can add this suggestion to the gem specification policy as well (to help with assigning proper values).