Gem Contents Ingestion Proposal

martinemde commented 1 year ago

RFC for serving gem contents.

indirect commented 1 year ago

This seems like a great starting point. If we are worried about the total S3 storage costs, we could set the CDN expiration very high, set the rate limit very low, use robots.txt to exclude scrapers, and only supply the contents on demand via a lambda or equivalent.

I believe that’s the reason that NPM supplies the files addressed by content hash—you get full deduplication for free, since any file that’s the same between versions has the same content hash and can be reused without taking up any more space. Might be worth considering for our system, or maybe we do that in the backend while Fastly caches the content under a URL that matches the filename?
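
To make the deduplication point concrete, here is a minimal Ruby sketch of content-addressed storage. It is only an illustration, not the RFC's design: `ContentStore` is a hypothetical in-memory stand-in for a real blob store such as S3, and the `sha256-<hex>` key format is an assumption.

```ruby
require "digest"

# Hypothetical in-memory stand-in for a blob store (e.g. an S3 bucket).
class ContentStore
  def initialize
    @blobs = {}
  end

  # Stores the bytes under their SHA-256 digest and returns the key.
  # Identical content always maps to the same key, so it is stored only once.
  def put(content)
    key = "sha256-#{Digest::SHA256.hexdigest(content)}"
    @blobs[key] ||= content
    key
  end

  # Returns the bytes stored under the given key.
  def get(key)
    @blobs.fetch(key)
  end

  # Number of distinct blobs currently stored.
  def size
    @blobs.size
  end
end

store = ContentStore.new

# Two made-up versions of a gem: LICENSE is unchanged, lib/rake.rb differs.
v1 = { "lib/rake.rb" => "module Rake; end\n", "LICENSE" => "MIT\n" }
v2 = { "lib/rake.rb" => "module Rake; VERSION = '2'; end\n", "LICENSE" => "MIT\n" }

# Each version keeps only a filename => content-hash index.
index_v1 = v1.transform_values { |body| store.put(body) }
index_v2 = v2.transform_values { |body| store.put(body) }

puts store.size # four files were pushed, but the shared LICENSE is stored once
```

A CDN could still cache under a filename-based URL: the backend would resolve the filename to its hash through the per-version index and fetch the blob by key.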

simi commented 1 year ago

Feel free to ping me once this is ready for comments. I have some initial questions and notes to share.

martinemde commented 1 year ago

I'm replacing the first draft with a more focused second draft that deals only with the ingestion system. I will follow up later with an RFC for the API, and then there will probably be a user interface. This allowed me to make the document more focused and maybe, actually, finish it.

indirect commented 1 year ago

This looks great to me, and I think we could probably bring it out of draft to get broader comments, if you're happy with it.

I'm not 100% sure about a separate database row for every file in every gem. Could we get away with a jsonb field and one row per gem version? Even if we do that, won't that mean many copies of information about repeated files, like the filename and the content hash over and over? Are there other options that would be less repetitive?

I don't actually have a better idea off the top of my head, but I thought the question was worth asking in case someone else did have ideas. 😄

jchestershopify commented 1 year ago

A slightly-science-fiction-right-now alternative would be to offload this task to GUAC. This is the kind of data they're looking to collect and serve. We could either just point folks there in documentation or provide a wrapper that queries GUAC.

indirect commented 1 year ago

Backing all the way up, I'd like to reiterate the extremely high level set of goals we (eventually) want to work towards, built on top of the ingestion system:

- content-addressed files API
- gem contents API
  - uses the content-addressed API to return the file with eg hash abc123 when eg rake-1.0.0/lib/rake.rb is requested
  - is probably the basis for the future code search
- fancy gem version content HTML pages (generated by rails app, cached by fastly)
  - renders a gem version contents as fancy pages a la github’s repo browser
- fancy gem version diff HTML pages (generated by rails app, cached by fastly)
  - renders diffs between gem version contents as fancy pages a la github’s diff viewer
- fancy code search result HTML pages (generated by rails app from sourcegraph results? something else?)
  - uses a search cluster to render results from one version, all versions of a named gem, or all versions of all gems
  - let's see if we can use sourcegraph.com for this
- (maybe?) fancy code browsing, search, and navigation a la GitHub. maybe we just programmatically create repos for every gem and push new commits with the contents of every version. lol.

Moving back from those extremely high level goals to acceptance criteria for just this particular step, I think the access patterns this file and db schema will need to support are:

1. easily read a single file from a gem version (eg, gem contents rack 1.0.0 lib/rack.rb or GET api/gems/rack/1.0.0/files/lib/rack.rb can print the file contents, with just one HTTP request)
2. provide a manifest with file checksums for the entire gem, for possible users or future tools to use, something like GET /api/gems/rack/1.0.0/manifest. The manifest should include (at least) the file name and the content hash with the hashing algorithm included, like sha256-abc123.
3. provide file content when requested by hash, something like GET /api/gems/rack/1.0.0/content/sha256-abc123.

I'm not 100% sure about number 3. I might be able to be talked out of it.

Within those constraints, I think any solution that provides the API above and scales to about one pushed gem every 2-5 seconds is a workable solution.
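
As a rough illustration of the three access patterns listed above, the following Ruby sketch models one ingested gem version in memory. Everything here is an assumption for the sake of the example: `GemVersionContents`, its method names, and the manifest keys (`path`, `checksum`) are hypothetical, and a real implementation would sit behind the HTTP endpoints and a persistent store.

```ruby
require "digest"

# Hypothetical in-memory model of one ingested gem version.
class GemVersionContents
  # files: a Hash of { path => file contents }.
  def initialize(files)
    @by_path = {}     # path => checksum
    @by_checksum = {} # checksum => contents
    files.each do |path, body|
      checksum = "sha256-#{Digest::SHA256.hexdigest(body)}"
      @by_path[path] = checksum
      @by_checksum[checksum] ||= body
    end
  end

  # Pattern 1: read a single file by its path within the gem.
  def file(path)
    @by_checksum.fetch(@by_path.fetch(path))
  end

  # Pattern 2: a manifest listing every path with its algorithm-prefixed hash.
  def manifest
    @by_path.sort.map { |path, checksum| { "path" => path, "checksum" => checksum } }
  end

  # Pattern 3: read file content directly by its hash.
  def content(checksum)
    @by_checksum.fetch(checksum)
  end
end

# Made-up file contents standing in for an unpacked rack 1.0.0.
rack = GemVersionContents.new({
  "lib/rack.rb" => "module Rack; end\n",
  "README.md" => "# Rack\n"
})

manifest = rack.manifest
puts rack.file("lib/rack.rb")
puts manifest.map { |entry| "#{entry['checksum']}  #{entry['path']}" }
```

Pattern 3 falls out of the same data as pattern 1: the path lookup is just one extra hop in front of the hash lookup.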

jchestershopify commented 1 year ago

> Backing all the way up, I'd like to reiterate the extremely high level set of goals we (eventually) want to work towards, built on top of the ingestion system: content-addressed files API

I think I'd like to pop the stack one more time. I get that the idea is to allow folks to retrieve gem contents without downloading and extracting the gem. What I am unclear about is for whom this is make-or-break functionality. Who is going to use this functionality? How many? How essential to their workflow is it in order to justify RubyGems handling the additional complexity and expense? Right now this effort seems to be justified in terms of itself, rather than referring to an external need.

indirect commented 1 year ago

Good point! We can start there. 😄

Problem statement: RubyGems.org hosts many packages that our users are going to install on their machines and then execute. It's important for users to be able to 1) see what code they are about to run, and 2) view diffs between versions when upgrading.

We can't just refer users to GitHub diff links because 1) gems are not required to provide their source in any publicly available repository, and 2) the gem build process can modify source files arbitrarily before tarring up a .gem, so even an official git repo is not the same as what is inside the gem.

The "diff between versions" feature is useful and (imo) necessary for RubyGems.org to provide. Since we definitely want to build that feature, it makes sense to also lay the groundwork so that we can potentially display authoritative contents of packages, for users to view, or for other tools that might be built.

jchestershopify commented 1 year ago

OK, I see the underlying cause - thank you for the clear explanation @indirect! Could we expand the RFC to take this in? It may help a future passer-by to get their bearings.

martinemde commented 1 year ago

@jchestershopify @indirect I updated the ## Motivation section based on the feedback and discussion. Hopefully that captures it better. Thank you for helping us improve this proposal!

simi commented 1 year ago

To be honest, even after reading the Motivation part, as a RubyGems user, I don't really see the value in building this. I have had no need for this in all my "Ruby life". To see changes today, you can use the "Review changes" link on a gem's page (https://rubygems.org/gems/rubygems-update pointing to https://my.diffend.io/gems/rubygems-update/prev/3.4.5). It uses diffend.io, so there is no need to maintain this with the limited resources of `rubygems.org`. To explore a gem's contents, it is super simple to download, unpack, and explore locally with my usual tools, like my favorite editor.

The interesting part could be for maintainers to be able to cross-search across gems and versions. Something similar can be achieved today with gem-codesearch. It is actively used by @hsbt to check on potential breaking changes (like https://github.com/rubygems/rubygems/pull/6311#issuecomment-1404556676). But I'm not sure it is possible to build anything fast enough to respond to an HTTP request in reasonable time for this purpose, given the current amount of gem code. @hsbt would you mind to share some stats from gem-mirror size these days? Looking at the sum of all versions in the production DB, all gems together seem to be around 728 GB today (packed). We can also ask other package managers (like NPM - https://www.npmjs.com/package/@rails/webpacker?activeTab=explore) about their experience and implementation details on this.

I would like to suggest to first decide what's the main reason for this functionality (user vs maintainer).

For the user-facing case, it would be great to start with something simple (if we decide to move this forward). A lazy approach like rubydoc.info's seems nice to me: build a side app that downloads and unpacks gems on request into temporary storage, with a simple code browser (linked from the RubyGems.org gem show page).

Another alternative would be to provide a simple way (a link on the RubyGems page) of onboarding a gem into any of the external online code editors (like GitHub Codespaces, CodePen, CodeSandbox, StackBlitz, ...).

hsbt commented 1 year ago

> @hsbt would you mind to share some stats from gem-mirror size these days?

It's 201 GB.

In my view, it is enough for me to investigate gem contents with gem-codesearch and the contents provided by Dependabot alerts, like https://github.com/ruby/rubyci/pull/362 and https://github.com/puma/puma/compare/v6.0.1...v6.0.2.

> I would like to suggest to first decide what's the main reason for this functionality (user vs maintainer).

👍

IMO, this proposal is better than nothing if we have unlimited network, storage, and CPU. I don't feel like the benefits are worth the cost now.

indirect commented 1 year ago

Thank you for the feedback. 🙏🏻

It is true that we can use Dependabot, GitHub, Diffend, and gem-mirror to read, navigate, and search gems today. Combined, these are a perfectly okay way to check on gems.

One of the tasks I am working on this year is how to improve RubyGems.org for both maintainers and for end-users. There are many items on both lists. One item for maintainers is gem code search. One item for end-users is viewing gem diffs between versions, as well as READMEs, CHANGELOGs, and other files inside a gem version.

I think it is worthwhile to build these tools into RubyGems.org itself. RubyGems is 20 years old this year, even older than git (!), and I want RubyGems to be able to continue for 20 more years. I believe that means RubyGems should be able to function without needing GitHub or Diffend to exist.

Right now we have enough funding (thank you Shopify and Sovereign Tech Fund) that this is affordable, and we can even work on other improvements at the same time. I am happy to discuss prioritizing improvements with all of you, but let's do that somewhere else so this poor RFC can rest. 😄

In summary:

- gem contents and version diffs are mainly for ruby developer "users"
- code search is mainly for gem developers, rubygems/bundler maintainers, and ruby-core "users"
- this RFC can provide a base for gem contents, version diffs, and code search to build on top
- we can build gem contents, version diffs, and code search while also working on other important improvements
- I think we should do this so that we do not depend on any one company for this useful functionality

simi commented 1 year ago

@martinemde regarding our discussion on Slack, would it make sense to cut the RFC down for now to the minimum that reflects only the first goal (phase): being able to get the gem structure and individual file contents?

Next goals could be:

- add a UI browser for gems
- make gems diffable across versions (on CLI and UI)
- make gems searchable (this would probably need different storage that is able to do code indexing)

martinemde commented 1 year ago

I've updated the RFC to capture the current behavior in rubygems/rubygems.org#3454.

indirect commented 1 year ago

Merging this RFC now that the implementation has been merged. 👍🏻

rubygems / rfcs

Gem Contents Ingestion Proposal #44