rubygems / rubygems.org


Gem author can see statistics about which ruby versions people use #1439

Closed. jaredbeck closed this issue 2 years ago.

jaredbeck commented 7 years ago

As a gem author, I want to see statistics about which ruby versions people are using, so I can make informed decisions about which rubies to support.

As the author of a gem that depends on rails, I want to see statistics about which rails versions people use, so I can make informed decisions about which rails versions to support.


This would be huge for the ruby community. Compatibility is an important goal of many gems.

I can think of a number of possible UIs for this, but first:

  1. Do we have the data? https://github.com/rubygems/rubygems.org/issues/1335 seems to indicate that we do.
  2. Would crunching the numbers be too great a burden for the infrastructure?

I'd be happy to do the legwork on this if someone with the commit bit is willing to sponsor it.

dwradcliffe commented 7 years ago

Do we have the data?

Currently no. There is some work started to send stats from the bundler client back to RubyGems.org but I don't think it's made it very far.

Would crunching the numbers be too great a burden for the infrastructure?

We have a few 3rd party services we use for stats, but nothing that would work well to expose on the website. At the moment we don't have the resources to store and process this data. I'm totally open to ideas and suggestions, although managing another data store seems unlikely right now.

dwradcliffe commented 7 years ago

Also, I think the idea is great and I agree it would be valuable to provide that information.

jaredbeck commented 7 years ago

Also, I think the idea is great and I agree it would be valuable to provide that information.

Great! Thanks for the quick response, David.

There is some work started to send stats from the bundler client back to RubyGems.org but I don't think it's made it very far.

Why the bundler client? That information isn't available server-side?

Would crunching the numbers be too great a burden for the infrastructure?

At the moment we don't have the resources to store and process this data.

I can read up on the infrastructure, assuming the easy to find resources like https://github.com/rubygems/rubygems-infrastructure/wiki/Architecture-Overview are up to date.

My naïve thought was that the final destination for the calculated statistics would be whatever data store holds gem profile data, like author names, links to repo, etc. Am I on the right track?

dwradcliffe commented 7 years ago

That information isn't available server-side?

With the CDN/caching layers we have in place (Fastly), our servers only see a fraction of the requests. So we can't use that data. We can send logs from Fastly for every request to something to process and store them, but that's a lot of logs.

As for storage, it's simply too much data to store in the postgresql database. We used redis to store similar data before, but it got to be too much data for redis to handle.

jaredbeck commented 7 years ago

So, if I understand correctly, we'll have to parse the logs from Fastly, aggregate that data, write the aggregated statistics to the postgres db, and display those stats on the gem profile. Is that correct?

We can send logs from Fastly for every request to something to process and store them, but that's a lot of logs.

Yeah, I'm sure. Any chance you can quantify that? Having a rough idea of the size of the raw logs, say, daily, will help with our planning.

Just to get the discussion going, let me throw out a few obvious architectures:

  1. Parse logs and populate a columnar database, like Redshift on AWS. This has the advantage that statistics can be recalculated at will, but I assume AWS would have to donate it, as it normally costs hundreds of dollars per terabyte per year.
  2. Increment counts in redis (you've already said this is not feasible)
  3. Increment counts directly in postgres (obviously not feasible, the write-load would be far too great)

Regarding the final aggregated statistics, let's see if we're on the same page; I'm picturing something like the following for each gem.

{
  ruby: {
    :collection_began_at => '2017-01-01',
    :total_downloads_since_collection_began => 3_000,
    '2.3.1' => 1_000,
    '2.2.5' => 1_500,
    '2.1.10' => 500,
    'jruby-1.7.25' => 0,
    'jruby-9.0.5.0' => 0,
    'rbx-3.28' => 0
  },
  rails: {
    :collection_began_at => '2017-03-01',
    :total_downloads_since_collection_began => 1_000,
    '5.0' => 500,
    '4.2' => 300,
    '4.1' => 200
  }
  # possibly other gems in the future
}

This would only add a few kilobytes at most to each gem record in the postgres database. Am I on the right track here? Were you picturing something similar?
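
For concreteness, here's a minimal sketch of how such a cached aggregate could live on the gem record, assuming a Rails-style jsonb column; the column name download_stats and the example gem are hypothetical, not an agreed design:

# Hypothetical migration: cache aggregated download stats on each gem record.
class AddDownloadStatsToRubygems < ActiveRecord::Migration[5.0]
  def change
    add_column :rubygems, :download_stats, :jsonb, default: {}, null: false
  end
end

# A periodic background job (not the web request path) would then write the
# aggregate computed from the logs:
Rubygem.find_by!(name: 'paper_trail')
       .update!(download_stats: { 'ruby' => { '2.3.1' => 1_000, '2.2.5' => 1_500 } })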

ghost commented 7 years ago

I like the idea behind it, so +1. Interestingly enough, at least to me, I never thought about this. I mostly just use the latest ruby and sorta assume that everyone else uses it too. :D

Btw, in the event that it adds too much data, how about an approximation? Something like, you know, "About 10% use ruby 1.8.x for this gem", or just some major-version data.

I mean, the more accurate data structure would be like: '5.0' => 500, '4.2' => 300, '4.1' => 200

But actually, would not 4.x and 5.x be sorta enough? Better than what we currently have.

So: '5.x' => 500, '4.x' => 500

Could be even simpler, to keep track in increments of 1000 :D '5.x' => 5, '4.x' => 2

5000 and 2000! Minimal storage for maximum output!
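
For illustration, a minimal sketch of that bucketing idea, collapsing exact versions into major-version buckets (the input counts are made up):

# Collapse per-minor-version counts into coarser major-version buckets.
exact_counts = { '5.0' => 500, '4.2' => 300, '4.1' => 200 }

bucketed = exact_counts.each_with_object(Hash.new(0)) do |(version, count), acc|
  acc["#{version.split('.').first}.x"] += count
end

bucketed # => { "5.x" => 500, "4.x" => 500 }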

jaredbeck commented 7 years ago

The final calculated statistics would only be a few kilobytes in each gem record in the postgres database, which is not a concern. When David says "it's simply too much data to store" I believe he means that tracking every download event (about 1 billion events per year?) in postgres or redis would be too much.

We can send logs from Fastly for every request to something to process and store them ..

  1. How would these logs be batched? Daily, hourly? Hopefully not one HTTP request at a time :)
  2. Would the "download event" that we extract from these logs represent a single gem install, or a bundle install?

.. would not 4.x and 5.x be sorta enough ..

Unfortunately, rails does not follow semver, and makes breaking changes in minor versions, so gem authors will want to know the minor version, 4.2, 5.0, etc.

indirect commented 7 years ago

@jaredbeck I would also dearly love to see stats like downloads per day/month/year. I think this would ultimately be better solved by a specialized event store that we can query for the actual data, rather than calculating and storing a tiny subset of it in postgres along with the actual critical gem data. Two examples of existing systems for this include npm statistics via map reduce and cocoapods statistics via Redshift. Since we already have a complete map reduce pipeline set up using Amazon SQS and Shoryuken, we may be able to make everything work by simply adding a little bit to that existing pipeline (although I don't know if Fastly supports logging arbitrary data sent in request headers, which is how Bundler currently reports the Ruby version).
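
To give a sense of what "adding a little bit to that existing pipeline" might look like, here is a minimal Shoryuken worker sketch; the queue name, message shape, and helper method are assumptions, not the real configuration:

require 'shoryuken'

# Hypothetical worker consuming SQS notifications about new Fastly log batches.
class FastlyLogStatsWorker
  include Shoryuken::Worker
  shoryuken_options queue: 'fastly-log-stats', auto_delete: true, body_parser: :json

  def perform(_sqs_msg, body)
    # The message is assumed to point at a batch of log lines stored in S3.
    process_log_batch(body['bucket'], body['key'])
  end

  private

  def process_log_batch(bucket, key)
    # Placeholder: fetch the file, parse each request line, and increment
    # per-gem, per-version counters in whatever store gets chosen.
  end
end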

jaredbeck commented 7 years ago

Yay, with David and André both on board, I'm sure we can do this! #dreamteam

Since we already have a complete map reduce pipeline set up using Amazon SQS and Shoryuken, we may be able to make everything work by simply adding a little bit to that existing pipeline ..

Oh, that's great! I'm reading app/jobs/fastly_log_processor.rb now, which I assume is the shoryuken worker.

Speculate7 commented 7 years ago

Hey @jaredbeck are you still working on this?

jaredbeck commented 7 years ago

Hey @jaredbeck are you still working on this?

Yes. We are still discussing the architecture. There are (at least) two outstanding questions.

  1. Log Processing - Can we use the existing log processing pipeline as André describes?
    1. Bundler currently reports the ruby version in a request header. Can it also report the rails version? If so, will that pass through from Fastly?
  2. Storage - Where do we store the data we extract from the logs? I've recommended that the data first go into a column store like Redshift, which will perform aggregate queries well.
    1. Can we get AWS to donate Redshift for this? If so, is the rubygems ops team willing to maintain it?
    2. Should we cache the results of such aggregate queries in the gem profile in postgres?

Speculate7 commented 7 years ago

Yes. We are still discussing the architecture. There are (at least) two outstanding questions.

I was working on a small part of this issue #1335 during the summer as part of the Rails Girls Summer of Code Fellowship. Long story short, I was not able to finish my portion during the fellowship, and I'm hoping to finish it up.

  1. Log Processing - Can we use the existing log processing pipeline as André describes? Bundler currently reports the ruby version in a request header. Can it also report the rails version? If so, will that pass through from Fastly?

Bundler could potentially report the rails version, since rails is a gem. The code that I wrote is basically a small program that captures the version and system information during the bundle install process (before the request goes to Fastly) and sends that data to the RubyGems.org server to be represented in Librato. I'm excited to hear more about your thoughts on this.

  1. Storage - Where do we store the data we extract from the logs? I've recommended that the data first go into a column store like Redshift, which will perform aggregate queries well. Can we get AWS to donate Redshift for this? If so, is the rubygems ops team willing to maintain it? Should we cache the results of such aggregate queries in the gem profile in postgres?

Librato may or may not address some of those requirements; however, this is a solution I would love to learn how to develop and implement.

I am still an early-stage programmer. If you're interested, I hope we can work together and see this project through to completion. The project files I was working on will be here; however, in an effort to follow Bundler PR conventions, I must reset my development environment and make sure all tests pass. I have my code on my local device and will push it to my repo in a few days.

Speculate7 commented 7 years ago

@jaredbeck hey, I created a class called gem_analytics_reporter. Can you give me some feedback on it?

jaredbeck commented 7 years ago

Hi Ore, thanks for the contribution, but I think we're still waiting to hear from David and André about the architecture, and I don't think we're ready to write code yet.

indirect commented 7 years ago

@jaredbeck hey there, sorry it took so long to get back to you about this. David and I talked a bit at RubyConf, and I think we have decided at least enough to move forward.

There are a few separate things going on in this ticket, and I'd like to call them out and address them one at a time:

1. Metrics reported by existing Bundler versions: Bundler adds a USER_AGENT header to every HTTP request that includes Bundler, RubyGems, and Ruby versions, as well as various other data. In the past, we used the bundler/bundler-api Sinatra app to parse this data and send it to Librato, and then used that data to create the Bundler API dashboard. Since we retired the bundler-api Sinatra app, we have nothing collecting those metrics today. We want to extend the existing Fastly log processing pipeline to parse out this info so we have access to it again.

2. New Bundler metrics: At some point, we would like to start sending metrics data directly from Bundler to the server, skipping the hacky USER_AGENT workaround that we used previously. It’s not great that we’re sending the same information over and over in dozens or hundreds of requests, rather than just sending it one time. This is the project that @Speculate7 was working on over the summer, and that is still ongoing.

3. Gem version download counts: As mentioned in the original post for this ticket, it would be great to have download numbers for individual gem versions. It's possible to capture these numbers from the existing log parsing pipeline. We don’t have a data store nailed down yet, but Redshift and Librato are the best candidates. However, version download counts are not the same as usage numbers (see next point).

4. Gem usage numbers: In addition to download numbers, we would like to have stats on how many separate projects use a gem, counted by version. IIRC our last idea about how to do this is to take a SHA256 hash of the git remote URL, and use that to index gem usage for that single project. IMO, knowing that an older version of your gem is in use by a lot of projects is much more useful than knowing that it was downloaded a lot of times. These are (I think) the numbers that would be most useful to the community at large, even more than raw download counts.
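
A minimal sketch of the hashing in point 4, assuming the raw remote URL is hashed as-is (any normalization of URL variants would still need to be worked out):

require 'digest'

# Derive an anonymous, stable project key from the git remote URL.
def project_key(remote_url)
  Digest::SHA256.hexdigest(remote_url)
end

project_key('git@github.com:example/app.git')
# => a 64-character hex digest, identical for every machine that clones this remote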

Now let me try to answer your questions about processing and storing the numbers:

Log Processing - Can we use the existing log processing pipeline as André describes?

The current log pipeline counts downloads. I think we can extend it to count downloads per version pretty easily, as long as we have somewhere to send the data. We will also need to extend the existing log processor to get information from older versions of Bundler.
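
As a rough illustration of that extension, the log processor could pull versions out of Bundler's User-Agent string, which looks roughly like bundler/1.13.6 rubygems/2.6.8 ruby/2.3.1 (x86_64-darwin) command/install; the exact format varies across Bundler versions, so treat this regex as illustrative:

UA_PATTERN = %r{bundler/(?<bundler>\S+) rubygems/(?<rubygems>\S+) ruby/(?<ruby>\S+)}

def versions_from_user_agent(user_agent)
  match = UA_PATTERN.match(user_agent)
  return nil unless match
  { bundler: match[:bundler], rubygems: match[:rubygems], ruby: match[:ruby] }
end

versions_from_user_agent('bundler/1.13.6 rubygems/2.6.8 ruby/2.3.1 (x86_64-darwin)')
# => { bundler: "1.13.6", rubygems: "2.6.8", ruby: "2.3.1" }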

Bundler currently reports the ruby version in a request header. Can it also report the rails version? If so, will that pass through from Fastly?

No, Bundler can't also report the Rails version (or any other gem version) in an HTTP request header. That’s not a good place to put data.

Storage - Where do we store the data we extract from the logs? I've recommended that the data first go into a column store like Redshift, which will perform aggregate queries well.

I haven't used Redshift that much, but that seems like a reasonable option. The other option that David and I were looking at is Librato, which has already provided us with a donated account (and since they are a metric service/store, we can easily query for aggregates from them as well).

Can we get AWS to donate Redshift for this? If so, is the rubygems ops team willing to maintain it?

We already pay for our AWS account, and I think we can continue to do that even if we start using Redshift as well.

Should we cache the results of such aggregate queries in the gem profile in postgres?

Yes, that seems like a good idea, and it's probably required if we're going to display these numbers anywhere on RubyGems.org.

@Speculate7 This ticket is probably not a good place to get feedback for your changes, since they are on Bundler—this ticket is about work on RubyGems.org. Please start by getting code review from your RGSoC coaches. After that, you can open a PR against Bundler and get feedback from the Bundler team. 👍

jaredbeck commented 7 years ago

David and I talked a bit at RubyConf, and I think we have decided at least enough to move forward.

That's great news. I wish I'd been there. Some day.

.. knowing that an older version of your gem is in use by a lot of projects is much more useful than knowing that it was downloaded a lot of times.

Absolutely, that would be ideal, and more than I had hoped for!

.. our last idea about how to do this is to take a SHA256 hash of the git remote URL ..

That's a clever way to anonymously identify projects. 97% of ruby projects use git, according to the Rails Hosting Survey 2016, so I think it's reasonable.

Speculate7 commented 7 years ago

@indirect as always, thank you for the continued advice and support.

indirect commented 7 years ago

Rereading my comment, I realized I left out the (maybe) most important part of my point about calculating gem usage: There's no way for us to capture usage numbers by processing the Fastly logs. However, it should be pretty straightforward to extend the new Bundler metrics system (once it exists) to also send a hash of the git url and gem names and versions. That's the core reason that all of these projects are intertwined. :)
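
To make that concrete, a hypothetical shape for such a metrics report, sent once per bundle install rather than repeated in every request's User-Agent; every field name here is invented, since the system did not exist yet:

require 'digest'
require 'json'

payload = {
  project_key:  Digest::SHA256.hexdigest('git@github.com:example/app.git'),
  ruby_version: RUBY_VERSION,
  gems: { 'rails' => '5.0.1', 'nokogiri' => '1.7.0' }
}
JSON.generate(payload) # one small request per install instead of per-request headers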

jaredbeck commented 6 years ago

Closing due to inactivity.

hmistry commented 5 years ago

I was looking for stats on gem downloads by version over time today, to make a more informed decision about backward compatibility for a gem I'm working on. I read about the issues faced and understand them.

I think this thread should remain open as a feature request reminder for some day when there's resources available to support it.

@indirect I agree having gem active-usage data is ideal, but if we can't get that, then wouldn't having the download count by version over time be the next best thing? At a high level, what matters in development is roughly knowing the relative impact of making breaking changes.

If a gem version's downloads are decreasing over time (think slow long tail), then I can assume the apps have either updated to newer gem versions, are running but not being updated with newer gem versions, or are no longer running. In each case, I can conclude it's OK to make a breaking change. The question is the size of the impact. One might use the recent daily/weekly/monthly download count for a gem version, or take a windowed slice and compare it with the total downloads of all versions or of newer versions. It's not perfect nor accurate, but in relative terms the ratio can be helpful, i.e. is it 5% or 40% of apps that will have to make a change someday when they decide to update their gems? For me that is enough.
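
For example, a back-of-the-envelope version of that calculation, with invented download counts for a recent window:

# Share of recent downloads that would break if support for gem versions < 2.3
# were dropped.
recent_downloads = { '2.1' => 40, '2.2' => 60, '2.3' => 350, '2.4' => 550 }

cutoff = Gem::Version.new('2.3')
legacy = recent_downloads.sum { |version, count| Gem::Version.new(version) < cutoff ? count : 0 }
total  = recent_downloads.values.sum

legacy.fdiv(total) # => 0.1, i.e. roughly 10% of recent installs are on old versions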

Is there still interest in having this feature and what are the gating items?

indirect commented 5 years ago

We don’t currently have the devops capacity to administrate a system that tracks downloads over time. That’s what we had before, and we gave up after it took the site down repeatedly.

I think the new Ruby Toolbox efforts to ingest data dumps will give this information at a weekly or monthly level? Either way I would suggest investigating that first.

If the Ruby Toolbox numbers aren’t what you’re looking for, we need a proposal that would add data warehouse type capacity for us to store downloads over time without needing additional ops overhead to run it, and with graceful degradation that does not hurt the main site if it has issues.

Reopened #1439.

hmistry commented 5 years ago

@indirect I see, so it's the same concerns - I completely understand.

I did look at the Ruby Toolbox and bestgems.org charts, but they only show total downloads across all versions over time, and cumulative total downloads by version over time, which doesn't provide the necessary insight.

I'm sure you know what I'm looking for, but I'll add an example of Android version installs to showcase what the OP and I are looking for in gem version stats, for future reference.

From this chart I can see that breaking compatibility with versions <=2.2 is pretty safe, but with versions >=2.3 it is not. The Ruby Toolbox chart breakdowns don't provide this level of insight.

[Screenshot: chart of Android install share broken down by platform version]

Perhaps in the future either we find a way to do this with low overhead, or something changes that allows us to support it... let's keep this issue open.

Akrabut commented 5 years ago

This has been my GSoC project, which is practically complete (aside from some test issues). This is the Bundler metric reporting functionality, and this is the backend API to store the metrics.

Mind you, some data (including gem versions) is not instrumented at the moment due to limitations with Datadog, which is currently used on rubygems.org.