rubygems / gemstash

A RubyGems.org cache and private gem server
MIT License

Task to actively pre-cache all of rubygems.org #18

Open indirect opened 9 years ago

indirect commented 9 years ago

If you're going to, say, railscamp, or mainland China, you might want to grab a copy of all the gems, or at least the newest version of each gem. A task to make this a single command would be swell.

smellsblue commented 9 years ago

Maybe something like gemstash preload? Or maybe even a way to preload specific gems, like gemstash preload rails?

How should all versus latest work? Maybe gemstash preload --latest?

What if you do latest and some of them have dependencies on older versions? Should it do its best to ensure all gems have their dependencies preloaded?

pcarranza commented 9 years ago

Where to stop?

Pull all the latest gems and their dependencies, then cut off at the first degree of the graph?

That would still be a lot, but not all the gems and versions in the world.

indirect commented 9 years ago

No idea :) Maybe just everything? I know that's what gem mirrors destined for inside the Great Firewall and railscamp servers do.

pcarranza commented 9 years ago

What about a fetch-all in addition to the fetch-latest?

One for the Great Wall, the other one a bit smarter.

smellsblue commented 9 years ago

@indirect keep in mind certain requests will still be directed towards rubygems.org, such as dependency lookups. Those cache for 30 minutes and then get re-fetched from rubygems.org, and the cache lives in memory (or optionally a memcached server), so preloading won't guarantee no web requests (at least as it stands now).

Do we need to have a mode for gemstash that deals with dependencies in a different manner, or would that work?
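For context, the cache backend mentioned above is a config-file choice; a minimal sketch of the memcached variant (the key names follow the gemstash config file, the server address is an assumption):

# ~/.gemstash/config.yml
:cache_type: memcached
:memcached_servers: localhost:11211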

indirect commented 9 years ago

if gemstash caches the full index files (specs.4.8.gz, latest_specs.4.8.gz, and prerelease_specs.4.8.gz), then --full-index mode will work!
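(On the client side that's bundler's existing full-index mode, e.g. running bundle install --full-index, which skips the dependency API and fetches the complete index instead.)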

indirect commented 9 years ago

it might make sense to have a flag that explicitly puts gemstash in "no backing server" mode?

pcarranza commented 9 years ago

Offline mode, yes

pcarranza commented 9 years ago

What about this: when going into offline mode, we pull the specs files, cache them, and build the dependency data without an expiration.

When we leave offline mode, we evict all of this cached data.

indirect commented 9 years ago

If you have a local server, the dependency API isn't even needed because you can fetch the full index so quickly. Plus, the new index will cache the index on each client, making the dependency API even less needed. (And it will automatically cache the index as part of the fetching process. New index support is basically Gemstash 1.1 :)

pcarranza commented 9 years ago

Then there is no need for building the db. Just caching the specs files is enough.

And the gems, of course.

pcarranza commented 9 years ago

Ok, I've just pulled the latest specs and unmarshalled them to see what they look like.

It seems quite straightforward to build the gem names with that and pull them all (prefetching).

The only catch is, as @smellsblue said, that some gems may require older dependencies that are not in that list.

So, how about having a "--deep" option which goes through all the gems from this file and resolves their dependencies, so it picks a "consistent" snapshot of all the gems? (This can take a lot of time.)

By default we just pull the latest of everything, and then the user can train gemstash with whatever else he/she wants to have locally resolved. Something like "gemstash prefetch rails", which would also resolve those dependencies and pull all the required gems, building a consistent cache.

Then, when going into offline mode, we fetch those specs files and just do not redirect anymore.

Thoughts on this?

indirect commented 9 years ago

The specs array is the exact same format as the latest specs array. As you have both pointed out, it's basically impossible to know which older gems you'll need.

I think we should trash the idea of "latest" and only provide a script (rake task?) that can prefetch all the gems that exist. We should warn them that it can take hours/days to do this before actually starting, and print progress out as it runs (there's a pretty okay UI for this in the bundler/new_index repo, which downloads every single gemspec using 50 threads).

At the same time as we ship the prefetch script, we should probably also ship a "backend gone" mode that doesn't even try to talk to rubygems, just serves out of the already existing cache.

smellsblue commented 9 years ago

Would "backend gone" mode require --full-index, or are we going to try to serve dependencies from the locally cached gems?

Also, I have an idea for prefetching specific gems... we could create a Gemfile that just points to the running gemstash and then lists all the desired gems, then bundle in the background.
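A sketch of what that generated Gemfile might look like, assuming gemstash listens on its default port 9292 (the gem list is illustrative):

# Hypothetical generated Gemfile: every listed gem resolves through the local
# gemstash, so running `bundle install` in the background warms its cache.
source "http://localhost:9292"

gem "rails"
gem "rspec"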

indirect commented 9 years ago

A prefetch endpoint that accepts a Gemfile.lock as input sounds great.

We'll be able to use the new index to offer backend missing mode without having to resort to full index. :)
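A minimal sketch of that endpoint's core, leaning on bundler's lockfile parser (the method name and the surrounding endpoint are hypothetical):

require "bundler"

# Turn an uploaded Gemfile.lock into the exact .gem files to prefetch.
def gems_to_prefetch(lockfile_contents)
  Bundler::LockfileParser.new(lockfile_contents).specs.map do |spec|
    "#{spec.name}-#{spec.version}.gem"
  end
end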

pcarranza commented 9 years ago

A good old brute-force attack isn't so bad either:

#!/usr/bin/env ruby

require "faraday"
require "faraday_middleware"
require "zlib"
require "stringio"
require "thread"

def create_connection
  Faraday.new "https://www.rubygems.org/" do |c|
    c.use FaradayMiddleware::FollowRedirects
    c.adapter :net_http
  end
end

def get_specs
  con = create_connection
  p "Downloading specs..."
  req = con.get "/specs.4.8.gz"
  p "Inflating specs..."
  # specs.4.8.gz is a gzipped, marshalled array of [name, Gem::Version, platform] tuples
  reader = Zlib::GzipReader.new(StringIO.new(req.body.to_s))
  Marshal.load(reader.read)
end

def build_work_queue(gems)
  p "Processing gems..."
  work_q = Queue.new
  gems.take(1000).each do |gem|  # sample the first 1000 entries to time a run
    (name, version, _platform) = gem
    work_q.push("#{name}-#{version}")
  end
  work_q
end

def download_gems(work_q)
  p "Downloading gems"
  semaphore = Mutex.new
  workers = 20.times.map do
    Thread.new do
      con = create_connection
      begin
        # pop(true) is non-blocking and raises ThreadError once the queue is empty
        while gem_name = work_q.pop(true)
          req = con.head("/gems/#{gem_name}.gem")
          semaphore.synchronize do
            p "#{gem_name} - #{req.headers['content-length']} bytes, #{work_q.length} gems left"
          end
        end
      rescue ThreadError
        # queue drained; this worker is done
      end
    end
  end
  workers.each(&:join)
end

download_gems(build_work_queue(get_specs))

./pull_all_the_gems.rb  79.82s user 5.31s system 77% cpu 1:50.05 total

The total number of gems right now is 594,815, which means that under these conditions (the run above did 1,000 HEAD requests in under two minutes) it would take roughly 20 hours, which is not so bad.

A rake task like this, plus an offline mode that pulls those specs files, and we can provide the Great Wall caching mode.

smellsblue commented 9 years ago

Is this purely limited by bandwidth and IO, or would it maybe get a speed boost by going multi-process?

Would this mode be putting stuff in the database and behaving like all these gems are private?

If so, could SQLite handle that many records, or should we require another database?

In this mode, presumably no separate upstreams and no private gems, so all requests would go to this cache of gems?

Should there be a way to then fetch the latest gems?

I think we might hit filesystem limits with that many gems. I've seen default filesystem settings get hit where no more files in a directory can be created... we might need to restructure our storage to avoid this problem.

pcarranza commented 9 years ago

In this case it is limited by latency rather than bandwidth, since I am only doing HEAD requests. I don't think multi-processing will help here, as this is not CPU bound but IO bound.

I don't think this should add anything to the DB; this would be to force having a local copy of all the gems in the world, and their history, from just one upstream. Quite naive and brute-forcy, but it is a way of pulling that data.

Regarding hitting filesystem limits: it depends on the filesystem. ext3, definitely; there the limit is 32767 child nodes of a given inode, so yeah. I had been using a flat layout on XFS successfully with a larger data set, and I would assume that both HFS and ext4 will support it.

Anyway, I agree that if we plan to handle this kind of dataset it makes sense to handle it better. Two ideas that may make sense: use a trie-like structure, say with the first 3 chars of the gem name (if they are that long), something like /r/a/i/rails-version; and then remove the version from the gem name, something like /.../rails/1.0.0, etc.

Both approaches will distribute the files and probably enable support for this volume of files.

I will do some math later to check what the worst case would be.
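A minimal sketch of the trie-like layout idea, with the depth as a parameter (names are illustrative; short names and odd leading characters are dealt with later in the thread):

# Bucket a gem file under the first characters of its name, e.g. at depth 3:
#   rails-4.2.0.gem  ->  r/a/i/rails-4.2.0.gem
def trie_path(root, gem_file, depth: 3)
  name = File.basename(gem_file, ".gem")
  File.join(root, *name.chars.first(depth), gem_file)
end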

pcarranza commented 9 years ago

It's not so terrible: the gems with the most versions top out at around 630 distinct values

["caboose-cms", 627],
 ["arvados-cli", 580],
 ["gherkin", 533],
 ["arvados", 512],
 ["dev", 511],
 ["heroku", 426],
 ["haml-edge", 330],
 ["rbbt-util", 320],
 ["inline_forms", 312],
 ["base2_cms", 309],
 ["flights_gui_tests", 305],
 ["vmail", 281],
 ["gds-api-adapters", 273],
 ["dynarex", 272],
 ["specinfra", 272],
 ["alpha_omega", 270],
 ["pry", 269],
 ["smock", 248],
 ["serverspec", 238],
 ["rvm", 237],
 ["fog", 232],
 ["picky-client", 231],
 ["picky", 230],
 ["rexle", 229],
 ["RubyApp", 226],
 ["sprout", 219],
 ["rails_apps_composer", 217],

I think we can safely assume that it will take a while until we exhaust an ext3 filesystem with different versions of just one gem. There are 107,228 different gems; if we build the trie and store the different versions of each gem in one folder, then it may make sense.

I'll keep digging.

pcarranza commented 9 years ago

A trie with a depth of 2 leaves us with a max of approx 14,500 different versions per bucket (the bucket path being taken from the gem name itself), so not a problem at all

["re", 14545],
 ["ra", 14171],
 ["co", 13858],
 ["ca", 11874],
 ["ru", 11861],
 ["ac", 10660],
 ["mo", 10129],
 ["si", 8968],
 ["ma", 8934],
 ["sp", 8048],
 ["bo", 7635],
 ["pa", 7596],
 ["se", 7553],
 ["st", 7004],
 ["de", 6859],
 ["fl", 6184],
 ["mi", 6172],
 ["lo", 6124],
 ["pr", 5832],
 ["me", 5822],
 ["li", 5512],
 ["ha", 5375],
 ["ch", 5128],
 ["ba", 5113],
 ["fa", 4881],
 ["tr", 4776],
 ["te", 4771],
 ["ge", 4713],
 ["gi", 4636],
 ["sa", 4420],

Thoughts on this?
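Presumably something like this produced the tallies above (a sketch, reusing get_specs from the earlier script):

# Count gem versions per 2-character name prefix and print the biggest buckets.
counts = Hash.new(0)
get_specs.each { |name, _version, _platform| counts[name[0, 2].downcase] += 1 }
p counts.sort_by { |_prefix, count| -count }.first(30)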

smellsblue commented 9 years ago

:+1: I think 3-level or 2-level depth would be fine, maybe 3-level just for the added protection, or is that unnecessary? If we do 3-level depth but keep the version number in the directory name, would that make sense? That way, we could just change the storage class rather than needing to pass in the version separately from the gem name.
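For illustration, a 3-level depth that keeps the version in the directory name might look like this (hypothetical layout):

rails-4.2.0.gem  ->  r/a/i/rails-4.2.0/rails-4.2.0.gem
pry-0.10.3.gem   ->  p/r/y/pry-0.10.3/pry-0.10.3.gem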

pcarranza commented 9 years ago

I'll do the math, but I think it will.

pcarranza commented 9 years ago

More data

Processing gems using 20 workers
100000/100000
Total bytes to be downloaded: 24.04 GiB
./pull_all_the_gems.rb  842,07s user 95,62s system 7% cpu 3:21:46,46 total

So yeah, roughly a day of downloading.

Version counts with a 3-char prefix:

"act", 9541],
 ["rub", 9222],
 ["rai", 5129],
 ["cap", 4909],
 ["mon", 4613],
 ["con", 4578],
 ["sim", 4157],
 ["spr", 3820],
 ["rac", 3771],
 ["res", 3621],
 ["bos", 3494],
 ["git", 3467],
 ["log", 3281],
 ["pro", 2862],
 ["sta", 2776],
 ["flu", 2763],
 ["com", 2681],
 ["red", 2638],
 ["dev", 2633],
 ["vag", 2622],
 ["omn", 2622],
 ["aut", 2589],
 ["ope", 2488],
 ["tra", 2433],
 ["for", 2368],
 ["che", 2227],
 ["min", 2165],
 ["has", 2142],
 ["boo", 2083]

pcarranza commented 9 years ago

I don't even think we need to change the storage, we just need to do something like

storage.for(first_char).for(second_char).for(third_char).resource(name)

smellsblue commented 9 years ago

Fair enough, we could use it as is; however, to me, pushing that onto the user of the storage class means that every user of the class now has to be in sync and use the same scheme, and also that every user has to worry about whether they might run into filesystem limits, rather than it being solved in one place, within the storage class :-)

This might mean use cases that are unlikely to hit the problem would still get the protection (private gems probably wouldn't hit the limit, but you never know, there might be a user or two that would run into this problem), but I don't really see a harm in that.
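In code terms, the contrast is roughly this (hypothetical names):

# caller-visible scheme: every caller must repeat the bucketing
storage.for(first_char).for(second_char).for(third_char).resource(name)

# encapsulated scheme: solved once, inside the storage class
storage.resource(name)  # applies the fan-out internally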

pcarranza commented 9 years ago

Does that mean that there is a GemStorage concept lurking there, or that you think everything in the storage should be handled as a trie-like structure?

I have the feeling that gems have specific requirements for storing them, basically because there will be cases where the gem name will be --1.gem (which happens to be at the beginning of the specs file; who would have thought that you could name your gem "dash char"?), and there is also another one called "underscore char".

In a nutshell, I think that storing the gems is a concern in itself, and we probably need to handle it separately from the general storage case.

That said, I still don't have enough information to make a decision about the implementation details, so I just don't know if all this complexity should live in the "resource" method, or if it should be extracted to some object that knows how to handle this.

Take a look at these gem names:

_-1.2.gem
--1.gem
0mq-0.5.3.gem
0xffffff-0.1.0.gem
10to1-crack-0.1.3.gem
1234567890_-1.1.gem
12_hour_time-0.0.4.gem
16watts-fluently-0.3.1.gem
189seg-0.0.1.gem
1_as_identity_function-1.0.1.gem
1pass-0.1.2.gem
21-day-challenge-countdown-0.1.2.gem

It's all full of corner cases, like using "-" as a folder name.
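One illustrative way to keep such names out of directory paths is to hex-escape non-alphanumeric bucket characters (a sketch, not what #26 actually does):

# "--1.gem" would otherwise produce "-" as a folder name; escape it instead:
#   --1.gem  ->  2d/2d/--1.gem
def bucket_chars(name, depth: 2)
  name.chars.first(depth).map do |c|
    c =~ /[a-z0-9]/i ? c.downcase : format("%02x", c.ord)
  end
end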

smellsblue commented 9 years ago

I opened #26 to try it out... it seems like a pretty minimal footprint to make the Storage class protect against it in the general case... feel free to share thoughts here or in that PR :-)

smellsblue commented 9 years ago

"-" doesn't seem to be a real problem, at least on my Linux laptop... I created the directory fine (though trying to change into it is a bit harder)

pcarranza commented 9 years ago

Yeah, the "-" and the "." directories will be interesting cases. Particularly the latter.