Update Sqlite backend to work with latest libgit2

mhodgson commented 10 years ago

I recently updated this backend in order to get my own different custom backend working. Figured I would submit here as an example for others to update the remaining backends.

Also, I create a refdb backend for sqlite, but I needed to import header files from the libgit2 source, not just what is exposed in the /includes directory. Happy to add that here too, but I feel like I'm either doing something wrong, or the interface for the refdb backends needs to be updated to reasonably allow for one to be created with the exposed libgit2 functions (git2.h).

charypar commented 10 years ago

Slightly off topic, sorry. @mhodgson How did you go about building and testing it? I'm interested in doing the same for the redis backend (adding refdb) and I'm not entirely sure how libgit2-backends and libgit2 itself fit together...

mhodgson commented 10 years ago

@charypar well, I mostly followed some of what this article covers: https://deveo.com/blog/2013/06/19/your-git-repository-in-a-database-pluggable-backends-in-libgit2/ He provides the code to actually add the backends to a git_repository.

I ended up doing the testing in ruby (using rugged and a private gem with a c extension). I'm not really comfortable enough with c to figure out how to do testing with it.

Compiling was a little tricky since (like I mentioned above) the necessary functions and structures are not part of the public libgit2 API. I've been thinking about submitting a pull request on libgit2 providing updates to the public API that would allow for custom RefDB backends using only the public API.

charypar commented 10 years ago

@mhodgson Thanks! I see we went the same way. I got it working since, updated the redis odb backend and I plan to add pluggable backends support into rugged when I get redis working (in a different way than it was done before). Currently I just basically hacked redis into my fork of rugged, but that's not a viable solution.

I'm testing through rugged as well. It would be really nice to set up C tests for this repo as well.

If you're interested, check the WIP here https://github.com/redbadger/libgit2-backends/pull/1 and here https://github.com/redbadger/rugged/pull/1 I'm only just starting with the refdb, so the fun is not over yet :)

mhodgson commented 10 years ago

@charypar oh nice! The RefDB is significantly more complex, so buckle up. I completely punted on the reflog part of it since it isn't really needed for my application. Also it seems like for backends like this the typical use case is to host repositories (or just versioned data) in a web environment. Since the reflog is typically used locally, it didn't make much sense to pull that part in. I also didn't tackle supporting packrefs since I'm not sure pack files are really necessary when not using the file system.

Re: Rugged, I'm not sure adding support for this type of thing directly to rugged is the right approach. The approach I took was to create a separate gem that has a function to wrap an existing repo. Since the Rugged::Repository class is really just a git_repository under the hood you can easily just take the Rugged repo as an argument and wrap it up without depending on the Rugged code base at all. So something like this:

static VALUE git_set_redis_backend(VALUE self, VALUE rb_rugged_repo, VALUE conn_details)
{
  int error = GIT_OK;
  git_repository *repo;
  git_odb *odb;
  git_odb_backend *redis_odb_backend;
  git_refdb *refdb;
  git_refdb_backend *redis_refdb_backend;

  // Not Shown: use conn_details to setup redis connection in variable conn

  Data_Get_Struct(rb_rugged_repo, git_repository, repo);

  if (!error)// Add the ODB Backend
    error = git_odb_new(&odb);

  if (!error)
    error = git_repository_set_odb(repo, odb);

  if (!error)
    error = git_odb_backend_redis(&redis_odb_backend, conn);

  if (!error)
    error = git_odb_add_backend(odb, redis_odb_backend, 1);

  if (!error)// Add the RefDB Backend
    error = git_refdb_new(&refdb, repo);

  if (!error)
    error = git_repository_set_refdb(repo, refdb);

  if (!error)
    error = git_refdb_backend_redis(&redis_refdb_backend, repo, conn);

  if (!error)
    error = git_refdb_set_backend(refdb, redis_refdb_backend);

  return error;
}

Then wrap that with a ruby c extension and do something like this:

repo = Rugged::Repository.init_at('repo_path')
RedisGitBackend.connect_repo(repo, {host: '...', port: '...'})

Thoughts? That way the Rugged gem doesn't start depending on redis, sqlite, etc. etc. etc.

charypar commented 10 years ago

I agree completely that adding support for redis or sqlite directly to rugged isn't the right approach. That is just a temp hack. The desired final state is something like

require 'rugged-redis'

backend = RuggedRedis::Backend.new(some_opts)
Rugged::Repository.bare('my_repo', backend: backend)

The backend is coming from a separate gem, rugged only knows it can exist and know how to extract the odb and refdb backends from it. I find this more in line with libgit2, i.e. libgit2 supports backends but doesn't provide implementations, thus so should rugged. At least that's my thinking. Also it seems to have been the original intention of the authors, there was even support for writing backends in ruby, but it wasn't performing well from what I understand (https://github.com/libgit2/rugged/issues/39).

Wrapping is an interesting approach as well, though. Maybe we could combine the approaches and do something like repo.backend= that takes a Backend instance from an external gem wrapping some form of odb and refdb backends...

It would be good to get an opinion from some of the collaborators.

mhodgson commented 10 years ago

@charypar huh, yeah I like your approach too. Wonder if @arthurschreiber has any thoughts?

arthurschreiber commented 10 years ago

Yes, I do. :)

I've seen that the libgit2sharp folks have pluggable backends, that can be implemented in C# - I'd like to go with a similar approach. If that's really not an option from a performance point of view, we should see if we can implement separate gems for different backends.

mhodgson commented 10 years ago

@arthurschreiber wasn't that a previous feature of Rugged that was ripped out for performance reasons? It was before I was using Rugged but I'm pretty sure I unearthed something like that in my research.

Seems like the separate gem approach might be the right path. I guess we just need to come up with some generic structure (in c) that Rugged can recognize as a backend and plug into the repo. I'll see what I can come up with.

charypar commented 10 years ago

@mhodgson if you want to think about it together, I tend to hangout in #libgit2 on freenode these days.

charypar commented 10 years ago

@arthurschreiber @mhodgson This is the comment addressing ruby Backend removal for performance reasons https://github.com/libgit2/rugged/issues/39#issuecomment-3600332

charypar commented 10 years ago

@mhodgson After implementing about a third of the refdb, I'm almost sure you don't need the private header to get the git_reference struct. You can go through the public APIs (git_reference_target, git_reference_symbolic_target, git_reference_name).

mhodgson commented 10 years ago

@charypar that might be true, but what about the iterators? They need git_pool and git_vector which are both private.

charypar commented 10 years ago

@mhodgson Do they? Is that maybe specific to your implementation...? I'm not done with it so I could be wrong, but it doesn't seem like I'll need them ATM.

mhodgson commented 10 years ago

@charypar keep me posted. I tried to stick as close to the fs (filesystem) backend as possible but its very possible I didn't fully understand what I was doing and there is a simpler way. My implementation works, but I'm sure it's not ideal.

carlosmn commented 10 years ago

Iterators for external items, such as a backend, do not need anything from inside libgit2. The default backends might use them, but that's because they're in the same codebase.

On the subject of performance, It's not just the rugged/ruby conversions that make it slow (though that doesn't help either). Even staying in C, using an external database is going to cause issues. While looking at some bottlenecks, I saw that it takes us about 500ms to go through all the commits in the git repo (pretty rough number, but it's something that I measured recently). That history is made up of ~35k commits, which means that's how many objects we need to load from the database. That is, we used ~14us for loading each object.

The redis site claims that talking to it over a unix pipe can have latencies "as low as 30us". This is pretty quick for talking to a different process, but that's double the time it took us to load, and that's already without redis doing any processing or necessarily transferring the whole object over that pipe. Talking over localhost would take longer and have more overhead.

For SQLite, we're able to load the data into our address space, which should make that part a bit faster, but you still need to parse the SQL text and maybe copy the data over.

spearce recently sent a patch to git to simulate what it would be to have all objects in remote storage. The short version is that it's painfully slow.

What I'm trying to say is that whenever you're trying to use a different backend you should take a long hard look at how Git/libgit2 is able to use it. There is barely any possibility of paralellising object fetching, which means that you're paying for the full round-trip time for every object you want to look up, which quickly adds up. If you can come up with numbers that make sense for your organisation using this, then you should obviously not let me stop you, but I wanted to point out what the numbers are even before you get into what your db does. There is no way that this is not going to be noticeably slower than using the disc and letting the kernel put the data directly in libgit2's memory space.

charypar commented 10 years ago

@carlosmn Thanks for that detailed writeup! I finished the iteration already and yes, you don't need the insides of libgit2 to do it at all.

I'm aware of the performance problem, this is never going to perform great on big repositories, especially not over a network and in my mind that's just a tradeoff you're doing when storing remotely.

That said, for small to mid-sized repos it works reasonably quick for a web app to use and it's The Only Option^TM if you don't have a writable file system (ahem, ahem, Heroku). I think it's still worth doing this for that use case, it should just come with a warning that it won't be blazingly fast, or indeed fast at all.

I will come up with some numbers when I open a pull request on the redis backend, although I'll probably go through rugged (just because of the simplicity of doing it), so they might not be the greatest benchmark ever...

libgit2 / libgit2-backends

Update Sqlite backend to work with latest libgit2 #4