libgit2 / libgit2sharp

Git + .NET = ❤
http://libgit2.github.com
MIT License

Expose lower-level operations #127

Closed sc68cal closed 12 years ago

sc68cal commented 12 years ago

The code in Git-Tfs needs a way to pull in changesets from TFS and create commits, without touching the working directory or the index. Currently Git-Tfs shells out and uses the GIT_INDEX_FILE to create a temporary index - I'd like to replace that with libgit2sharp calls.

carlosmn commented 12 years ago

So here's a real-world need for the library's git_treebuilder. Accepting that a TreeBuilder object would be too low-level for the library, let's see if we can come up with a nice way of using it without doing so explicitly.

There is currently a Tree class which represents what is in an already-existing tree object in the repository. Let's say we add constructors like Tree() and Tree(Tree) which map to git_treebuilder_create() without and with a source tree; and then we put Tree.Add(name, oid, mode) which lets you insert paths into the tree. You'd then be able to use that tree when you call Tree.Write() (or something with a better name).

nulltoken commented 12 years ago

@carlosmn Wow, you're fast! Let me think a bit about this :)

sc68cal commented 12 years ago

You'd then be able to use that tree when you call Tree.Write() (or something with a better name).

Here's what we do currently

nulltoken commented 12 years ago

I may have a different proposal: add a RepositoryOptions parameter to a new overload of the Repository constructor.

var opts = new RepositoryOptions {
    WorkingDirectory = @"D:\path\to\a\dir",
    Index = @"E:\path\to\a\valid\index\file"    // might be a copy of the .git/index file
};

using (var repo = new Repository(@"C:\path\to\repo", opts))
{
   // do some stuff
}

This would leverage two repository setter functions from libgit2.

@sc68cal Unless I've missed something, this would match your requirements and allow you to prepare your commits using the current Repository.Index API, commit them, then update the branches without touching the "main" workdir and index file. As branches are immutable objects, one could even have multiple simultaneous Repository instances with different working directories, all sharing the same object database. The next time one refreshes/reloads a branch, it will silently benefit from the commits that have been added.

@carlosmn How do you feel about this?

Note: git_repository_set_workdir() doesn't prettify the path before setting it. Might be worth a fix.

carlosmn commented 12 years ago

Being able to specify a different index is useful, but it's not a real solution to this issue. Using a different index and calling git write-tree, like git-tfs does right now, is a workaround for mainline git not having any way to generate objects and trees directly in the repository. It's how you implement the treebuilder when you don't have it. We have the ability to create data directly in the repository and we should make use of it.

Not only that, but the role of the index is to act as a way to exchange information between the working tree and the repository proper which is what you do if you have real files. If you're importing information from somewhere else and have to use an index, you'll have to create those files on the disc, add them and then write out the index as a tree. That adds a lot of latency and unnecessary playing around, which would be especially noticeable on systems with slow fs metadata retrieval like NFS and, to a lesser degree, Windows (the Unices would still get hit by a single-use file, but not as much).

By forcing people to use an index instead of creating objects directly, we're adding an extra layer of abstraction (writing out the index is implemented with a treebuilder internally anyway) which costs time and effort. We can be better than git in this regard instead of re-implementing their workarounds.

dahlbyk commented 12 years ago

From a consistency standpoint, a Tree being a GitObject seems to preclude it from representing something that's not actually an object yet.

It seems reasonable to me to have a TreeBuilder constructed from a Repository and optional Tree, with get/add/remove support, eventually returning the new Tree on Write(). Not attaching it directly to Repository (like Index, Info, etc) would keep it partitioned from the core API to reinforce that it's not a primary use case.

nulltoken commented 12 years ago

@sc68cal I agree with @carlosmn that my proposal is nothing more than a workaround. Indeed, there's a need to be able to write to the odb. However, we need to define a dedicated API, as easy to use as possible. But this task may take some time.

Plan is to release next version (0.9) as soon as the libgit2 new-error-handling branch is merged. And I don't know if the Repository.ObjectDatabase would be ready to meet this milestone.

Basically, the choice is yours: wait for a full-fledged ObjectDatabase API or go the RepositoryOptions way.

Note: libgit2/libgit2@b78fb64d2f5c1b9e2e834e05e042271147c8d188 has been merged and makes the RepositoryOptions option less error prone.

nulltoken commented 12 years ago

a Tree being a GitObject seems to preclude it from representing something that's not actually an object yet.

@dahlbyk Agreed. Moreover, I'd prefer sticking with GitObjects being immutable.

We can be better than git in this regard instead of re-implementing their workarounds.

@carlosmn I don't see this as a "this XOR that" option. Not all Git users are able to write an essay about Git internals :) I'm committed to making the ObjectDatabase API as user-friendly/discoverable as possible, but I doubt it will be as easy to use, for a new-to-Git user, as staging a whole directory then issuing a commit.

the role of the index is to act as a way to exchange information between the working tree and the repository proper which is what you do if you have real files. If you're importing information from somewhere else and have to use an index, you'll have to create those files on the disc,

Agreed. As you've dug into the git.git source code more than I have, would you know how filters are applied to very large blobs? Is the blob entirely loaded in memory? How complex would it be to stream an object down to the odb without knowing in advance its total size?

@dahlbyk @carlosmn Let's start the API design party!

How about the following signatures?

 - bool ObjectDatabase.Contains(ObjectId id);
 - Blob ObjectDatabase.Add|Create(string fullpathToFile);

Is there a need to write a TagAnnotation by itself without creating an entry in refs/tags/?

I'm playing with some ideas related to the writing of Trees. However, I need to ensure it's feasible. I'll try and come back with a rough proposal later today or tomorrow.

spraints commented 12 years ago

RepositoryOptions should be workable for git-tfs.

Let me think out loud through how I'd use it...

  1. Create a temporary working directory for files pulled from TFS.
  2. Create a Repository with an index somewhere in the .git directory, and the temp directory as the working directory.
  3. Tell the repository to reset the index to the commit/tree we want to start with. (In a fresh git-tfs repo, we would skip this step, or it would be the null commit/tree.)
  4. Proceed with the fetch.

Is it reasonable to assume that the working directory won't need to be completely loaded? i.e. when git-tfs wants to add just one file, it'll create a file in the right place in the temp working dir, and tell the Index to stage it, and then commit. The Index and Repository won't care that the rest of the working directory is missing. Does that sound right?

Also, would this work for an otherwise bare repo? It seems like adding a working directory to a bare repo should make the bare repo behave like a normal repo.

As an alternative (more along the lines of the object database), I started implementing Repository.HashAndInsertObject(path) (in spraints@936ede3055997c1022ac333d827aceb1783c7700), but haven't gotten it working. This method is a simple version of the plumbing command git-hash-object, and it's the simplest thing that would work for git-tfs, for now. (We can continue using git update-index for index manipulation, for now.) HashAndInsertObject would be a step towards exposing the object database, too.

The ideal interface for git-tfs during a git tfs fetch would be something like this:

sc68cal commented 12 years ago

How complex would it be to stream an object down to the odb without knowing in advance its total size?

+1 to this idea - since we use a stream passed from the temporary file that TFS downloads.

nulltoken commented 12 years ago

@sc68cal The idea would be to implement the following signature:

Blob ObjectDatabase.Add|Create(StreamReader reader);

The StreamReader would have to be instantiated and disposed by the caller.

nulltoken commented 12 years ago

I'm playing with some ideas related to the writing of Trees. However, I need to ensure it's feasible. I'll try and come back with a rough proposal later today or tomorrow.

@carlosmn @dahlbyk I've paired with @yorah for a couple of hours and we came up with this gist.

Warning: This is untested. The code is crappy.

This tries to tackle two main issues:

Thoughts?

carlosmn commented 12 years ago

I don't see this as a "this XOR that" option. Not all Git users are able to write an essay about Git internals :)

Right, I mentioned both things being useful. However, the usage mentioned by @spraints of how to use the different index is precisely the workaround I'd very much like to avoid, because it adds a lot of overhead.

I'm not sure what you mean by "user" here. I generally presume some knowledge of how git works if you're going to use the library. A user here is still going to be a developer, whether they want to be or not. Making it simple is certainly good, but it's not bad to assume the user has some idea of what they're trying to achieve.

Is it reasonable to assume that the working directory won't need to be completely loaded? i.e. when git-tfs wants to add just one file, it'll create a file in the right place in the temp working dir, and tell the Index to stage it, and then commit. The Index and Repository won't care that the rest of the working directory is missing. Does that sound right?

Yes, you're right. When you stage a file, a blob gets created and its hash gets stored in the index, which is how we know what to put in the tree. You can see this by running something like git init -q; touch A B; git add A B; rm A; git status. A will still exist in the next commit because it's in the index. The working tree doesn't go directly into a commit.

As for the question of creating a commit by telling it a tree and a set of parents, the library supports that (plus author, committer and message info). The bindings don't, AFAIK.
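To illustrate what "a tree and a set of parents (plus author, committer and message info)" amounts to, here is a rough sketch of the text payload of a git commit object, written against the .NET base class library only. `CommitPayload` and `Build` are hypothetical names, not libgit2 or LibGit2Sharp API, and the signature strings are assumed to be preformatted by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of libgit2 or LibGit2Sharp.
public static class CommitPayload
{
    // author/committer are expected preformatted, e.g.
    // "Jane Doe <jane@example.com> 1330000000 +0000" (made-up values).
    public static string Build(string treeId, IEnumerable<string> parentIds,
                               string author, string committer, string message)
    {
        var sb = new StringBuilder();

        // Exactly one tree line: the snapshot this commit records.
        sb.Append("tree ").Append(treeId).Append('\n');

        // Zero parents for a root commit, one for a normal commit,
        // two or more for a merge.
        foreach (var parent in parentIds)
        {
            sb.Append("parent ").Append(parent).Append('\n');
        }

        sb.Append("author ").Append(author).Append('\n');
        sb.Append("committer ").Append(committer).Append('\n');

        // A blank line separates the headers from the message.
        sb.Append('\n').Append(message);
        return sb.ToString();
    }
}
```

Nothing in this payload refers to an index or a working directory, which is why a commit can in principle be created from a tree id alone.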

How complex would it be to stream an object down to the odb without knowing in advance its total size?

I think the library has streaming support again, though it gets icky because git objects have a variable-sized header which contains the full size of the data, so if you stream it, you need to store it somewhere and only once it's finished can you store the data in the real blob object. So it's better if you can know the size beforehand, but not critical (if you do know the size beforehand but would still like to stream it, we can probably put something in the library). I'm not sure how this relates to C# buffers or streams, so enlightenment is welcome.
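To make the header constraint concrete, here is a minimal sketch using only the .NET base class library, not libgit2 or LibGit2Sharp. `BlobHasher` and `HashBlob` are hypothetical names. It computes the id git assigns to a blob (SHA-1 over `blob <size>\0` followed by the content) from a stream of unknown length, so the content has to be buffered before the header can be produced.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of libgit2 or LibGit2Sharp.
public static class BlobHasher
{
    public static string HashBlob(Stream content)
    {
        using (var buffer = new MemoryStream())
        using (var sha1 = SHA1.Create())
        {
            // The size goes into the header, so drain the stream first.
            content.CopyTo(buffer);
            byte[] body = buffer.ToArray();

            string size = body.Length.ToString(CultureInfo.InvariantCulture);
            byte[] header = Encoding.ASCII.GetBytes("blob " + size + "\0");

            // Feed the header, then the body, into the same hash.
            sha1.TransformBlock(header, 0, header.Length, null, 0);
            sha1.TransformFinalBlock(body, 0, body.Length);

            return BitConverter.ToString(sha1.Hash)
                               .Replace("-", "")
                               .ToLowerInvariant();
        }
    }
}
```

For an empty stream this returns e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known id of the empty blob. If the caller knows the size up front, the buffering step could be skipped and the content hashed as it arrives.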

As for the proposed API, I like it. Being able to say "add file /tmp/some-file_AFDSK as nice/name" is a really nice touch and it looks like TFS would really benefit from it (or that's what I read from the comment that what it first does is download files to a temp location) and it'd bypass any need for an extra index while presenting a similar API.

What I'd like to see as well is a way to say "add blob deadbeef as some/file/name with attributes 04000" to be used when you have the data in memory instead of a temp file (yes, I have an obsession with avoiding touching the filesystem) but that seems like it'd be a natural extension of the proposed API. There's still a treebuilder branch in my fork that should help with talking to the C library.
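As a rough illustration of "add blob X as name with mode M" without touching the filesystem, the sketch below assembles a single-level git tree object in memory and computes its id. It uses only the .NET base class library; `TreeEntry` and `TreeHasher` are hypothetical names, not libgit2 or LibGit2Sharp API, and the sketch assumes ASCII file names and no subtrees.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical types for illustration; not part of libgit2 or LibGit2Sharp.
public sealed class TreeEntry
{
    public string Name;  // single path component, e.g. "README"
    public string Mode;  // octal mode as stored by git, e.g. "100644"
    public string Id;    // 40-character hex id of an existing object
}

public static class TreeHasher
{
    public static string HashTree(IEnumerable<TreeEntry> entries)
    {
        using (var body = new MemoryStream())
        using (var sha1 = SHA1.Create())
        {
            // Entries are stored sorted by name. Ordinal ordering matches
            // git only under this sketch's assumptions (ASCII, no subtrees).
            foreach (var e in entries.OrderBy(x => x.Name, StringComparer.Ordinal))
            {
                // Each entry is "<mode> <name>\0" followed by the raw 20-byte id.
                byte[] head = Encoding.UTF8.GetBytes(e.Mode + " " + e.Name + "\0");
                body.Write(head, 0, head.Length);
                for (int i = 0; i < 40; i += 2)
                {
                    body.WriteByte(Convert.ToByte(e.Id.Substring(i, 2), 16));
                }
            }

            byte[] payload = body.ToArray();
            string size = payload.Length.ToString(CultureInfo.InvariantCulture);
            byte[] header = Encoding.ASCII.GetBytes("tree " + size + "\0");

            sha1.TransformBlock(header, 0, header.Length, null, 0);
            sha1.TransformFinalBlock(payload, 0, payload.Length);

            return BitConverter.ToString(sha1.Hash)
                               .Replace("-", "")
                               .ToLowerInvariant();
        }
    }
}
```

With no entries this yields 4b825dc642cb6eb9a060e54bf8d69288fbee4904, the id of git's empty tree. A real treebuilder also validates names and handles nested trees, which this sketch leaves out.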

nulltoken commented 12 years ago

I've started to work on the ObjectDatabase. It's not done yet.

You can peek at the code in the topic/objectdatabase branch here.

nulltoken commented 12 years ago

@spraints Sorry for the late answer

  1. Create a Repository with an index somewhere in the .git directory, and the temp directory as the working directory.

The new index could be anywhere on the disk. It's not mandatory to have it in the .git folder.

Also, would this work for an otherwise bare repo? It seems like adding a working directory to a bare repo should make the bare repo behave like a normal repo.

Yes, this would work.

As an alternative (more along the lines of the object database), I started implementing Repository.HashAndInsertObject(path) (in spraints@936ede3), but haven't gotten it working.

I think the Repository.ObjectDatabase.CreateBlob() from #135 should cover this.

A way to create a commit using the Index's tree, and [0..2] parent commit SHAs. (I assume this already exists.)

CommitCollection.Create() won't allow one to explicitly select the parent commit. This option will only be available as part of Repository.ObjectDatabase.CreateCommit().