libgit2 / pygit2

Python bindings for libgit2
https://www.pygit2.org/
Other
1.62k stars 387 forks source link

state of pygit2 #139

Closed cholin closed 10 years ago

cholin commented 12 years ago

With git_revparse_sigle I think we could improve object lookups a little bit. For example repo['HEAD'] or repo.lookup_object('HEAD') would be much more convenient to use instead of repo.revparse_single('HEAD'). At the moment the pygit2 api looks a little bit confusing because we just write bindings for c functions. I would like to have a clean well-structured pythonic Repository class instead of just a blind collection of git_* functions.

Something like the following (of course written in c...):

class Repository:
    def __init__():
        self.head = None
        self.path = ''
        self.workdir = ''

        self.is_detached = False
        self.is_orphan = False
        self.is_empty = False
        self.is_bare = False

        self.submodules = []
        self.references = []
        self.branches = []
        self.tags = []
        self.remotes = []
        self.history = []                    #rev walker

        self.config = Config()
        self.index = Index()
        self.status = Status()
        self.reflog = RefLog()

    def create_blob_from_buffer():
        pass

    def create_blob_from_file():
        pass

    def create_tree_from_dir():
        pass

    def create_tree_from_builder()
        pass

    def create_commit():
        pass

    def create_reference():
        pass

    def create_tag():
        pass

    def create_branch():
        pass

    def lookup_object():
        pass

    def lookup_reference():
        pass

    def pack_references():
        pass

    def checkout():
        pass

    def reset():
        pass

    def read_raw():
        pass

    def write_raw():
        pass

I don't know if it's really necessary to create all objects through Repository... As well I don't like the naming differences for object creations: repo.create_X or repo.TreeBuilder().write().

What do you think? Some suggestions?

carlosmn commented 12 years ago

It definitely makes sense to hide the C details behind objects that feel native to python. There's a couple of details:

This might just be implicit in your proposal, but it'd be nice to have the types in there as well.

jdavid commented 12 years ago

We certainly need a more friendly and consistent API.

cholin commented 12 years ago

You are right for the efficiency part (in-memory object creation), but I do not like this exclusive way of creating objects (especially trees). I think there has to be a more consistent solution. As well pygit2 is the only binding which forbids object instantiation.

jdavid commented 11 years ago

Can you show how it looks object instantiation in rugged, for instance?

cholin commented 11 years ago
repo = Rugged::Repository.new("path/to/repo")
oid = Rugged::Blob.create(repo, "a new blob content")
repo = Rugged::Repository.new("path/to/repo")
ref = Rugged::Reference.create(repo, "refs/heads/unit_test", "refs/heads/master")
repo = Rugged::Repository.new("path/to/repo")
entry = {:type => :blob,
         :name => "README.txt",
         :oid  => "1385f264afb75a56a5bec74243be9b367ba4ca08",
         :filemode => 33188}
builder = Rugged::Tree::Builder.new
builder << entry
sha = builder.write(repo)
jdavid commented 11 years ago

I don't know Ruby, so I will learn a little from the code.

For blobs and references I don't see any sensible difference between Ruby and Python. There is not a Rugged::Blob.new(....), and the repo must be passed at the creation time:

# Ruby
oid = Rugged::Blob.create(repo, "a new blob content")

# Python
oid  = repo.create_blob("a new blob content")

If we wanted pygit2 to look more like Ruby we would use an static method (though it is longer and in my opinion ugly):

oid = pygit2.Blob.create(repo, "a new blob content")

Regarding the tree builder, there is a difference in when the repo is passed: at the beginning (Python) or at the end (Rugged). I do not have an strong opinion here.

In any case, the tree builder is fundamentally different from blobs, references, ... in that it needs to be constructed progressively before it is written.

wking commented 11 years ago

On Fri, Nov 23, 2012 at 05:20:25AM -0800, cholin wrote:

Hmm I think lookup_object() is much more convenient and for the user it doesn't matter what function we call internally. There is a ongoing discussion about refactoring the api in #139.

While I agree that a clean Python API is more than a bunch of C-API wrappers, I think that it's worth minimizing unnecessary differences to keep the mental overhead of translating between the APIs small. In this case, Repository.lookup_object() has the same API as Repository.revparse_single(). The rename could confuse existing libgit2 users who are learning pygit2 for the first time. In some cases this would be worth it, but I don't understand how this rename makes the method clearer for virgin pygit2.

cholin commented 11 years ago

I my opinion the target users for pygit2 are not libgit2 developers. But maybe I'm wrong with that...

libgit2's api is getting cleaner and I think there will be soon a major release (Version 1.0). So it would be good to have a consistent api in near future. Maybe there are some pygit2 hackers at http://git-merge.com/ in May in Berlin, so we can discuss this there. But I would prefer to come to a decision earlier!

jdavid commented 11 years ago

I am (slowly) working on the docs. Now going through every function to add arguments (issue #85), and to use PyDoc_STRVAR for every docstring. If you want to help, you are welcome.

This issue is large. I think it would be simpler to split it by discussing and fixing one topic at a time (index file, references, etc.). I also like to look at the problem through the documentation, so I wanted to get the docs up-to-date first.

PS : I didn't know about the Git merge event, I may be there ...

esc commented 11 years ago

I have some concrete Ideas for this issue and for #71.

Basically, when I tried pygit2 for the first time, it seemed quite cumbersome to handle simple things, e.g. looking up all branches for a given repository. Essentially what I was missing was a pure-python productivity layer on top of pygit2 -- something that makes it easy to get an overview quickly. To illustrate how this might work, I sketched the following piece of code:

https://github.com/esc/pythonicgit2/blob/master/pythonicgit2.py

And this reminds me, in principle, of the example shown in this pull-request. However, I am not convinced yet, that this pythonic layer needs to be written as a C-extension necessarily.

jdavid commented 11 years ago

I agree. One idea is to write a low-level API written in C, where there is about one Python wrapper for every libgit2 function; and then write the high level stuff in Python. The obvious place to start is the repository, by sub-classing it.

For instance Repository.create_reference wraps two libgit2 functions, I would rather have two Python wrappers and write create_reference in Pyhon.

jdavid commented 11 years ago

Commit 9ffc14148e46 introduces a Repository class written in Python.

cholin commented 11 years ago

The Repository class gets bloated with tons of methods (mixed up low and high level) with this approach, but well I think it is a trade-off we can live with. In my opinion we should mark the low-level api with an underscore (method name). So people can easily distinguish between low and high level.

As it seems it's important to have the same api of libgit2 in python I would prefer prefer to use the same function names for the low level api as well or at least similiar naming convention (reference_create_symbolic or create_reference_symbolic instead of create_symbolic_reference)

esc commented 11 years ago

I concur: mapping the libgit2 API directly to python using the same function/method names is a good idea.

For connecting high and low level repository objects you can either use inheritance or composition (facade pattern). The disadvantage of inheritance is, that you get a bloated class which may expose much low level functionality. Using an underscore may be one way to mitigate this (maybe even by monkey patching at runtime). The disadvantage of composition is, you may need much boilerplate to forward and transform the high-level calls to low-level calls. On the other hand, you are cleanly separating low- from high-level.

wking commented 11 years ago

On Sun, Mar 03, 2013 at 07:26:38AM -0800, Valentin Haenel wrote:

Using an underscore may be one way to mitigate this (maybe even by monkey patching at runtime).

-∞ for monkeypatching namespace fixes ;).

I don't see a problem with class bloat here. Command line Git has many high and low level APIs in the same flat namespace. The way to distinguish between the two is with documentation (for stable APIs). Use underscores for anything that is too unstable to be worth maintaining (a pygit2-level decision), not for stuff that is too low level to be useful (an-external project-level decision).

jdavid commented 11 years ago

Problem is there are methods which are too simple to wrap. For instance revparse_single, if we decide to use lookup_object instead, then the Python code would look like:

class Repository(_Repository):

    def lookup_object(self, refspec):
        return self._revparse_single(refspec)

What I think does not provide any value. I would rather use the lookup_object name directly in the C extension.

Maybe talking about low/high level APIs is wrong. Maybe it is just about mixing Python and C code to better handle complexity.

Anyway, one thing I am concerned about is having a clearer criteria for the API design, and for the coding conventions. So at the end pygit2 is consistent, and not a collage.

jdavid commented 11 years ago

My example above is likely wrong. There is a value to implement (in Python) a cache for Git objects, so the lookup_object method would actually be "smarter" than that.

One advantage of having a low-level API where one Python method maps to one libgit2 function, is to make it brain-dead to implement new features. If we go full that way and decide to keep the same name, then we could keep the git prefix too, so revparse_single would actually be git_revparse_single, that would reduce namespace clutter.

wking commented 11 years ago

On Sun, Mar 03, 2013 at 10:19:51AM -0800, J. David Ibáñez wrote:

One advantage of having a low-level API where one Python method maps to one libgit2 function, is to make it brain-dead to implement new features.

Brain dead wrapping + intelligent Pythonization sounds good to me.

esc commented 11 years ago
Brain dead wrapping + intelligent Pythonization sounds good to me.

+2

cholin commented 11 years ago

Hmm in my opinion it's not that easy. For example for iterators/generator we often use multiple git functions in each iteration step. But let's give it a try...

bors-ltd commented 11 years ago

Have you looked at CFFI for wrapping the C API with little effort? https://cffi.readthedocs.org/en/release-0.6/#examples

esc commented 11 years ago

I only recently discovered cffi and to my knowledge there is no attempt to python wrap lingit2 with it. There is however an experimental cygit wrapper at:

https://github.com/sjagoe/cygit2

jdavid commented 11 years ago

I did not knew about cffi, nor was aware that there is an active effort to develop cygit bindings.

There are also the glib bindings, https://git.gnome.org/browse/libgit2-glib

carlosmn commented 10 years ago

After spending a couple of days fighting with python's memory system, I took a look at CFFI.

It's pretty neat, and it does make wrapping the C easier, but performance is worse than with the current codebase (though it is faster with pypy than cpython). I've been doing some performance measurements, and a walk of the git repo takes twice as long. Looking at the cProfile output, it looks like we're spending as much time creating our objects as doing the walk, which is disappointing.

There's still some work left to be done perf-wise (we should be able to avoid copying data and creating objects in a few places) but we'd have to look at whether we're willing to take the performance hit for the ease of writing and extending the bindings.

jdavid commented 10 years ago

Thanks @carlosmn to take the time to experiment with CFFI and look at the performance.

In my opinion twice as long is too much. If we were starting pygit2 now we may take a different path. But we have not gone this far to switch now to CFFI, and give up that much on performance.

carlosmn commented 10 years ago

I have made a couple of changes to avoid making extra copies and whatnot, which brings the times of a walk down git.git + extracting the author to (rough measurements in seconds)

git_revwalk_next() pygit2 pygit2-cffi
0.710 0.960 1.420

So current pygit2 adds about 0.2s to the raw time from the walk, and the cffi version 0.7. I'm still optimistic about improving the performance.

carlosmn commented 10 years ago

Here's a fairer comparison, with the Linux repository. Instead of counting just the time for the walk, the C version also looks up the object, which is something we do in pygit2 (427782 commits, in seconds, with a warm filesystem cache).

git-log C pygit2 pygit2-cffi pygit2-pypy pygit2-cffi-pypy
7.4 8.8 9.8 12.4 10.8 10.1

So it's not twice as slow, but it is about 25-35% slower, which is a shame.

Even though pypy needs to simulate refcounting, which it doesn't do internally, and other things to pretend that it's CPython, over the long run the jitter makes up for it. Using cffi on CPython means running more python code than with "pure" pygit2, so CPython with cffi does come last. Where the jitter shines is with cffi, as we do end up running a lot of native code, which is what we're doing by compiling the C code in pygit2.

These figures should be taken with a pinch of salt, as they are in many ways a micro-benchmark, that's simply testing how fast we can create python objects and pump data out of libgit2. A "real" application would be doing much more with this data, such that the differences would become more noisy.

The benchmark is essentially

repo = pygit2.Repository('linux/linux')
for commit in repo.walk(repo.head.target, 0):
    i += 1

del repo
jdavid commented 10 years ago

Concerning the decision to make, I give little weight to the pypy numbers, because it is not the mainstream implementation of Python we should optimize for.

What matters to me here is the 26.5% difference between pygit2 and pygit2-cffi, what still looks like too much. Of course that's half the picture, the other half is: how much simpler is the code with cffi?

carlosmn commented 10 years ago

The code with cffi is both much smaller in size as well as simpler, most of the time we can simply proxy a call to the libgit2 function we want. We avoid all of the type definition boilerplate and documentation macros.

We also get to write python instead of C for any decisions we need to make, which also simplifies some expressions.

jdavid commented 10 years ago

I gave a quick look at your cffi branch to see what CFFI looks like (and it looks good). There are a some fixable details, like it drops support for Python 2.6; but better to focus on the core subject for the decision to make.

For instance, the new hash(oid) function looks to me like much slower than the current one, since the oid is very likely to be used as a key in a dictionary, this is important. Just an example.

To the point, I am +1 to mix CFFI and a c extension. CFFI would be used to implement new stuff, and why not to rewrite current features where performance is not an issue. I do think CFFI would be of great help to fill the gap between libgit2 and pygit2, this is to say: to implement the missing features and to stay up-to-date to changes in libgit2.

Thanks again @carlosmn for this effort, you have been the driving force behind pygit2 for a while now.

carlosmn commented 10 years ago

Walking the commit graph is probably the operation where we do the most back-and-forth between libgi2 and pygit2, and where we create the most objects in a short time, so this is where the largest differential would be found (unless you're iterating over all refs in a huge gerrit project).

I would expect a benchmark of the index operations would show that the difference in performance would be small enough for the reduced code size (especially as it's C code) to be worth it.

The hash(oid) was a quick version I wrote without much looking at speed, just that it would return something sensible. I'm not sure how bad loss of 2.6 support is. Even 2.7 is pretty old by now.

Mixing cffi and our custom C extension seem reasonable, though I wonder how tricky it would be. I don't think we currently use pure-python classes from within the extension, but we can work around that. If/When we merge this, we should definitely be implementing new features with cffi first and convert to custom C code if it gets painful performance-wise. Stuff that's IO-bound like remotes shouldn't need any C code on our side.

jdavid commented 10 years ago

I would like to see this implemented not as a huge PR, but a functional block at a time, this would help to better visualize the changes and simplify review. Yes the remote code looks like a good candidate to start.

(I would put the ffi code in a separate C file, just to enable syntax highlight in the editor.)

jdavid commented 10 years ago

Regarding support for Python 2.6 I am +0, because it is still widely deployed and takes little effort to support it. On the other hand this is a volunteer effort, so if nobody cares neither do I.

carlosmn commented 10 years ago

I'll see about writing remotes with cffi then. I'm not sure what you mean about the C file. All the C code we'd need should be a string with contents of a pseudo header for the stuff we want to import. Should that be in a file we read in at load time?

jdavid commented 10 years ago

Yes, that's what I mean, a open('ffi.c').read(). Just to allow the editor to syntax highlight the C code, and so make it nicer to read. Unless there is a reason to do it with an string literal...

carlosmn commented 10 years ago

I've gone with that approach, and it loading it at runtime works fine, other that I am having some problems with making the file get copied to where the code can find it.

pygit2/decl.h does not get copied into the build/lib.... directory, and adding package_data or data_file entries does not seem to help, so I can't get the CI to work with it. The otherwise-working code in cffi-remote in my fork, if you'd like to take a look to see what can be done there.

jdavid commented 10 years ago

Just pushed a cffi-remote branch with your changes, and added a commit to install decl.h, it works for me, tell me if it does for you.

The unit tests pass with Python 3. But with Python 2 I get 9 errors like:

Traceback (most recent call last):
  File "/home/jdavid/sandboxes/pygit2/test/test_remote.py", line 221, in test_update_tips
    remote.fetch()
  File "/home/jdavid/sandboxes/pygit2/build/lib.linux-x86_64-2.7/pygit2/remote.py", line 188, in fetch
    raise self._stored_exception
ValueError: <_cffi_backend.buffer object at 0x13eb280>
carlosmn commented 10 years ago

I was using pygit2/decl.h as a path instead of just decl.h. Your fix works fine here.

Not sure where those errors come from yet. I see them with pypy but not with CPython 2.7.

jdavid commented 10 years ago

Time to close this one in my opinion. The largest change since this discussion started has been the partial move cffi. Further discussion on this topic should start fresh with a new issue.