It definitely makes sense to hide the C details behind objects that feel native to Python. There's a couple of details:

- What would `Repository.history` be? As it's branches that have history (and that's what you usually care about), would `Repository.history['master']` give you a rev walker primed with master's tip?
- `Repository.lookup()` could take a revspec, but keep `Repository[]` for objects.
- `Repository.references`, `Repository.remotes` et al. should be strings. Otherwise you're going to end up loading a ton of stuff in memory.
- This might just be implicit in your proposal, but it'd be nice to have the types in there as well.
We certainly need a more friendly and consistent API.

- `'HEAD^2' in repo` does not make much sense to me.
- Use `refs = list(repo.references)` or `refs = [x.name for x in repo.references]` if that is what you need.
- Creating objects in memory, like `blob = Blob()`, and then writing them to the repo: `ref = Reference(...)` should be forbidden, as with anything else.
- Renaming `repo.TreeBuilder()` to `repo.create_tree()` would be fine, though you could not get rid of the `write()` at the end, since you need to populate the tree before writing it.

You are right about the efficiency part (in-memory object creation), but I do not like this exclusive way of creating objects (especially trees). I think there has to be a more consistent solution. Also, pygit2 is the only binding which forbids object instantiation.
Can you show how object instantiation looks in Rugged, for instance?
```ruby
repo = Rugged::Repository.new("path/to/repo")
oid = Rugged::Blob.create(repo, "a new blob content")
```

```ruby
repo = Rugged::Repository.new("path/to/repo")
ref = Rugged::Reference.create(repo, "refs/heads/unit_test", "refs/heads/master")
```

```ruby
repo = Rugged::Repository.new("path/to/repo")
entry = {:type => :blob,
         :name => "README.txt",
         :oid => "1385f264afb75a56a5bec74243be9b367ba4ca08",
         :filemode => 33188}
builder = Rugged::Tree::Builder.new
builder << entry
sha = builder.write(repo)
```
I don't know Ruby, so I will learn a little from the code.
For blobs and references I don't see any sensible difference between Ruby and Python. There is no `Rugged::Blob.new(...)`, and the repo must be passed at creation time:
```ruby
# Ruby
oid = Rugged::Blob.create(repo, "a new blob content")
```

```python
# Python
oid = repo.create_blob("a new blob content")
```
If we wanted pygit2 to look more like Ruby we would use a static method (though it is longer and in my opinion ugly):

```python
oid = pygit2.Blob.create(repo, "a new blob content")
```
Regarding the tree builder, there is a difference in when the repo is passed: at the beginning (Python) or at the end (Rugged). I do not have a strong opinion here.
In any case, the tree builder is fundamentally different from blobs, references, ... in that it needs to be constructed progressively before it is written.
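To make the flow concrete, the current pygit2 style is roughly this (the `blob_oid` value is assumed to exist already):

```python
# The builder is constructed from the repo, populated progressively,
# and only written at the end.
builder = repo.TreeBuilder()
builder.insert('README.txt', blob_oid, 33188)  # name, oid, filemode
tree_oid = builder.write()
```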
On Fri, Nov 23, 2012 at 05:20:25AM -0800, cholin wrote:
> Hmm I think `lookup_object()` is much more convenient, and for the user it doesn't matter what function we call internally. There is an ongoing discussion about refactoring the API in #139.
While I agree that a clean Python API is more than a bunch of C-API wrappers, I think it's worth minimizing unnecessary differences to keep the mental overhead of translating between the APIs small. In this case, `Repository.lookup_object()` has the same API as `Repository.revparse_single()`. The rename could confuse existing libgit2 users who are learning pygit2 for the first time. In some cases this would be worth it, but I don't understand how this rename makes the method clearer for virgin pygit2 users.
In my opinion the target users for pygit2 are not libgit2 developers. But maybe I'm wrong about that...
libgit2's API is getting cleaner and I think there will soon be a major release (version 1.0). So it would be good to have a consistent API in the near future. Maybe there will be some pygit2 hackers at http://git-merge.com/ in May in Berlin, so we can discuss this there. But I would prefer to come to a decision earlier!
I am (slowly) working on the docs. Now going through every function to add arguments (issue #85), and to use `PyDoc_STRVAR` for every docstring. If you want to help, you are welcome.
This issue is large. I think it would be simpler to split it by discussing and fixing one topic at a time (index file, references, etc.). I also like to look at the problem through the documentation, so I wanted to get the docs up-to-date first.
PS: I didn't know about the Git Merge event, I may be there...
I have some concrete ideas for this issue and for #71.
Basically, when I tried pygit2 for the first time, it seemed quite cumbersome to handle simple things, e.g. looking up all branches for a given repository. Essentially what I was missing was a pure-python productivity layer on top of pygit2 -- something that makes it easy to get an overview quickly. To illustrate how this might work, I sketched the following piece of code:
https://github.com/esc/pythonicgit2/blob/master/pythonicgit2.py
And this reminds me, in principle, of the example shown in this pull request. However, I am not yet convinced that this pythonic layer necessarily needs to be written as a C extension.
I agree. One idea is to have a low-level API written in C, with about one Python wrapper for every libgit2 function, and then write the high-level stuff in Python. The obvious place to start is the repository, by sub-classing it.
For instance, `Repository.create_reference` wraps two libgit2 functions; I would rather have two Python wrappers and write `create_reference` in Python.
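Something like this (just a sketch; the `_create_reference_direct` and `_create_reference_symbolic` wrapper names are hypothetical):

```python
# High-level method written in Python, dispatching to one thin C wrapper
# per libgit2 function (git_reference_create / git_reference_symbolic_create).
class Repository(_Repository):

    def create_reference(self, name, target, force=False):
        # A string target naming a ref means a symbolic reference;
        # anything else is treated as an oid for a direct reference.
        if isinstance(target, str) and target.startswith('refs/'):
            return self._create_reference_symbolic(name, target, force)
        return self._create_reference_direct(name, target, force)
```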
Commit 9ffc14148e46 introduces a `Repository` class written in Python.
The `Repository` class gets bloated with tons of methods (mixed up low and high level) with this approach, but I think it is a trade-off we can live with. In my opinion we should mark the low-level API with an underscore in the method name, so people can easily distinguish between low and high level.

As it seems important to have the same API as libgit2 in Python, I would prefer to use the same function names for the low-level API as well, or at least a similar naming convention (`reference_create_symbolic` or `create_reference_symbolic` instead of `create_symbolic_reference`).
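For example, the split could look like this (names illustrative, not actual pygit2 API):

```python
# Low-level wrapper, named after the libgit2 function, underscore-prefixed:
ref = repo._reference_create_symbolic('refs/heads/topic', 'refs/heads/master')

# High-level pythonic method built on top of it:
ref = repo.create_reference('refs/heads/topic', 'refs/heads/master')
```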
I concur: mapping the libgit2 API directly to python using the same function/method names is a good idea.
For connecting high- and low-level repository objects you can either use inheritance or composition (facade pattern). The disadvantage of inheritance is that you get a bloated class which may expose a lot of low-level functionality. Using an underscore may be one way to mitigate this (maybe even by monkey patching at runtime). The disadvantage of composition is that you may need a lot of boilerplate to forward and transform the high-level calls to low-level calls. On the other hand, you are cleanly separating low from high level.
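A minimal sketch of the composition variant (assuming the C extension exposes a `_pygit2.Repository` type):

```python
# Facade: the pure-Python Repository owns a low-level repository object
# and forwards calls to it, adapting names and arguments as needed.
class Repository:

    def __init__(self, path):
        self._repo = _pygit2.Repository(path)  # low-level C extension object

    def lookup_object(self, revspec):
        # Forward to the thin wrapper around git_revparse_single()
        return self._repo.revparse_single(revspec)
```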
On Sun, Mar 03, 2013 at 07:26:38AM -0800, Valentin Haenel wrote:
> Using an underscore may be one way to mitigate this (maybe even by monkey patching at runtime).
-∞ for monkeypatching namespace fixes ;).
I don't see a problem with class bloat here. Command line Git has many high and low level APIs in the same flat namespace. The way to distinguish between the two is with documentation (for stable APIs). Use underscores for anything that is too unstable to be worth maintaining (a pygit2-level decision), not for stuff that is too low level to be useful (an external-project-level decision).
The problem is there are methods which are too simple to wrap. Take `revparse_single`, for instance: if we decide to use `lookup_object` instead, then the Python code would look like:

```python
class Repository(_Repository):

    def lookup_object(self, refspec):
        return self._revparse_single(refspec)
```

Which, I think, does not provide any value. I would rather use the `lookup_object` name directly in the C extension.
Maybe talking about low/high level APIs is wrong. Maybe it is just about mixing Python and C code to better handle complexity.
Anyway, one thing I am concerned about is having clearer criteria for the API design and for the coding conventions, so that in the end pygit2 is consistent, and not a collage.
My example above is likely wrong. There is value in implementing (in Python) a cache for Git objects, so the `lookup_object` method would actually be "smarter" than that.
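For example (just a sketch; the weakref cache and the `_revparse_single` wrapper are assumptions, and the extension type would need to support weak references):

```python
import weakref

class Repository(_Repository):

    def __init__(self, *args):
        super(Repository, self).__init__(*args)
        # Cache Git objects by hex oid; entries vanish once unreferenced.
        self._cache = weakref.WeakValueDictionary()

    def lookup_object(self, refspec):
        obj = self._revparse_single(refspec)
        # Return the cached instance if we already have one for this oid.
        return self._cache.setdefault(obj.hex, obj)
```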
One advantage of having a low-level API where one Python method maps to one libgit2 function is that it makes it brain-dead to implement new features. If we go all the way and decide to keep the same names, then we could keep the `git_` prefix too, so `revparse_single` would actually be `git_revparse_single`; that would reduce namespace clutter.
On Sun, Mar 03, 2013 at 10:19:51AM -0800, J. David Ibáñez wrote:
> One advantage of having a low-level API where one Python method maps to one libgit2 function is that it makes it brain-dead to implement new features.
Brain dead wrapping + intelligent Pythonization sounds good to me.
> Brain dead wrapping + intelligent Pythonization sounds good to me.
+2
Hmm, in my opinion it's not that easy. For example, for iterators/generators we often use multiple git functions in each iteration step. But let's give it a try...
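For instance, a revision walker does something like this (the wrapper names are hypothetical):

```python
# Each iteration step needs two libgit2 calls: git_revwalk_next() to
# advance, then git_commit_lookup() to load the commit object.
def walk(repo, oid):
    walker = repo._revwalk_new()
    repo._revwalk_push(walker, oid)
    while True:
        next_oid = repo._revwalk_next(walker)  # None when the walk is over
        if next_oid is None:
            return
        yield repo._commit_lookup(next_oid)
```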
Have you looked at CFFI for wrapping the C API with little effort? https://cffi.readthedocs.org/en/release-0.6/#examples
I only recently discovered cffi and to my knowledge there is no attempt to wrap libgit2 with it in Python. There is however an experimental cygit wrapper at:
I did not know about cffi, nor was I aware that there is an active effort to develop cygit bindings.
There are also the glib bindings, https://git.gnome.org/browse/libgit2-glib
After spending a couple of days fighting with python's memory system, I took a look at CFFI.
It's pretty neat, and it does make wrapping the C easier, but performance is worse than with the current codebase (though it is faster with pypy than cpython). I've been doing some performance measurements, and a walk of the git repo takes twice as long. Looking at the cProfile output, it looks like we're spending as much time creating our objects as doing the walk, which is disappointing.
There's still some work left to be done perf-wise (we should be able to avoid copying data and creating objects in a few places) but we'd have to look at whether we're willing to take the performance hit for the ease of writing and extending the bindings.
Thanks @carlosmn for taking the time to experiment with CFFI and look at the performance.
In my opinion twice as long is too much. If we were starting pygit2 now we may take a different path. But we have not gone this far to switch now to CFFI, and give up that much on performance.
I have made a couple of changes to avoid making extra copies and whatnot, which brings the times of a walk down git.git + extracting the author to (rough measurements in seconds):

| git_revwalk_next() | pygit2 | pygit2-cffi |
|---|---|---|
| 0.710 | 0.960 | 1.420 |
So current pygit2 adds about 0.2s to the raw time from the walk, and the cffi version 0.7. I'm still optimistic about improving the performance.
Here's a fairer comparison, with the Linux repository. Instead of counting just the time for the walk, the C version also looks up the object, which is something we do in pygit2 (427782 commits, in seconds, with a warm filesystem cache).
| git-log | C | pygit2 | pygit2-cffi | pygit2-pypy | pygit2-cffi-pypy |
|---|---|---|---|---|---|
| 7.4 | 8.8 | 9.8 | 12.4 | 10.8 | 10.1 |
So it's not twice as slow, but it is about 25-35% slower, which is a shame.
Even though pypy needs to simulate refcounting (which it doesn't do internally) and other things to pretend that it's CPython, over the long run the JIT makes up for it. Using cffi on CPython means running more Python code than with "pure" pygit2, so CPython with cffi does come last. Where the JIT shines is with cffi, as we do end up running a lot of native code, which is what we're doing by compiling the C code in pygit2.
These figures should be taken with a pinch of salt, as they are in many ways a micro-benchmark that simply tests how fast we can create Python objects and pump data out of libgit2. A "real" application would be doing much more with this data, such that the differences would become more noisy.
The benchmark is essentially:

```python
import pygit2

repo = pygit2.Repository('linux/linux')
i = 0
for commit in repo.walk(repo.head.target, 0):
    i += 1
del repo
```
Concerning the decision to make, I give little weight to the pypy numbers, because it is not the mainstream implementation of Python, and the mainstream is what we should optimize for.
What matters to me here is the 26.5% difference between pygit2 and pygit2-cffi, which still looks like too much. Of course that's half the picture; the other half is: how much simpler is the code with cffi?
The code with cffi is both much smaller and simpler; most of the time we can simply proxy a call to the libgit2 function we want. We avoid all of the type-definition boilerplate and documentation macros.

We also get to write Python instead of C for any decisions we need to make, which also simplifies some expressions.
I gave a quick look at your cffi branch to see what CFFI looks like (and it looks good). There are some fixable details, like dropping support for Python 2.6; but better to focus on the core subject for the decision to make.
For instance, the new `hash(oid)` function looks to me much slower than the current one; since the oid is very likely to be used as a key in a dictionary, this is important. Just an example.
To the point: I am +1 to mixing CFFI and a C extension. CFFI would be used to implement new stuff, and why not to rewrite current features where performance is not an issue. I do think CFFI would be of great help to fill the gap between libgit2 and pygit2, that is: to implement the missing features and to stay up-to-date with changes in libgit2.
Thanks again @carlosmn for this effort, you have been the driving force behind pygit2 for a while now.
Walking the commit graph is probably the operation where we do the most back-and-forth between libgit2 and pygit2, and where we create the most objects in a short time, so this is where the largest differential would be found (unless you're iterating over all refs in a huge Gerrit project).
I would expect a benchmark of the index operations would show that the difference in performance would be small enough for the reduced code size (especially as it's C code) to be worth it.
The `hash(oid)` was a quick version I wrote without paying much attention to speed, just making sure it would return something sensible. I'm not sure how bad the loss of 2.6 support is. Even 2.7 is pretty old by now.
Mixing cffi and our custom C extension seems reasonable, though I wonder how tricky it would be. I don't think we currently use pure-Python classes from within the extension, but we can work around that. If/when we merge this, we should definitely implement new features with cffi first and convert to custom C code if it gets painful performance-wise. Stuff that's IO-bound, like remotes, shouldn't need any C code on our side.
I would like to see this implemented not as a huge PR, but one functional block at a time; this would help to better visualize the changes and simplify review. Yes, the remote code looks like a good candidate to start with.
(I would put the ffi code in a separate C file, just to enable syntax highlighting in the editor.)
Regarding support for Python 2.6, I am +0, because it is still widely deployed and takes little effort to support. On the other hand this is a volunteer effort, so if nobody cares, neither do I.
I'll see about writing remotes with cffi then. I'm not sure what you mean about the C file. All the C code we'd need should be a string with the contents of a pseudo-header for the stuff we want to import. Should that be in a file we read in at load time?
Yes, that's what I mean: an `open('ffi.c').read()`. Just to allow the editor to syntax highlight the C code, and so make it nicer to read. Unless there is a reason to do it with a string literal...
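Roughly like this (file and library names are assumed for illustration):

```python
import os
from cffi import FFI

ffi = FFI()
here = os.path.dirname(__file__)

# Read the declarations from a real file so editors highlight it as C code.
with open(os.path.join(here, 'decl.h')) as f:
    ffi.cdef(f.read())

C = ffi.dlopen('git2')  # load the installed libgit2 shared library
```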
I've gone with that approach, and loading it at runtime works fine, other than some problems I'm having with getting the file copied to where the code can find it.
`pygit2/decl.h` does not get copied into the build/lib.... directory, and adding `package_data` or `data_files` entries does not seem to help, so I can't get the CI to work with it. The otherwise-working code is in `cffi-remote` in my fork, if you'd like to take a look to see what can be done there.
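The kind of setup.py change in question would be something like this (a sketch; names and paths assumed):

```python
from setuptools import setup

setup(
    name='pygit2',
    packages=['pygit2'],
    # Ship decl.h alongside the Python modules so it can be read at runtime.
    package_data={'pygit2': ['decl.h']},
)
```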
Just pushed a `cffi-remote` branch with your changes, and added a commit to install `decl.h`. It works for me, tell me if it does for you.
The unit tests pass with Python 3. But with Python 2 I get 9 errors like:
```
Traceback (most recent call last):
  File "/home/jdavid/sandboxes/pygit2/test/test_remote.py", line 221, in test_update_tips
    remote.fetch()
  File "/home/jdavid/sandboxes/pygit2/build/lib.linux-x86_64-2.7/pygit2/remote.py", line 188, in fetch
    raise self._stored_exception
ValueError: <_cffi_backend.buffer object at 0x13eb280>
```
I was using `pygit2/decl.h` as a path instead of just `decl.h`. Your fix works fine here.
Not sure where those errors come from yet. I see them with pypy but not with CPython 2.7.
Time to close this one, in my opinion. The largest change since this discussion started has been the partial move to cffi. Further discussion on this topic should start fresh with a new issue.
With `git_revparse_single` I think we could improve object lookups a little bit. For example, `repo['HEAD']` or `repo.lookup_object('HEAD')` would be much more convenient to use than `repo.revparse_single('HEAD')`. At the moment the pygit2 API looks a little bit confusing because we just write bindings for C functions. I would like to have a clean, well-structured, pythonic `Repository` class instead of just a blind collection of `git_*` functions. Something like the following (of course written in C...):
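A rough sketch of what I mean (the method names are just suggestions):

```python
class Repository:

    def __getitem__(self, revspec):
        # repo['HEAD'] resolves any revspec to the corresponding object
        return self.lookup_object(revspec)

    def lookup_object(self, revspec):
        # Thin convenience around git_revparse_single()
        return self.revparse_single(revspec)
```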
I don't know if it's really necessary to create all objects through `Repository`... Also, I don't like the naming differences for object creation: `repo.create_X` or `repo.TreeBuilder().write()`. What do you think? Some suggestions?