samcday opened this issue 11 years ago
@FrozenCow @lushzero @jmendeth @ben @iamwilhelm - Who's it gonna be fellas? :)
Does it need to be just one?
[...] has some Github projects demonstrating an understanding of Node and C++ [...]
I have some overall experience with it. I've mostly written Robotskirt (a Node.JS wrapper for the now-frozen Sundown library), as you can see in the contributors page; from that I derived the V8U set of microutilities for writing V8 addons. And some other small modules, like parport.js (high-level access to parallel ports from Node).
But I wouldn't like to hold that responsibility alone.
PS: @ben is already a maintainer. :)
I have experience with C++ and I have experience with Node/JS. However, I don't have any experience writing C++ addons for Node (apart from the PR I did here), so I don't think I'm fit for the job.
Does it need to be just one?
I agree, 2 or more would be best.
However, I don't have any experience writing C++ addons for Node (apart from the PR I did here)
you'll get experience from gitteh! :D
@samcday That's good to hear. I'd hate to see it languish. I have a vested interest in making sure gitteh is up to date and up to par, as I use it in cubehero.com. However, I haven't written a node native module before; I'm not against spending the time to learn, though. I think as one of the maintainers, I'd probably start off doing things like making sure documentation is up to date, and making sure PRs and issues don't just sit there.
I think as one of the maintainers, I'd probably start off doing things like making sure documentation is up to date, and making sure PRs and issues don't just sit there.
That's the spirit! It's clear that we need several maintainers. :smiley:
I'm afraid I have to bow out for the maintainer role. I'm a bit overcommitted just now, and more projects are coming down the pipeline. I'll keep watching this, and feel free to summon me for libgit2 questions.
I'm afraid I have to bow out for the maintainer role. I'm a bit overcommitted just now, and more projects are coming down the pipeline. I'll keep watching this, and feel free to summon me for libgit2 questions.
I understand. But don't worry, we'll still @ summon you. ;)
I'm pretty much out on gitteh at this point and working on sgit. To me, a lot of the fundamental technological decisions made in gitteh were sound at the time they were made, but not now. Libgit2 is very different now, so they don't make sense anymore, e.g. the C++ async, thread handling and thread locking, which make up 70% of gitteh's code.
@lushzero Fair enough, I'm not sure I'd want someone working on gitteh if they make statements like that without backing it up with proper reasoned arguments ;)
@jmendeth @FrozenCow @iamwilhelm Thanks for indicating interest. I wrote up the initial issue pretty fast, and didn't mean for it to sound like it precludes more than one active maintainer. What I meant by it is I need someone (or someones, hehe) to be involved in gitteh enough so that it keeps moving along, as it's not really something I realistically have time for. That said, I plan on remaining involved, and may be able to contribute some more features over the coming weeks / months. We'll see.
@iamwilhelm I didn't actually realise you were using gitteh in anger somewhere like that. That's pretty cool! Gitteh is actually something I wrote for a project I was working on a while ago, but I've since scrapped that project and I think that's one of the biggest reasons why I haven't had much time for gitteh.
@ben Understandable!
Okay so the guys who are interested, sounds like we need to divvy up some responsibilities. Here's a quick rough cut of what might work:
Whatcha think?
Yeah, that sounds great to me.
But I think that, before we start working, we should set up a simpler build system (i.e. remove coffeescript), so that we (and future contributors) can focus better on it.
There are also things we can simplify in gitteh's code, now that LibUV is standard in every Node.
Since @lushzero was working on a fork, out of interest I thought I'd look around to see whether there were other git libraries for node based on libgit2. I found https://github.com/nodegit/nodegit/tree/wip . Does anyone know anything about that project? They seem to do things very similar to what node-gitteh does and are still active. I'm not sure how far along they are, but they do have treebuilder stuff in there, which was something I wanted to put in gitteh too.
Counting sgit, there are 3 projects doing the exact same thing, with slight differences in their APIs and the way they work. I'm a bit torn at what to do here. Focus our efforts on one project? Keep developing multiple projects side-by-side? Collaborate a bit? I'm still interested in the new methods @lushzero is suggesting, because the current way to implement methods in gitteh is quite hard as a beginner. With the code being in a better state, it would be easier for people to contribute.
That being said (back on-topic), I don't mind being a maintainer, though I do not know how long it will be of interest for me. I'm working on a project for myself that uses gitteh and for that I'm adding features that are needed, but I don't know how long that project will last. I'll likely lose interest/become less active once that project dies. So, short-term: yes. long-term: I'm not sure.
@lushzero @FrozenCow @samcday
My thoughts:
@FrozenCow About your interest matter, I have to say: this is what happens to everyone. :smile: Even @samcday admits he once lost interest (and now Gitteh seems interesting again). I also wanted to do a rewrite and actually started it (and developed it for some time), and it has some functionality, but now it sits there, abandoned.
So, that "interest" problem is normal for everyone; what (I think) is important here is that we inform others about it before leaving. Other than that, I don't see any problem in being a maintainer.
PS. Sorry for my ugly English.
I thought I might introduce myself as I'm now making active contributions to nodegit. I'm the author of the wip branch https://github.com/nodegit/nodegit/tree/wip
The wip branch is a rewrite of nodegit from the ground up. It uses codegen to generate the node bindings automatically, from a JSON description of the API. The wip branch is, I believe, already more comprehensive than both nodegit/master and node-gitteh. IMO, codegen is a better approach than writing bindings by hand because it requires less manual work to add support for a new function -- and libgit2's API is huge! Other advantages include: a bug fixed in one place is a bug fixed in all places, etc.
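To make the codegen input concrete, here is a purely hypothetical shape for one entry in such a JSON description -- the actual schema nodegit uses is its own (a version of it is linked later in this thread):

```js
// Hypothetical codegen input entry (illustrative shape only, not the
// actual nodegit schema): the generator would expand each entry into
// a C++ binding plus its async plumbing.
var entry = {
  cFunction: "git_revwalk_next",   // libgit2 C function to wrap
  jsClass:   "RevWalk",            // JS class receiving the method
  jsMethod:  "next",               // generated method name
  isAsync:   true,                 // run on the libuv threadpool
  args: [
    { name: "out",  cType: "git_oid *",     isReturn: true },
    { name: "walk", cType: "git_revwalk *", isSelf: true }
  ]
};
```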
This wip version isn't exactly quick and low-level, although it follows this simple[istic] design philosophy: it aims to implement something like a 1:1 mapping between the node API and the C API, with a layer added in pure Javascript that is more "convenient". For example, there is a wrapper around the treebuilder to make it simple to make deep (nested) changes to trees https://github.com/nodegit/nodegit/blob/wip/example/new-commit.js and there are event emitters for walking history https://github.com/nodegit/nodegit/blob/wip/example/walk-history.js.
All of that "convenience" is implemented in Javascript so people can make nodegit as friendly as possible without having to write C++ -- and at the same time, since nearly 100% of the libgit2 functionality is available to them, people are no longer blocked waiting for someone else to implement some function (like clone or treebuilder) in C++.
Even the 1:1 "raw" API is not as bad as it sounds. It's still object-oriented in style, and is asynchronous exactly and only where libgit2 (potentially) does blocking IO:
git.Repo.open(path.resolve(__dirname, '../.git'), function(error, repo) {
  if (error) throw error;
  repo.getCommit('59b20b8d5c6ff8d09518454d4dd8b7b30f095ab5', function(error, commit) {
    if (error) throw error;
    commit.getEntry('README.md', function(error, entry) {
      if (error) throw error;
      entry.getBlob(function(error, blob) {
        if (error) throw error;
        console.log(entry.name(), entry.sha(), blob.size());
      });
    });
  });
});
Note that in the above code, nothing is part of the "convenience" API -- everything is 1:1 and automatically generated. Preliminary documentation of this raw API is here: http://www.nodegit.org/nodegit/ . Documentation of the convenience layer is forthcoming (but check the examples at https://github.com/nodegit/nodegit/tree/wip/example )
I'd like to explore the idea of whether we can merge these two projects. That is, whether we can just take the codegen rewrite of nodegit, and once it is stable and well documented, call it libgit2/WHATEVER and deprecate the previous two projects. One project has the benefit of: one implementation, more maintainers, fewer bugs, better documentation.
So that's my hope.
Hi @nkallen,
that explains many things (I remember nodegit being quite hackish, with a raw API, and the results with codegen really impressed me). However, going from manual to auto is a huge change, and has to be studied accordingly (it's also a big philosophy change).
You see, Gitteh now wants to remodel itself and needs some people to care for it (the reason for this thread). So I think the best approach would be to let @samcday decide, then wait a bit so we can simplify and get things back to stability, and then we can all debate the fusion you're proposing -- I'm serious about getting a single project with all our efforts combined.
Conclusion: we need some time to get Gitteh itself back to normal, and then we can talk about that change. :)
My thoughts.
@nkallen I find the codegen approach interesting. I guess it sort of leads into the main goals of a binding project and what individual developers want to use it for. I have been making pretty swift progress on node-sgit, https://github.com/DerpKitten/node-sgit, and ended up with something relatively high level. My needs are to be able to create repositories, commit files to them and then get history and revisions of files. What particular codegen tool are you using?
I am really curious as to what everyone else's needs are and would a higher level or lower level approach be a better fit for them?
I do concur that finding someone to combine efforts would make sense if it can be worked out. Although I could definitely see a situation that results in 2 binding modules, one low level and one higher level.
@lushzero Oohh, I googled node-sgit and looked at your account and couldn't find it. I've now caught you! >:)
I am really curious as to what everyone else's needs are and would a higher level or lower level approach be a better fit for them?
I'm also curious as to what benefits node-sgit gives and how much precision it offers over the generated bindings. Performance is also an important matter.
I do concur that finding someone to combine efforts would make sense if it can be worked out. Although I could definitely see a situation that results in 2 binding modules, one low level and one higher level.
Me too.
@jmendeth I don't think this is something that needs me to make a decision on actually! For one, I'm putting gitteh out there for some new people to take ownership of, so my vote means a whole lot less anyway. Second of all, I think @nkallen's approach sounds very promising, as maintaining C++ bindings is tedious and error prone. If the general consensus from those whose opinions I value (basically everyone in this thread) is that a codegen approach is better, then I think that we should go for it. If we can deprecate gitteh and nodegit in favour of a unified codebase, vision and group of contributors, I think that's the best possible outcome for the Node community that just wants to load a couple of commits / blobs and walk a reflog or two :P
If we can deprecate gitteh and nodegit in favour of a unified codebase, vision and group of contributors, I think that's the best possible outcome for the Node community that just wants to load a couple of commits / blobs and walk a reflog or two :P
Don't get me wrong, I would just like to know e.g. the performance impact, modularity change, API compatibility, etc. of such a big change, before jumping into it. But if you guys see it clearly, let's go on! :) So what do you suggest?
EDIT: replies going fast, this was in a response to @samcday ;) I agree with what @jmendeth just said.
Sounds good. We could also decide to start in a separate project (like nodegit's wip branch) and move it over under the libgit2 umbrella once we think things are stable/settled. We could also decide to stay api-compatible with gitteh and keep everything under the same name. People who were using gitteh can keep using gitteh that way.
Anyway, this is a pretty substantial decision. I still need to try nodegit's wip branch by running it in an application. That should give me a better idea what to do next.
Sounds good. We could also decide to start in a separate project (like nodegit's wip branch) and move it over under the libgit2 umbrella once we think things are stable/settled. We could also decide to stay api-compatible with gitteh and keep everything under the same name. People who were using gitteh can keep using gitteh that way.
That's one option. The other is to just deprecate this and focus our efforts on nodegit, as @samcday suggested.
If we were to deprecate gitteh, we could write a last gitteh version module that adapts nodegit's API to Gitteh's.
(that is, a simple JS module that depends on nodegit and uses it to provide Gitteh's API, plus new features).
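A rough illustration of that idea (the nodegit calls follow the wip examples quoted in this thread; the gitteh-style surface here is invented for illustration):

```js
// gitteh-compat: an illustrative adapter exposing a gitteh-style
// openRepository() on top of nodegit's wip API. The wrapped method
// names are assumptions, not a drop-in implementation.
var git = require('nodegit');

exports.openRepository = function(path, callback) {
  git.Repo.open(path, function(error, repo) {
    if (error) return callback(error);
    // Wrap the nodegit repo in an object mimicking gitteh's surface.
    callback(null, {
      commit: function(sha, cb) { repo.getCommit(sha, cb); },
      raw: repo // escape hatch to the underlying nodegit object
    });
  });
};
```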
Anyway, this is a pretty substantial decision. I still need to try nodegit's wip branch by running it in an application. That should give me a better idea what to do next.
:+1:
Could anyone else chime in with their requirements? Mine are pretty simple but had little or no reliable support in any of the existing modules I tested. I settled on a very simple API that is probably closer to the command-line git than anything else. Super simple, no nesting, calls are self contained. The result is easy to learn, easy to use, extremely fast and light on memory. https://github.com/DerpKitten/node-sgit/blob/master/example.js . I'm not sure if my use cases are anything like anyone else's though.
This isn't a reflection of the qualitative value of nodegit, just the fact that it is much lower level: the "log" case in nodegit is roughly 38 lines (see the API example, https://github.com/nodegit/nodegit) versus only 6 in sgit. In both cases the timing is quite fast. Across ten runs against the same .git repo (sgit) with 479 commit log entries I get a median of 27541637 nanoseconds for sgit and 107501468 nanoseconds for nodegit. Sgit is faster (by about 80 million nanoseconds), but that's negligible for any real-world use and the benchmark methodology is imperfect (process.hrtime). Mostly the difference is in the approach, I think: nodegit loops over each entry, sgit gets them in a large batch.
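For reference, medians like those could come from a small process.hrtime harness along these lines (a sketch; runLog is a placeholder wrapping whichever library's log call is being timed):

```js
// Minimal timing harness: run `runLog` `iterations` times and report
// the median wall time in nanoseconds. `runLog` is a placeholder for
// the sgit/nodegit/gitteh log call under test.
function bench(runLog, iterations, done) {
  var samples = [];
  (function next(i) {
    if (i === iterations) {
      samples.sort(function(a, b) { return a - b; });
      return done(samples[Math.floor(samples.length / 2)]);
    }
    var start = process.hrtime();
    runLog(function() {
      var diff = process.hrtime(start);       // [seconds, nanoseconds]
      samples.push(diff[0] * 1e9 + diff[1]);  // convert to nanoseconds
      next(i + 1);
    });
  })(0);
}
```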
If someone can provide a working example of the "log" case with gitteh I can time that as well.
To me it's sort of meaningless to talk about performance in abstract without talking about actual numbers and applicable use cases. From my read everything anyone here is talking about is pretty damn fast, there's not all that much road between the JS call, the C++ wrapping and the underlying C library no matter how you cut it. I don't think performance is going to be a big issue for anyone but please explain how that thinking is wrong.
My suggestion is that if we can all agree on a rough sketch of what the node API should look like, we can then decide which backend configuration(s) make the most sense for most of the people most of the time.
Hi guys,
I'm excited you are going to consider the codegen approach as an option. I agree with the commenters above ( @jmendeth , @FrozenCow) that you guys should evaluate it thoroughly before making a commitment. It's still a work in progress, and an outside evaluation is exactly what I need to get the quality to a high level. I'm sure you'll find bugs and have suggestions for improvements, but I'm confident that after I incorporate your feedback, you'll find the functionality comprehensive, stable, and performant.
Whoever is willing to volunteer to beta-test it, I'm available to help. The documentation is a bit sparse, as I said, but you can read the examples linked to above and also the source code to this application I built http://gitdb.elasticbeanstalk.com/repos/gitdb/refs/heads/master/tree/ which uses it.
Making the decision to adopt codegen won't be easy. The API will break. Yes, a compatibility layer in Javascript can be written -- and it's "easy" to do so, if incredibly tedious.
If I may "chime in" with some requirements, I think the goal first and foremost should be to support as large a portion of libgit2 functionality as possible. That supports the widest variety of use-cases, obviously. A good reason not to have 3 separate projects is that libgit2's API is large, and getting decent coverage is a lot of effort.
For example @lushzero , here is a pretty terse way to do a git log in nodegit/wip:
git.Repo.open(path.resolve(__dirname, '../.git'), function(error, repo) {
  repo.getMaster(function(error, branch) {
    var history = branch.history();
    history.on('commit', function(commit) {
      console.log('commit ' + commit.sha());
      console.log('Author:', commit.author().name() + ' <' + commit.author().email() + '>');
      console.log('Date:', commit.date());
      console.log('\n    ' + commit.message());
    }).start();
  });
});
Again, following the same approach: nearly all of libgit2's functionality is directly available in javascript-land, and the API is as friendly as you want by writing wrappers in javascript.
The reason libgit2 doesn't JUST have a simple git log command is because you might want to do git log a..b, abort after 100 entries, change the sorting to topological, or any manner of crazy combinations. Thus libgit2 provides git_revwalk_new, git_revwalk_push, and git_revwalk_next. And similarly, nodegit/wip now provides all of that to you in javascript land too -- and ALSO the simple, friendlier branch.history.on(), which is implemented in pure javascript. That is surely easier than writing a bunch of friendly code in C++, for example as you have here: https://github.com/DerpKitten/node-sgit/blob/master/src/sgit.cc#L130 .
That said, your approach has, at least in theory, higher throughput because it crosses the runtime boundary less often and has less context switching (because fewer items are put in the libuv event queue). On the other hand, your approach doesn't interleave well. It hogs the libuv thread while it looks up all of the commits -- each of which is potentially a random access on disk -- which means that a concurrent application (like a web server) will have extremely high latency variance as disproportionately large jobs clog up your thread pool.
In summary, I think libgit2's API strikes the right balance of small, compose-able tools that provide maximum flexibility. Making it directly available in Javascript (in an OO way) provides the greatest functionality and flexibility. The only downside of the 1:1 mapping is that it means crossing the runtime barrier and context switching more often -- potentially a performance hit. However, I'm confident I can address performance issues as they arise, and my preliminary benchmarks suggest that preloading data/minimizing runtime boundary crossings provides a minimal benefit.
It hogs the libuv thread while it looks up all of the commits -- each of which is potentially a random access on disk -- which means that a concurrent application (like a web server) will have extremely high latency variance as disproportionately large jobs clog up your thread pool.
Exactly. That's the big inconvenience I see in node-sgit.
@nkallen Looking at the generated (because they're autogenerated?) nodegit C files, I see it uses Boost to escape values, joins them into JSON and passes that to the JS side. It'll be interesting to see the performance of that against what a manually written module such as Gitteh. :)
I must have some benchmarker around, will try to adapt it to nodegit.
The generated code doesn't use boost. See https://github.com/nodegit/nodegit/blob/wip/src/revwalk.cc
This uses a custom code generator, so the code can generate any style of C++.
@iamwilhelm I didn't actually realise you were using gitteh in anger somewhere like that. That's pretty cool!
@samcday thanks! Gitteh did cause me to burst veins sometimes, when the lib didn't install or build on npm, but that's over now. Re: documentation -- how did you plan on generating the documentation? Did you have a Cake task somewhere, or a particular preference? Should we move the documentation talk to a different issue, and leave this thread for handover talk?
@jmendeth I don't think this is something that needs me to make a decision on actually! For one, I'm putting gitteh out there for some new people to take ownership of, so my vote means a whole lot less anyway.
@samcday I think we look to you, since you're the current maintainer, and you actually have an informed opinion of what might work for gitteh. You may not have to make the final decision, but you will probably have to guide the decision, or else the project will languish from indecision.
When it comes to needs and requirements, my needs are the following:
1) It has the higher level functions (what git refers to as porcelain). I'm not clear on the distinction between what everyone calls the higher and lower level functions right now, since I'm not as familiar with the libgit2 library, but I think it's important for the main API to match the commands you'd use on the command line day to day -- though I don't mind walking trees and commits myself. That will help new users wrap their heads around how to use the library more easily. Anything more than that is icing on the cake for me. That said, I do see the value in having coverage of the entire libgit2 lib, so you can implement your own stuff.
2) The API should be relatively stable. I know we're still at 0.17.x and not 1.0, but like others before me, I find no pleasure in building on top of shifting sands. That said, as long as we agree on something, and we're not changing things every couple of months, that's fine by me.
3) I hadn't thought about the issue with libuv and great variance in response times. But now that y'all mention it, yes, it would be bad for me to have high variance in response times when using the lib. A fast git lib with low response-time variance is something I'd get behind. But we'd need to benchmark the two different approaches.
I'm not clear on the distinction between what everyone calls the higher and lower level functions right now [...]
I can shed some light on this. The "Goldilocks" level of abstraction for libgit2 is to provide the building blocks you'd need to implement a full git client (command-line, visual, or otherwise), without actually being opinionated on how that client is structured. So a feature like clone is appropriate for this level, but parsing command-line arguments is not.
If you check the diff for that last link, you'll see that we ended up merging the command-line parsing stuff into the examples area, which is a great place to look if you want to know how to use libgit2, but a built binary is free from all that stuff.
The API should be relatively stable. I know we're still at 0.17.x and not 1.0 [...]
This is sort of the nature of libgit2 right now, I'm afraid. We've tried to get all the breaking changes out of the way, but there are no guarantees just yet. After we hit 1.0, we intend to fully commit to SemVer; we won't break the API without changing the major version number.
I understand. But don't worry, we'll still @ summon you. ;)
@jmendeth, I'm counting on it. :grinning:
The generated code doesn't use boost.
o_O You're right, I don't know where I saw that...
@iamwilhelm I'm pretty much inline with this view, I'm looking for an API much closer to the CLI git usage than the actual plumbing of libgit2.
After doing some additional testing, one of my concerns with the codegen approach seems to be confirmed: it opens the door to a lot of memory leaks. Node itself is not awesome in that regard, but I don't see how you can keep that under control without a lot of manual coding in the C++ bindings and additional attention on the JS side -- and that sort of defeats the purpose of using codegen in the first place.
A cursory review with valgrind (which should not be taken as gospel) of the log example I noted earlier shows that nodegit leaks:
1 run:

==19285== LEAK SUMMARY:
==19285==    definitely lost: 58,013 bytes in 1,738 blocks
==19285==    indirectly lost: 218,421 bytes in 5,173 blocks
==19285==      possibly lost: 512 bytes in 3 blocks
==19285==    still reachable: 43,163 bytes in 50 blocks
==19285==         suppressed: 0 bytes in 0 blocks

50 runs:

==19323== LEAK SUMMARY:
==19323==    definitely lost: 2,092,324 bytes in 53,720 blocks
==19323==    indirectly lost: 9,883,880 bytes in 246,865 blocks
==19323==      possibly lost: 707,256 bytes in 12,646 blocks
==19323==    still reachable: 48,326 bytes in 541 blocks
==19323==         suppressed: 0 bytes in 0 blocks
versus the sgit approach to the same log function:
1 run:

==19297== LEAK SUMMARY:
==19297==    definitely lost: 660 bytes in 46 blocks
==19297==    indirectly lost: 8,558 bytes in 10 blocks
==19297==      possibly lost: 512 bytes in 3 blocks
==19297==    still reachable: 42,117 bytes in 38 blocks
==19297==         suppressed: 0 bytes in 0 blocks

50 runs:

==19303== LEAK SUMMARY:
==19303==    definitely lost: 4,872 bytes in 93 blocks
==19303==    indirectly lost: 26,580 bytes in 452 blocks
==19303==      possibly lost: 0 bytes in 0 blocks
==19303==    still reachable: 4,727 bytes in 33 blocks
==19303==         suppressed: 0 bytes in 0 blocks
It's not a completely fair head-to-head comparison given that the two approaches are diametrically opposed, but it's the reason I chose the approach I did in sgit. My use case can't survive substantive memory leaking. It's not that the nodegit approach can't be fixed either; it's that it will take as much effort as simply writing quality C++ bindings in the first place and exposing release functions.
That I also see as a problem, though: node and JS in general do not have much convention for handling the manual release of resources. I think that will be a problem to the extent that any references and pointers to libgit2 items are brought all the way into the JS side. I can see a few cases where very limited use would make sense, but I don't see how to avoid doing it wholesale in the codegen approach. I'm generally a big fan of "automatic" bindings and was hopeful here, but the data has convinced me. I'm a huge fan of SWIG and such, but I just don't think it makes sense in the node/git case, or at least as it relates to my use of it. If there is a compelling counter-argument I'm all for hearing it.
Libgit2 has roughly 500 functions of which I could potentially see about 30 needing direct access from the JS side. Maybe even as few as a dozen. So to me my investigation leaves a couple relatively straightforward technical items that can determine how best to proceed and combine efforts.
I'm not advocating only for sgit; I'd be just as happy to contribute to nodegit. Bottom line: I need a library that can create repos, commit to repos and pull revisions/history (log) ASAP, and I need it not to leak memory. I don't really care how that gets accomplished and would prefer a solution that has multiple contributors. On the performance level, I think all the solutions here can be made to perform more than adequately for normal use cases.
Hmm, I wonder what is causing the memory leak since both sgit and nodegit are leaking. Could you share the test-case?
It seems that sgit doesn't do asynchronous calls, since it's using v8 types inside the same function where libgit2 is called. If that's indeed the case, I think it's skewed to compare the two projects at the level of memory or performance. Asynchronous calls are vital for node libraries, even though they do cost more cpu/memory. From what I can tell nodegit frees its libgit2 objects once the garbage collector of v8 cleans up its objects (in the destructor of its ObjectWrappers). I could see this as being one potential reason why you're seeing memory leaks: v8 might not delete all objects cleanly if the process is killed before v8 cleans up the garbage. It seems nodegit does do its freeing consistently for all its objectwrappers, but only if a libgit2-free function is defined. This sounds like a good choice: it assures all objects are freed correctly.
It might be a good idea to test the memory footprint by walking the tree a great number of times and give the GC some time to clean up the garbage. I'm also interested in whether valgrind also shows memory leakage when creating a great bunch of objects in node without using any nodegit/sgit/gitteh libraries.
Anyway, maybe we should do this memory leak discussion on nodegit's issue tracker.
That all being said, I do think it's better to have a minimal layer between node and libgit2. That assures us we can use all functionality that's in libgit2. The code generation allows keeping the bindings up-to-date with libgit2 more easily and assures everything works somewhat the same (no forgotten frees, for example). We can focus on getting these low-level bindings stable. On top of the low-level bindings we can build high-level convenience functions in JS (like clone). If we only implement the low-level functions we have use-cases for now, people will probably need to implement the bindings they're missing themselves and do pull requests for those. It worked for gitteh, but it requires a lot of time for outsiders to get familiar with the code and do things right. It would be better for them to contribute convenience functions in JS, since that's much easier, less error-prone and results in less disastrous situations when a mistake is made (an exception in JS that can be caught, instead of a segfault).
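As a sketch of that split, here is how a friendly log() could be layered in pure JS over hypothetical generated low-level bindings (every raw.* name below is a placeholder, not an existing API):

```js
// Illustrative convenience function in pure JS over hypothetical
// generated low-level bindings. All `raw.*` names are placeholders.
var raw = require('./raw-bindings'); // assumed codegen output

// A friendly log(): collect up to `limit` commit ids. Fixing or
// extending this layer never requires touching C++.
function log(repoPath, limit, callback) {
  raw.repositoryOpen(repoPath, function(err, repo) {
    if (err) return callback(err);
    raw.revwalkNew(repo, function(err, walk) {
      if (err) return callback(err);
      raw.revwalkPushHead(walk, function(err) {
        if (err) return callback(err);
        var commits = [];
        (function step() {
          raw.revwalkNext(walk, function(err, oid) {
            if (err || oid === null || commits.length >= limit) {
              return callback(err, commits); // end of walk or limit hit
            }
            commits.push(oid);
            step();
          });
        })();
      });
    });
  });
}
```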
My use-cases mostly require creating bare repositories, creating commits (based on previous commits/trees), retrieving history, retrieving trees, retrieving files, but all without an index/working directory. That's short-term, and I can use what's in gitteh for now. Long term I see use in merging, pushing/pulling and preferably streaming blobs (for saving and retrieving files directly, instead of buffering things in memory first).
@FrozenCow I would urge you to conduct your own tests; it's pretty straightforward. My examples were using the nodegit one from its API readme (Git Log Emulation) and the log example for node-sgit. The command line I was using for valgrind was "valgrind --leak-check=full --show-reachable=yes node test.js". The nodegit example does not appear to use async (on the JS side) and I was using sgit without async either. The scripts did nothing but run the example in a for loop, and neither printed the results out; only the API calls were made. I ran them against the node-sgit repo, which has roughly 500 commits.
As I understand it, the node/v8 GC should not be depended upon to clean up external resources -- see https://code.google.com/p/v8-juice/wiki/CommentaryOnV8#No_GC_guaranty . In the test cases I pointed out, I'm pretty sure the GC wouldn't be able to clean up those items anyway. For node, my point is that you're going to have to write a bunch of C++ cleanup code -- or even a borderline custom "GC", unless you want to push all the release work to the JS end user -- to address the fact that the node GC may never run. That's just as much work for 12 to 30 API calls as just writing those non-automatically (not codegen) to begin with.
I see your use cases encompass perhaps 12 or fewer API calls as well. Is anyone else in favor of a non-codegen approach? My mind is made up at this point; I'm happy to let others contribute to nodegit/gitteh/etc, but neither meets my needs today and I foresee months of work to iron out issues that seem apparent to me. I'd be happy to start adding other people's API needs to sgit if you're willing to test and contribute. If you can point to a roughly working example of your need in C, it can be added to sgit in about 15 minutes.
As I understand it, the node/v8 GC should not be depended upon to clean up external resources -- see https://code.google.com/p/v8-juice/wiki/CommentaryOnV8#No_GC_guaranty . In the test cases I pointed out, I'm pretty sure the GC wouldn't be able to clean up those items anyway. For node, my point is that you're going to have to write a bunch of C++ cleanup code -- or even a borderline custom "GC", unless you want to push all the release work to the JS end user -- to address the fact that the node GC may never run. That's just as much work for 12 to 30 API calls as just writing those non-automatically (not codegen) to begin with.
So, I'm not getting you... the V8 garbage collector will free most JS objects, unless they have been explicitly persisted on the C++ side using a Persistent handle. Those objects need to be freed manually by calling ->Dispose().
Other than that, yes, the GC should be expected to remove any V8 objects which have no references.
If you aren't doing those two things correctly, you are leaking. That's it.
@lushzero, you were using this code to test sgit? Just to confirm.
It might be a good idea to test the memory footprint by walking the tree a great number of times and give the GC some time to clean up the garbage.
You don't need to do that. You can call Node with the --expose-gc flag and then call gc() at any point in your code to clean everything.
I'm also interested in whether valgrind also shows memory leakage when creating a great bunch of objects in node without using any nodegit/sgit/gitteh libraries.
I'm also interested, since I looked extensively at node-sgit's code and couldn't find any leaks in the log function.
Ideal process would be:

1. require the module.
2. gc() & measure.
3. gc() and measure again.
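A minimal sketch of that process, run with node --expose-gc (runWorkload is a stand-in for the actual test):

```js
// Run with: node --expose-gc measure.js
// Sketch of "gc & measure, run the workload, gc & measure again".
function measure(label) {
  global.gc(); // force a collection so the numbers are comparable
  console.log(label, process.memoryUsage().heapUsed, 'bytes heapUsed');
}

function runWorkload(done) {
  // placeholder: run the git-log test from the library under test
  setImmediate(done);
}

measure('baseline:');
runWorkload(function() {
  measure('after workload:');
});
```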
@jmendeth I was using the code you identified, with the exception that I was pointing to an instance of the node-sgit repo instead of the test sandbox repo. I'm not sure valgrind is the absolute best tool for memory leak detection in node; something like this is probably better: https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/ but I'm not versed in it.
Node has a tendency to leak memory nearly everywhere for many reasons, but not least of which is closures. Here is perhaps the most famous example: https://github.com/vvo/node/commit/e138f76ab243ba3579ac859f08261a721edc20fe
V8 has a fundamental tradeoff in that the GC is NOT called until there is memory pressure. You need to make sure that long-lived code will not unnecessarily "leak" prior to the GC being called, or you end up with a very disadvantageous performance profile I would refer to as "boom and bust". The effect is exaggerated on larger systems as the scale of memory begins to greatly outstrip the speed of the CPUs.
I think you are oversimplifying the complexities of memory management a bit across a stack that encompasses C, C++ and V8 (node) see https://developers.google.com/v8/embed as a start. I would say as a general observation that if you are depending on the node GC to cleanup after your addon module you are going to have a bad time.
V8 has a fundamental tradeoff in that the GC is NOT called until there is memory pressure.
Yes, similar to Java's GC.
I think you are oversimplifying the complexities of memory management a bit across a stack that encompasses C, C++ and V8 (node) see https://developers.google.com/v8/embed as a start.
I may have explained myself badly, but I had already read that page several times and I understand it (or at least I think so!) correctly. Yes, V8 values are freed when they have no references to them, and by "references" I mean JS and C++ handles. Local handles are deallocated when the scope exits, so you shouldn't have problems with that. Persistent handles have to be Disposed manually. And any other value (which is not managed by V8) won't be destructed [directly].
So if a single call to my API creates lots of V8 handles but doesn't leak them (i.e. they have no extra references that avoid them being GC'd) everything should be okay?
I would say as a general observation that if you are depending on the node GC to cleanup after your addon module you are going to have a bad time.
Does that mean that I should for example have a free method on Repository objects instead of relying on GC to notify me when they've been removed?
Correction: I explained myself badly again. V8 values are ready to be freed on the next GC round when they have no references pointing at them.
I think I've "lost sight of the forest for the trees", so please forgive a few questions.
There is no question that there are a few memory leaks in the generated code -- just as there is no question there are a few bugs -- this is a work in progress! So please bear with me. I have some valgrind data that I'll share below, but I wanted to make sure I understood a few of your criticisms first.
Debugging memory leaks isn't super-obvious in node.js -- you definitely need to ensure that you call the garbage collector global.gc() explicitly if you're using valgrind. I'm not sure, @lushzero, whether the data you pasted above (thanks for that!) takes this into account or not. I'm also not clear if you're running your test on the generated code nodegit/wip or just nodegit/master -- the example from the README that you cite doesn't make sense for nodegit/wip since the API has changed.
It's not categorically different to do memory management in generated versus hand-written code. The generated code generates free()s and so forth. You have to configure the system to put them in the right place, just as you have to remember to put them in the right place when you're coding by hand. I would argue that it's in fact easier, because the generator will -- once you configure it to free for a certain class of cases -- do this automatically all over the place.
@lushzero you are totally right that garbage collected runtimes have "boom and bust". It's the same in the JVM. However, this is a fundamental "limitation" of node.js -- I don't see how you can get around it. You can call free() on the libgit2 objects -- but when you pass your JSON hashes as strings in https://github.com/DerpKitten/node-sgit/blob/master/src/sgit.cc#L215

Local<Value> argv[2] = { Local<Value>::New(String::New(err)), Local<String>::New(String::New(stmp.c_str())) };

-- those strings need to be gc'd by v8, and if the user parses those strings with JSON.parse() -- well, that too needs to be GC'd by v8. If you're going to pass data to javascript land, you're going to produce garbage.
Another thing I'm unclear about is the 'async' issues you refer to, @lushzero . As you know there is no 'locking' or 'threading' in javascript -- this only applies to C++ bindings. libgit2 does blocking IO, and thus many functions need to be made async (i.e., pass the baton to the libuv threadpool). They need to be async because node.js is a single-threaded event loop; if you do blocking IO you will cause disastrous performance. Currently, sgit lacks the thread scheduling code that would be necessary for it to be production ready, but I'm sure it's easy for you to add.
As a philosophical note, I'm interested in providing the majority of libgit2's functionality. I am building a library, like libgit2 itself, where, as @sam said,
"The "Goldilocks" level of abstraction for libgit2 is to provide the building blocks you'd need to implement a full git client (command-line, visual, or otherwise), without actually being opinionated on how that client is structured."
This is not possible with a few dozen functions. This requires several dozen functions. I understand why you think you don't need this, but if you were to actually consider all the things that the git command line tool can do (like rebasing, merging with different strategies, creating tags, manipulating the index, making commits without an index, and on and on) you would realize you need a lot of functionality! The libgit2 people aren't just adding features for no good reason. True, 20% of the libgit2 functions are C-idiomatic, like allocfmt, but 80% are useful in javascript land.
I'm also curious why you said, @lushzero ,
My mind is made up at this point
Isn't it a bit premature to have already made up your mind when you (like me!) are still learning about memory management and concurrency in node.js, when your preliminary valgrind tests may not be accurate for reasons pointed out by @jmendeth , and when you haven't seen whether it takes a long time to fix memory leaks and bugs in generated code or not. Maybe it'll take forever! Maybe it'll take me a day or two. Are you sure you have enough information to make up your mind?
In conclusion I'd like to share some valgrind data. Check out this gist: https://gist.github.com/nkallen/6430565 -- this provides a repeatable valgrind testcase. You'll note, after 20 runs:
==26064== LEAK SUMMARY:
==26064== definitely lost: 1,216 bytes in 48 blocks
==26064== indirectly lost: 8,192 bytes in 1 blocks
This isn't so bad is it? And as far as I can tell, none of these "leaks" have to do with nodegit. They're internal caches of libgit2 and objects that are internal to node.js.
But I did fix a couple memory leaks today! Check out https://github.com/nodegit/nodegit/commit/7f9f3c42d58a13f54ce9481652b346fab4217e7b and https://github.com/nodegit/nodegit/commit/79f111573fe9308fd847e96e1873662ed597411a
I think the thing to take away from these commits is fixing memory leaks is not only fairly easy, but it can actually be done systematically: fix it in one place, fix it in many places.
@jmendeth I don't want to sound like an expert on V8 memory management, because I'm definitely not. Also, the terminology here becomes a complete mess with a word like "reference", which means one thing in C, another in C++ and another in V8. My point about "boom and bust" may not apply to node 0.10, as incremental deallocation seems to be employed in some circumstances as of 0.10 or maybe even 0.8.
In short: closures, wrapped objects, and failure to release memory allocated inside the addon all make it really hard, and in many cases impossible, to depend on the GC for deallocation.
Valgrind is not the be all and end all here, it sounds like the tool I pointed to in my last post is much more of a gold standard for node. Valgrind just tells me enough to say that there appear to be some very difficult problems in the codegen approach with regard to memory management.
Does that mean that I should for example have a free method on Repository objects instead of relying on GC to notify me when they've been removed?
If you aren't calling the various git_repository_free(), git__free(), etc., then definitely yes, those have to be called somewhere. Resources don't deallocate themselves, GC or not. Having a manual free method on the JS side is possible but very un-node-like. This is my point about having to create almost a "virtual GC" (not really an appropriate term, but I lack a better one) in the addon code. I guarantee that will take more time and energy than simply coding a dozen or two API functions in C++.
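For illustration, an explicit-release surface might look like this from JS; all names here are hypothetical, with free() calling down into git_repository_free() in the addon:

```js
// Hypothetical explicit-release pattern (not an existing gitteh or
// nodegit API): free() releases the underlying libgit2 handle
// deterministically instead of waiting on a GC pass that may never run.
var repoModule = require('./hypothetical-bindings');

repoModule.open('/path/to/repo.git', function(err, repo) {
  if (err) throw err;
  repo.log(function(err, commits) {
    repo.free(); // done with the native handle: release it now
    if (err) throw err;
    console.log(commits.length, 'commits');
  });
});
```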
@nkallen Move the for loop to encompass the git.Repo.open call and move the global.gc call to run at the end of each loop and that will match my numbers. The gc call doesn't make much difference. I tried both the regular and wip branch, wip was better but not by a whole lot. It's just my opinion but I would posit that this quote "...fixing memory leaks is not only fairly easy, but it can actually be done systematically: fix it in one place, fix it in many places." will haunt you to the ends of the earth ;) I recall a long time ago saying something like that about a project and it haunts me still.
My mind is made up in that I need to have something passable soonish, and I can easily see chasing edge cases for a long time. Yes, the benefit of codegen is that you "fix it in one place, fix it in many places", but then you also need to start branching out to address edge cases, and pretty soon the code complexity is through the roof.
If libgit2 were a v1, final, well-established library and you said "let's use codegen once and patch on top of the result", I don't really see a problem with that. But libgit2 is due for plenty of changes itself, and I fear how that is going to mesh with a codegen approach. You said "... but 80% are useful in javascript land." regarding the libgit2 API calls. I think that is madness; I can see maybe 3-5% being reasonable to expose to JS/Node. Imagine if zlib exposed its 90-odd API calls instead of the two dozen it does in node. That just doesn't fundamentally make sense to me; Node/JS is a very high level language.
We can agree to disagree about the level of abstraction. The command line git has 22 options including help, everything else is parameterized. That would be a rough approximation of the maximum API I would be looking at in sgit, mostly flat with parameterized options.
I think you meant another @sam.
-Sam
@lushzero Here is the valgrind output for the loop moved outside the opening of the Repo, with gc in the same place -- where it should be. https://gist.github.com/nkallen/6431776 As you can see, there is essentially no change. No memory leaks.
==27613==    definitely lost: 1,216 bytes in 48 blocks
==27613==    indirectly lost: 8,320 bytes in 2 blocks
As far as I can tell, you have made a mistake in your profile. At this point, you must admit that nodegit/wip is not leaking memory for the git log use-case, or -- if not -- please provide a repeatable test case in a gist so we can evaluate what's going on.
Again, I don't really understand your arguments against code-gen. You seem to be bringing irrelevant considerations into the discussion. What do closures have to do with codegen versus non? What does freeing locally malloc'd objects have to do with codegen? Everything code-gen malloc'd is code-gen free'd, just as everything hand-malloc'd is hand-free'd.
And you've provided no evidence nor arguments why wrapped objects don't work. They are garbage collected just as every other object in the js world is; they are in fact a standard way to do library bindings in the node.js world.
I'm not sure what your "haunting" experience has to do with this case. You seem to be promulgating fear rather than an argument.
@nkallen Here is something substantially similar to the original code, which I did not save. Rerunning it just now produces results very close to what I posted earlier. https://gist.github.com/lushzero/6432272
The exact command line I used last was: valgrind --leak-check=full --show-reachable=yes node --nouse_idle_notification --expose_gc gist.js > log2 2>&1
I feel I've adequately expressed my concerns regarding the codegen approach and why I will be pursuing something else. You seem to be taking this personally and that is in no way my intention. I'm essentially betting that the headaches from upstream libgit2 changes and chasing a variety of edge cases including memory management in the codegen case will be a lot more work than coding a flat set of self contained API calls manually, that's it. Also we don't seem to agree at all on what the user facing API should look like, fair enough.
In the simple example I tested, which reflects one small element of my immediate use case, nodegit leaked between 100K and 2.5MB depending on the branch and other parameters (such as using the non-wip branch readme API example code) in a basic 50-loop cycle of the script. "Leaked" might as well be substituted with "used" for my purposes, depending on the resolution or the need to depend on the GC. I may not have been choosing the optimal example code, but I went with what was posted in the nodegit readmes, which is why I chose that example to begin with.
I'm not saying that it can't be fixed, I see you did commit several patches for memory management already today, I'm just saying that amongst the other reasons I've noted above I don't see nodegit as a workable solution to the needs I have. Node-gitteh also is not a good fit as the binding code is a little clumsy, it lacks needed functionality and seems not to have been vetted for memory leaks or consumption/use either. Finally let me say I started yesterday with hope that nodegit would be a good fit for my needs but my own testing, however flawed, led me elsewhere.
If you are insisting that there aren't any memory issues in nodegit as it stands so be it, everyone can run their own benchmarks and decide for themselves, which is what I suggested in the first post in this thread.
Valgrind appears not to be a good tool to hunt and resolve memory leaks in a node application. I used it because it was handy. Most people seem to think https://github.com/lloyd/node-memwatch is the best one. I feel as if this thread has gone wildly off track and a good deal of that is my fault for which I apologize. I would suggest further discussion of memory specifics and benchmarking be moved to a thread on the nodegit issues list or a new thread in this forum.
I apologize for taking this personally. I hadn't had enough to eat and my temper was acting up.
FWIW, there is a bug in your benchmark, and so you are being misled by your data. You have the following code:
for (i = 0; i < 50; i++) {
  git.Repo.open('/root/lushzero/node-sgit/.git', function(error, repository) {
    ...
  });
  global.gc();
}
The place you put the global.gc() is incorrect. git.Repo.open, etc. is asynchronous, meaning global.gc() runs 50 times BEFORE you even open the repo. This is the reason you're seeing what looks like a memory leak. But it isn't a memory leak. If you were to do this:
process.on('exit', global.gc)
You'd ensure that the garbage collector runs just before the process exits, which is the only way to get accurate data out of valgrind.
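Put together, a valgrind-friendly shape for that benchmark might look like this (a sketch; the repo path is the one from the gist):

```js
// Sketch of the corrected benchmark: serialize the async iterations
// rather than firing 50 opens at once, and register gc() to run at
// process exit so v8 collects before valgrind takes its leak summary.
var git = require('nodegit');

process.on('exit', global.gc); // requires --expose-gc

(function iterate(i) {
  if (i === 50) return;
  git.Repo.open('/root/lushzero/node-sgit/.git', function(error, repository) {
    if (error) throw error;
    // ... walk the log here, as in the gist ...
    iterate(i + 1);
  });
})(0);
```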
For future benchmarks, please skip any versions of nodegit other than the wip branch. wip is a complete rewrite; it is the only version that uses codegen. Older versions of nodegit are most similar to node-gitteh in style. And finally, if this assuages any of your concerns, the wip codegen is generated from a slightly modified version of libgit2's documentation. That is, it takes a modified version of libgit2's own docurium specification of the API and generates code from that (see https://github.com/nodegit/nodegit/blob/wip/v0.18.0.json ). This approach, you will note, is resilient to dramatic API change.
@nkallen @lushzero
So I coded a little program that will asynchronously run that git log emulation test on the latest wip nodegit. It'll use global.gc and memwatch to take snapshot diffs and display them beautifully on your console. You'll be able to see exactly what type is leaking. It has a variable number of iterations.

You should totally try it out: https://gist.github.com/jmendeth/6434819
In my test, the repository is only opened one time. Everything else is repeated. I think I'll change it to constantly open and close the repository, but I think this is a good place to start.
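For anyone skipping the gist, the heart of the technique looks roughly like this (a sketch of the approach, not the gist itself; runIterations stands in for the nodegit log loop):

```js
// Snapshot-diff sketch with memwatch: diff the heap around the
// workload, forcing gc so the diff only shows what truly survives.
var memwatch = require('memwatch');

function runIterations(done) {
  // placeholder: run the git-log emulation N times, then call done()
  setImmediate(done);
}

global.gc();                      // needs --expose-gc
var hd = new memwatch.HeapDiff(); // "before" snapshot

runIterations(function() {
  global.gc();
  var diff = hd.end();            // "after" snapshot, diffed
  console.log(JSON.stringify(diff.change.details, null, 2));
});
```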
I have tried to put console.logs around to make sure they are looping, and also tried putting in timeouts so that the GC can clean everything, but the results are always the same:

- before is always the same.
- Repo, RevWalk, History, EventEmitter are correctly deallocated. Only generic types like Code, Array, Object appear.

At 500 iterations, we see some nodegit-specific types leaking:
Commit: +6 (which is the number of commits in the repo)
EventEmitter: +1
Oid: +1
Repo: +1 (yeah, that makes sense)
RevWalk: +1
You can see the results here: https://gist.github.com/jmendeth/6440106
@jmendeth -- This script is awesome though, thanks! I'd love it if we can continue this discussion over here https://github.com/nodegit/nodegit/issues/84 . As @lushzero rightly mentioned, at this level of technical detail we are derailing the broader discussion.
Some quick observations:

- memwatch detects leaks in js code only; it won't detect missing frees in the C++ code. Valgrind is still the only way to do that AFAIK.
- v8 processes only part of the object heap in most garbage collection cycles. This minimizes the impact of stopping the application. I don't know how to force a full gc, but if you incrementally call gc() as in https://gist.github.com/nkallen/6442106#file-gistfile1-js-L4 , even for 5000 runs there are no nodegit objects left over.
it won't detect missing frees in the C++ code. Valgrind is still the only way to do that AFAIK.
Yes. :(
this is simply due to the fact that v8 does partial garbage-collections.

hum, I thought that global.gc() would force a full garbage-collection; if that's not true, maybe it's a better idea to use memwatch's own GC:

To aid in speedier debugging, memwatch provides a gc() method to force V8 to do a full GC and heap compaction.
@lushzero, before we can use sgit, it first needs to be fully asynchronous. If it is not, weird behavior will crop up, and comparing it to other libraries wouldn't be fair (whether we're comparing memory, cpu or complexity does not matter). Or am I still missing something? Is sgit already async? I was under the impression that v8 objects can't be used outside of v8's main thread, so I presumed it was not.
Anyway, I've dabbled a bit with nodegit. It seems to work pretty well, and although I have found some minor issues (documentation slightly wrong and one use-case not correctly implemented), all in all I'm pretty impressed. I like that it's just a 1-to-1 mapping of libgit2, which means you can just look up libgit2's documentation if you want details about a particular function. It might also be a good idea to include those mappings in the documentation where possible.
I haven't used all functions that I used for gitteh yet, but for now I'm very positive about the current work. Compared to gitteh it already has more up-to-date documentation, it is easier to find the function you need and it was easy to find the function that was causing a bug and fix it. This is mostly due to the 1-to-1 mapping and the fact that there is documentation (although still minimal and wrong in places).
I had an epiphany this week: I've been selfish when it comes to Gitteh.
I've acknowledged several times in issues / emails that I don't really have enough time for Gitteh. But then, like most developers, I'm presented with a problem of some kind and want to solve it. Then suddenly Gitteh seems interesting again, and I feel all inspired and start making promises that I generally don't keep.
Today, I think node-gitteh holds the most promise of the git libraries for Node. Here's why:
I think the design choices made for gitteh are sound, there's been some quibbling regarding some implementation choices (such as CS), but I think these can be addressed in the short-term easily enough.
SO. What I need is for someone to step up to the plate. I need someone who wants to become the primary maintainer for gitteh.
Please mention your interest here, and we'll hash it out. The person I pick to be the primary maintainer will ideally be someone who's shown a healthy amount of interest in gitteh over the past few months and has some Github projects demonstrating an understanding of Node and C++ (read: the best candidate would be someone who's written a Node native module before). Once I find the right person, I will ensure they are given contributor access to this repository.