rvaughn / cvs-fast-export

CVSNT-to-Git conversion utility
11 stars 4 forks

There should be only one cvs-fast-export. #11

Open eric-s-raymond opened 4 years ago

eric-s-raymond commented 4 years ago

There is another cvs-fast-export. It lives here: https://gitlab.com/esr/cvs-fast-export

My cvs-fast-export is a couple of years older, but I'm not writing to ask that you rename yours. Rather, I'm offering you the use of my rather exhaustive test suite, and hoping your cvs-fast-export is enough better that I can retire my code.

My implementation is in C. It is very thoroughly field-tested and fast as hell, but nobody understands the clique-analysis code, not even Keith Packard, who originally wrote it. Consequently, there are some bugs in weird cases that I don't think I'm ever going to be able to fix. C is not the best language for this kind of work, anyway.

Can we join forces to figure out if I can deep-six the C code? If yours is up to the job I'd rather join your project than continue mine.

rvaughn commented 4 years ago

Hi Eric. I was unaware of your project and didn't manage to find it before I wrote this one.

TBH, I haven't touched or used this code in years. It's kind of a dead project, but I'd be happy to run your test suite against it. CVS is chock full of corner cases, even assuming maintainers don't pull weird tricks with it. When they do...

My code is pretty fast, at least compared to everything else I tested, but I suspect yours is much faster. I don't think there is a good language for this kind of work, though I know your C version is almost certainly more portable and maintainable than my beginner's Scala.

In short, I'm up for the challenge, though I won't be able to get to it until this weekend at the soonest.

eric-s-raymond commented 4 years ago

Excellent. I also suspect mine is much faster, but I'd cheerfully take a significant performance hit to get correctness in a wider range of cases.

As for the best language for this kind of work, my current choice is Go. Which, if you're not familiar with it, is like...hmmm...statically typed Python that runs at C speed except for occasional short (sub-millisecond) GC pauses.

rvaughn commented 4 years ago

I'm familiar with Go, but I haven't embraced it yet. There may be a rewrite possibility there after the tests are incorporated.

Anyway, let's start at the beginning. I'll adapt and run your test suite this weekend, and we'll see where we go from there.

eric-s-raymond commented 4 years ago

Roger Vaughn wrote:

I'm familiar with Go, but I haven't embraced it yet. There may be a rewrite possibility there after the tests are incorporated.

Yeah. That'd probably be good for a performance boost.

Anyway, let's start at the beginning. I'll adapt and run your test suite this weekend, and we'll see where we go from there.

OK. If you want to reach me in real time while you're experimenting you can find me on the ##esr or #reposurgeon channels at freenode IRC. I'll try to be helpful.

rvaughn commented 4 years ago

Early report: There are some big differences in our two projects. For one, I have only the main conversion utility where you have several companion utilities - which your tests also cover. I likely will not be able to cover those in current form, though that may not be important.

My project operates on complete CVS repos or (informal) modules, where yours is file-based. I never even considered that approach, which gives you a little bit of extra flexibility that I don't enjoy.

Unsurprisingly, our option sets are quite different, which hints at different behaviors under the covers. How much effect that has on the output I have yet to determine.

On one test repo I have, your conversion takes 2 CPU seconds, where mine takes 19. Quite a difference! The warning output from the two also hints at differences in graph-building, though again I have not yet analyzed to what extent. However, as I'm sure you're well aware, building a Git graph from CVS commits is a best-guess effort, so they may be different but equally valid. FWIW, my design goal was for each Git commit to exactly reproduce the CVS working state at the imputed commit time. (In other words, the working state as generated by cvs co -D or cvs co -r.)

Next step, actually executing your tests.

I'm going to predict now that our best bet is probably a complete rewrite of my code (my graph builder is fairly easy to understand for the most part) with functionality falling somewhere in the middle of the two projects. Plus that will give me an excuse to do a deep-dive in Go.
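For readers unfamiliar with why graph-building is a "best-guess effort": CVS records revisions per file, so a converter has to guess which per-file revisions belonged to one logical commit. The sketch below shows one common heuristic (same author, same log message, timestamps close together). It is an illustration only, not the algorithm used by either project; `FileRevision`, `coalesce`, and the 300-second `window` are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class FileRevision:
    """One per-file CVS revision (hypothetical record for illustration)."""
    path: str
    revision: str
    author: str
    log: str
    timestamp: int  # seconds since the epoch


def coalesce(revisions, window=300):
    """Group per-file revisions into guessed changesets.

    Revisions with the same author and log message join one changeset
    while each falls within `window` seconds of the previous one and
    its file is not already present in that changeset.
    """
    ordered = sorted(revisions, key=lambda r: (r.author, r.log, r.timestamp))
    changesets = []
    for _, group in groupby(ordered, key=lambda r: (r.author, r.log)):
        current = []
        for rev in group:
            same_file = any(r.path == rev.path for r in current)
            too_late = bool(current) and rev.timestamp - current[-1].timestamp > window
            if current and (same_file or too_late):
                # Start a new changeset: a file cannot change twice in one commit.
                changesets.append(current)
                current = []
            current.append(rev)
        changesets.append(current)
    # Present the guessed commits in chronological order.
    changesets.sort(key=lambda cs: cs[0].timestamp)
    return changesets
```

Two converters can pick different windows or tie-breaking rules and produce different, equally defensible commit graphs from the same repository, which is the point made above.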

eric-s-raymond commented 4 years ago

I likely will not be able to cover those in current form, though that may not be important.

It isn't. cvssync is a wrapper around rsync, and cvsconvert is a wrapper around cvs-fast-export. Both could be moved to your project - or adapted to work with a hypothetical Go rewrite - with little difficulty. There's also a Python module used for generating tests called testlifter.py; that shouldn't be difficult to port either.

My project operates on complete CVS repos or (informal) modules, where yours is file-based.

Yeah, it turns out the module structure has no semantic freight. And doing it my way gives us lifting of RCS collections for free.

I'm going to predict now that our best bet is probably a complete rewrite of my code (my graph builder is fairly easy to understand for the most part) with functionality falling somewhere in the middle of the two projects. Plus that will give me an excuse to do a deep-dive in Go.

I like the direction of your thinking. Count me in - tests, Go experience, and all. The thought of scrapping my dodgy C code for a clean Go port with a comprehensible clique analyzer makes me happy.

eric-s-raymond commented 4 years ago

If you want to deep-dive into Go working on version-control stuff, you might want to try reading or at least skimming this Go code:

https://gitlab.com/esr/reposurgeon

This is the tool cvs-fast-export is a front end for. Grokking all its code isn't necessary and would probably be too hard, but I do recommend getting some grasp of how the data structures fit together. There are necessarily going to be very similar ones in whatever we write.
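As a rough picture of the kind of data structures involved, here is a minimal sketch of blobs and commits serialized as a git fast-import stream, the format a fast-export front end emits. The `Blob`, `Commit`, and `emit` names are hypothetical and are not taken from reposurgeon or either cvs-fast-export; the sketch assumes a fixed file mode of 100644 and UTC timestamps.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Blob:
    mark: int        # fast-import mark used to refer to this content
    content: bytes


@dataclass
class Commit:
    mark: int
    branch: str
    committer: str   # "Name <email>"
    when: int        # seconds since the epoch, assumed UTC
    message: str
    files: Dict[str, int] = field(default_factory=dict)  # path -> blob mark
    parent: Optional[int] = None                         # mark of parent commit


def emit(blobs, commits):
    """Serialize blobs and commits as a git fast-import stream (bytes)."""
    out = []
    for blob in blobs:
        out.append(b"blob\nmark :%d\ndata %d\n" % (blob.mark, len(blob.content)))
        out.append(blob.content + b"\n")
    for c in commits:
        msg = c.message.encode("utf-8")
        out.append(b"commit refs/heads/%s\nmark :%d\n" % (c.branch.encode("utf-8"), c.mark))
        out.append(b"committer %s %d +0000\n" % (c.committer.encode("utf-8"), c.when))
        out.append(b"data %d\n" % len(msg) + msg + b"\n")
        if c.parent is not None:
            out.append(b"from :%d\n" % c.parent)
        # Each file modification points at a previously emitted blob mark.
        for path, mark in sorted(c.files.items()):
            out.append(b"M 100644 :%d %s\n" % (mark, path.encode("utf-8")))
        out.append(b"\n")
    return b"".join(out)
```

The marks are what tie the structures together: commits refer to blobs and to parent commits by mark, so any converter needs some table mapping its internal revisions to marks.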