Problem importing CVS module

YorkZ commented 8 years ago

Hi Ralph Loader,

First of all, thank you so much for this amazing tool. It's really promising, but one thing is stopping me from proceeding.

I'm trying to import a CVS module at work to git repositories. Let's say the module's name is AB+ which contains about 20 sub directories inside a directory called "project", say, project/1-1, project/1-2, ... project/1-20. I was able to import each directory 1-1 to 1-20 individually into 20 git repositories by using:

$ crap-clone :pserver:user@our.server.com:2401/cvsroot/software project/1-N

where N is an integer in the range [1, 20].

However, I was unable to import the module AB+ into a single git repository by:

$ crap-clone :pserver:user@our.server.com:2401/cvsroot/software AB+

and I always get the error:

RCS file name '/cvsroot/software/project/1-1/.donotprune,v' does not start
with prefix '/cvsroot/software/AB+/

I've tried to work around this by using 20 git repositories, but unfortunately that doesn't work well because people always commit changes in different directories (1-1 to 1-20) in one CVS commit. The 20 subdirectories cannot be treated as 20 independent repositories; they are closely related and are really one single project.

I would really appreciate it if you would help me with this.

Thanks in advance!

York

rcls commented 8 years ago

Hi York, My guess is that you are using the CVS "modules" file to map the subdirectories into software/AB+. (That would be in /cvsroot/software/CVSROOT/modules).

Is that correct? Is the CVS repository publicly accessible?

I never wrote code to support the "modules" file in crap-clone, unfortunately.

If all the subdirectories are in one place, then you might get a good enough clone using the real server side directory:

crap-clone :pserver:user@our.server.com:2401/cvsroot/software project

Or you could remove the "CVSROOT/modules" file and clone the entire CVS repo:

crap-clone :pserver:user@our.server.com:2401/cvsroot/software .

(If you end up with the CVS repo cloned, but the directory layout wrong, then "git filter-branch" can be used to modify the git repo).

Let me know if this is any use.

There are a couple of other cvs-to-git converters around; you could see if they do better.

Cheers, Ralph.

YorkZ commented 8 years ago

Hi Ralph Loader,

Thank you very much for your help.

> My guess is that you are using the CVS "modules" file to map the subdirectories into software/AB+.

In the file CVSROOT/modules, there is a line referring to the module AB+:

AB+         -a  1-1 1-2 1-3 1-4 1-5 1-6 1-7 1-8 1-9 1-10 1-11 1-12 1-13 1-14 1-15 1-16 1-17 1-18 1-19 1-20

> Is the CVS repository publicly accessible?

Unfortunately, the repository is highly confidential, and nobody is allowed to disclose anything publicly. I don't think I have access to the entire repository either.

> If all the subdirectories are in one place, then you might get a good enough clone using the real server side directory:
>
> crap-clone :pserver:user@our.server.com:2401/cvsroot/software project

I've tried to clone "project", but I got the following error:

File 99-35/.donotprune branch CDEFG duplicates branch CDE (1.1.1)
Killing zombie version PATH/TO/FILE1.cpp 1.15
Killing zombie version PATH/TO/FILE2.XML 1.7
Killing zombie version PATH/TO/FILE2.cpp 1.12
cvs: cvs [rlog aborted]: could not chdir to Network: Permission denied
Expected RCS file line, not error

> Or you could remove the "CVSROOT/modules" file and clone the entire CVS repo:
>
> crap-clone :pserver:user@our.server.com:2401/cvsroot/software .

I don't think I have access to remove the "CVSROOT/modules" file on the server. But I did try to clone the entire repo and got the error:

RCS file name '/cvsroot/software/CVSROOT/avail,v' does not start with prefix
'/cvsroot/software/./'

> There are a couple of other cvs-to-git converters around; you could see if they do better.

So far, crap-clone is the only tool that has worked (partly) for me. I didn't bother with cvs2git because, as far as I know, it requires a local account on the CVS repo, which I don't have. I've also tried git cvsimport:

$ git cvsimport AB+

But I got the error:

Initialized empty Git repository in ~/tmp/AB+/.git/
Unknown: E cvs checkout: `1-1/FILE.rpm.specs' is no longer in the repository at /usr/local/libexec/git-core/git-cvsimport line 511, <GEN0> line 26818.

However, I was able to clone project/1-1 using git-cvsimport:

$ git cvsimport project/1-1

But it was extremely slow.

I also tried cvsclone which gave me the "1.32 Segmentation fault" error.

I would appreciate any help or suggestions you could give me.

Thanks a lot

York

YorkZ commented 8 years ago

FYI

I just tried using another tool called cvsclone to clone the entire CVS repo under "project" to my local drive. It seemed to work well until the point where it tried to enter the directory "project/Network". Here's the error it reported:

cvs rlog: Logging project/Network
cvs [rlog aborted]: could not chdir to Network: Permission denied exit: 1

rcls commented 8 years ago

It looks like you don't have permissions to that part of the repository on the server side. You have a couple of options to work around this, neither especially pleasant:

1. Clone each directory in project/ separately, and then use git to merge them all together. It should be possible using git-filter-branch and git-stitch-repo. You would have to work out the details yourself though.

2. Alternatively, you could modify crap-clone (or cvsclone) to change the 'cvs rlog' request it sends to the server; in my crap-clone.c this looks like:

cvs_printff (&stream,
             "Global_option -q\n"
             "Argument --\n"
             "Argument %s\n"
             "rlog\n", stream.module);

Instead of sending one 'Argument' line for the top level directory (stream.module), send one for each sub-directory that you are interested in. I haven't tried this, so you will need to experiment a bit to get it right...
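As a rough illustration of that suggestion, the sketch below builds the same request text with one 'Argument' line per sub-directory. It is not code from crap-clone: the helper name build_rlog_request, the buffer-based interface, and the assumption that each directory should be sent as "module/dir" are all hypothetical, and the result has not been tried against a real server.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of crap-clone): write an rlog request that
 * names each wanted sub-directory instead of the top-level module.
 * Assumes the server accepts paths of the form "module/dir".
 * Returns the number of bytes written, or -1 if the buffer is too small. */
int build_rlog_request(char *buf, size_t size, const char *module,
                       const char *const *dirs, size_t ndirs)
{
    size_t used;
    int n = snprintf(buf, size, "Global_option -q\nArgument --\n");
    if (n < 0 || (size_t) n >= size)
        return -1;
    used = (size_t) n;

    /* One Argument line per sub-directory of interest. */
    for (size_t i = 0; i < ndirs; i++) {
        n = snprintf(buf + used, size - used, "Argument %s/%s\n",
                     module, dirs[i]);
        if (n < 0 || (size_t) n >= size - used)
            return -1;
        used += (size_t) n;
    }

    n = snprintf(buf + used, size - used, "rlog\n");
    if (n < 0 || (size_t) n >= size - used)
        return -1;
    return (int) (used + (size_t) n);
}
```

Called with module "project" and directories "1-1" and "1-2", this yields two 'Argument' lines (project/1-1 and project/1-2) between "Argument --" and "rlog", so a directory such as Network is never requested.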

rcls commented 8 years ago

Actually, I wrote a quick hack to let you do this. Try the branch directory-limit from my repo. This adds a command-line option to list the directories you want to clone. So you should be able to do:

crap-clone -d 1-1 -d 1-2 -d 1-3 :pserver:user@our.server.com:2401/cvsroot/software project

and this will include the directories 1-1, 1-2, 1-3 but ignore Network.

YorkZ commented 8 years ago

This is super amazing Ralph. Thank you so much!

I'm currently cloning all the projects. It takes a while. I'll let you know the results tomorrow morning after I arrive at the office.

Thanks again!

York

YorkZ commented 8 years ago

Hi Ralph,

I seem to have successfully cloned all the projects within the module AB+ into a single git repository. However, one thing I've noticed is that each git commit contains only one file. In other words, a single CVS commit has been split into several git commits, one commit per file. I guess this is because CVS doesn't have the changeset concept, right? If this is the case, I guess it would not be straightforward to re-assemble the changesets to create one git commit each, right?

Thanks,

York

rcls commented 8 years ago

Hi York, I use heuristics based on the metadata to try and put things into changesets: if the commit message / author etc. are all identical, and the timestamps are not too far apart, then I combine the revisions into a changeset. It sounds like this has gone wrong for you for some reason. Without any access to your CVS repo, it is hard for me to track down. The things I would look at:

Does your CVS repo have any commit hooks making the commit message different per-file?
It is possible the use of CVS commitid is getting things wrong. I suspect that some CVS clients may commit files one-by-one, getting different commitids. I have created an (untested) branch no-commitid to ignore the commitid - see if that makes any difference.
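To make the heuristic concrete, here is a minimal sketch of grouping by identical metadata plus a time window. It is not the code in changeset.c: the struct layout, the function names, and the 300-second window are assumptions chosen for illustration, and it expects the revisions to be pre-sorted so that candidates for the same changeset are adjacent.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Illustrative revision record; the field names are assumptions. */
struct revision {
    const char *author;
    const char *log;
    const char *branch;
    time_t      when;
    int         changeset;   /* filled in by group_revisions() */
};

/* Assumed window; the real threshold is not stated in this thread. */
#define FUZZ_SECONDS 300

static int same_metadata(const struct revision *a, const struct revision *b)
{
    return strcmp(a->author, b->author) == 0
        && strcmp(a->log, b->log) == 0
        && strcmp(a->branch, b->branch) == 0;
}

/* revs must already be sorted so that revisions with equal metadata are
 * adjacent and ordered by time.  Assigns a changeset number to each
 * revision and returns the number of changesets created. */
int group_revisions(struct revision *revs, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && same_metadata(&revs[i - 1], &revs[i])
            && difftime(revs[i].when, revs[i - 1].when) <= FUZZ_SECONDS)
            revs[i].changeset = revs[i - 1].changeset;  /* join previous */
        else
            revs[i].changeset = count++;                /* start a new one */
    }
    return count;
}
```

With this shape, a commit hook that rewrites the log message per file, or a per-file commitid added to the comparison, would make same_metadata() fail and produce one changeset per file.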

If you want to debug this yourself, the relevant code is in changeset.c:

The create_changesets() function sorts the revisions, and then aggregates into changesets.

It uses the function strings_match(), which compares author / commitid / branch-name / log-message (and an internal flag). (It is intentional that the strings are compared by pointer: I keep only one copy of each unique string content).
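The pointer comparison works because of string interning. The sketch below shows the general idea only; the names intern and meta_match, the fixed-size table, and the linear search are assumptions for illustration and are not taken from crap-clone (a real implementation would more likely use a hash table).

```c
#include <stdlib.h>
#include <string.h>

#define MAX_INTERNED 1024

static const char *interned[MAX_INTERNED];
static size_t n_interned;

/* Return the single canonical copy of s, creating it on first use.
 * Returns NULL if the table is full or memory runs out. */
const char *intern(const char *s)
{
    for (size_t i = 0; i < n_interned; i++)
        if (strcmp(interned[i], s) == 0)
            return interned[i];
    if (n_interned == MAX_INTERNED)
        return NULL;

    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);
    interned[n_interned++] = copy;
    return copy;
}

/* Hypothetical metadata record whose fields all come from intern(). */
struct meta {
    const char *author, *commitid, *branch, *log;
};

/* Equal contents imply equal pointers, so no strcmp() is needed here. */
int meta_match(const struct meta *a, const struct meta *b)
{
    return a->author == b->author && a->commitid == b->commitid
        && a->branch == b->branch && a->log == b->log;
}
```

The comparison is only valid if every field was produced by intern(); a string that bypasses the table would compare unequal even when its contents match.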

YorkZ commented 8 years ago

Hi Ralph,

I think you've done a really good job grouping things into changesets; it looks like you did it correctly! I checked a few cases carefully and noticed that even though those CVS commits have the same commit messages, the commit timestamps were really different: the files really were committed several times and really do have different CVS commit IDs. In the cases where several files were committed in one go, they have exactly the same timestamps and CVS commit IDs, and you have put them into one single git commit! I was under the wrong impression because I thought the "Checkin Notice Emails" I receive whenever somebody commits something were automatically generated. It turns out they were not. Those check-in notices were actually composed manually by the developers. I apologize for reporting a non-bug.

I'll ask for your help if I run into a new problem.

Thank you very much!

York

YorkZ commented 8 years ago

Hi Ralph,

In case one truly does commit multiple times with the same commit message, I think it's a good idea to combine the consecutive commits into one single git commit. Therefore, I tried your "no-commitid" branch which seems to work. Great job!

On the other hand, I wanted to make sure that:

1. Only consecutive commits will be combined, right?
2. We combine commits only if they don't intersect, right? For example, if commit 1 includes files "foo" and "bar", but commit 2 includes "foo" and "baz", these two commits will not be combined, right? I think combining the two commits in this case would lose the history of file foo.

Thanks again,

York

rcls commented 8 years ago

Hi York, That was a lucky guess! I shall merge the commit-id change to master. Your understanding of how file-versions are combined into commits is correct. The algorithm is as follows:

1. Tentatively group file-versions into commits using branch / date / log-message / author. (This is in changeset.c).
2. Attempt to put all the commits on each branch into a sequence compatible with the version numbers on each file. (It's a topological-sort of a digraph, see emission.c and heap.c).
3. If that fails, then I break up commits until the previous step succeeds. (When the topo-sort fails, you can identify a cycle in the digraph, and then break the cycle by splitting a commit in two, see the function cycle_split() in emission.c).

The only case I've seen in real life where step 3 is necessary is where my first attempt at building the commits has two versions of the same file. Presumably what happened was that someone committed, fixed a problem immediately, and then committed the fix with the same log message. In theory more complicated inconsistencies in the commit ordering can happen (and the code should cope), but I have not seen this.
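The ordering step can be illustrated with a small topological sort that reports when a cycle prevents a full ordering. This is a sketch using Kahn's algorithm with a linear scan, not the heap-based code in emission.c and heap.c; the struct, the MAX_COMMITS limit, and the function name topo_order are assumptions, and the splitting done by cycle_split() is not shown.

```c
#include <stddef.h>

#define MAX_COMMITS 16

/* edge[a][b] != 0 means commit a must be emitted before commit b,
 * e.g. because a holds an earlier version of a file that b also touches.
 * n must not exceed MAX_COMMITS. */
struct commit_graph {
    size_t n;
    unsigned char edge[MAX_COMMITS][MAX_COMMITS];
};

/* Kahn's algorithm.  Writes a valid emission order into order[] and
 * returns how many commits were ordered.  A result smaller than g->n
 * means the remaining commits sit on or behind a cycle, which is the
 * situation where a commit would have to be split. */
size_t topo_order(const struct commit_graph *g, size_t order[MAX_COMMITS])
{
    size_t indegree[MAX_COMMITS] = { 0 };
    unsigned char done[MAX_COMMITS] = { 0 };
    size_t emitted = 0;

    for (size_t a = 0; a < g->n; a++)
        for (size_t b = 0; b < g->n; b++)
            if (g->edge[a][b])
                indegree[b]++;

    for (;;) {
        /* Pick the lowest-numbered commit with no pending predecessor. */
        size_t next = g->n;
        for (size_t i = 0; i < g->n; i++)
            if (!done[i] && indegree[i] == 0) {
                next = i;
                break;
            }
        if (next == g->n)
            break;              /* finished, or blocked by a cycle */

        done[next] = 1;
        order[emitted++] = next;
        for (size_t b = 0; b < g->n; b++)
            if (g->edge[next][b])
                indegree[b]--;
    }
    return emitted;
}
```

A tentative commit containing two versions of the same file corresponds to a self-loop (edge[a][a] set): its in-degree never reaches zero, so it is never emitted and the shortfall signals that it needs splitting.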

YorkZ commented 8 years ago

Hi Ralph,

Thank you very much for your explanation. I just took a quick look into the file changeset.c, and your code looks neat and nice. Amazing job!

> I shall merge the commit-id change to master.

Definitely, in my opinion! Maybe add a command line switch for this? Also, don't forget the extremely useful "-d DIRECTORY" option!

I guess after you merge the directory-limit branch, we can close this issue because I'm sure it has been addressed by the new "-d DIRECTORY" option.

Thanks again,

York

YorkZ commented 8 years ago

Importing a CVS module into a single Git repository can now be achieved by passing all the directories defined by the CVS module on the command line, using the new "-d DIRECTORY" option.

rcls / crap

Problem importing CVS module #10