pkolano / mutil

Multi-threaded cp and md5sum based on GNU coreutils
https://pkolano.github.io/projects/mutil.html

Update for newer coreutils version? #2

Open biocyberman opened 1 year ago

biocyberman commented 1 year ago

Hi, could you produce a patch for newer coreutils versions? We are using Ubuntu 20.04 and 22.04, and both ship newer coreutils versions (8.30 and 8.32 respectively).

pkolano commented 1 year ago

I do plan on creating a version for the final 8.x release (8.32) at some point, but it has been low priority thus far. Last I checked, I didn't see any significant enhancements or vulnerability fixes for cp or md5sum since the current 8.22 branch that would impact 8.22 viability (e.g. like the sparse file handling improvements from 7.x to 8.x). If you are aware of any, please let me know so I can increase priority accordingly. You should only use the generated cp and md5sum in any case, since the changes have not been tested with any other coreutils program, so I don't think running 8.22 cp/md5sum in parallel with 8.3x for the rest should affect anything. I will try to look at the 8.32 code to see how much work is involved in updating it.

biocyberman commented 1 year ago

I will take a look as well. I have just overcome a problem with patching the v8.22 version. I used the git source code, which had not been run through bootstrap, and the patch failed. Later I found out that it is crucial to download the release tarball from: https://ftp.gnu.org/gnu/coreutils/coreutils-8.22.tar.xz.
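A minimal sketch of that procedure, assuming GNU tar and patch are available. The helper name `patch_release_tarball` and its arguments are hypothetical and not part of mutil. It extracts a release tarball, performs a dry run of the patch, and only applies the patch for real when every hunk applies:

```shell
#!/bin/sh
# Hypothetical helper (not part of mutil).
#   $1 = release tarball, e.g. coreutils-8.22.tar.xz
#   $2 = patch file, e.g. ../mutil/patch/coreutils-8.22.patch
#   $3 = directory to extract the sources into
patch_release_tarball() {
    tarball=$1
    dest=$3
    # Make the patch path absolute, because patch runs from inside $dest.
    patchfile=$(cd "$(dirname "$2")" && pwd)/$(basename "$2")

    mkdir -p "$dest" || return 1
    # --strip-components=1 drops the leading coreutils-8.22/ directory.
    tar -xf "$tarball" -C "$dest" --strip-components=1 || return 1

    # Dry run: no file is modified, but a failing hunk gives a nonzero status.
    if ! (cd "$dest" && patch --dry-run -p1 < "$patchfile" > /dev/null); then
        echo "patch does not apply cleanly to $tarball" >&2
        return 1
    fi

    # Every hunk applied in the dry run, so apply the patch for real.
    (cd "$dest" && patch -p1 < "$patchfile")
}
```

For example, after downloading the tarball from the URL above: `patch_release_tarball coreutils-8.22.tar.xz ../mutil/patch/coreutils-8.22.patch build`.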

For future reference, the failed patch looks like this:

coreutils git:(338446102) patch -p1 < ../mutil/patch/coreutils-8.22.patch
patching file COPYING
patching file Makefile.in
Hunk #1 FAILED at 1.
Hunk #2 succeeded at 290 (offset -13 lines).
Hunk #3 succeeded at 680 (offset -10 lines).
Hunk #4 FAILED at 761.
Hunk #5 FAILED at 815.
Hunk #6 FAILED at 906.
Hunk #7 succeeded at 1349 (offset 268 lines).
Hunk #8 succeeded at 1380 (offset 268 lines).
Hunk #9 succeeded at 1755 (offset 290 lines).
Hunk #10 succeeded at 2725 (offset 293 lines).
Hunk #11 FAILED at 2981.
Hunk #12 succeeded at 3542 (offset 293 lines).
Hunk #13 succeeded at 3764 (offset 293 lines).
Hunk #14 succeeded at 3801 (offset 293 lines).
Hunk #15 FAILED at 4446.
Hunk #16 FAILED at 5304.
Hunk #17 succeeded at 5717 (offset 293 lines).
Hunk #18 FAILED at 5541.
Hunk #19 succeeded at 5919 with fuzz 2 (offset 293 lines).
Hunk #20 FAILED at 5778.
Hunk #21 FAILED at 6444.
Hunk #22 FAILED at 6480.
Hunk #23 FAILED at 6605.
Hunk #24 FAILED at 6647.
Hunk #25 FAILED at 6661.
Hunk #26 FAILED at 6675.
Hunk #27 FAILED at 6689.
Hunk #28 FAILED at 6703.
Hunk #29 succeeded at 7016 with fuzz 2 (offset 299 lines).
Hunk #30 FAILED at 6776.
Hunk #31 FAILED at 7314.
19 out of 31 hunks FAILED -- saving rejects to file Makefile.in.rej
patching file aclocal.m4
Hunk #1 FAILED at 1.
Hunk #2 FAILED at 32.
Hunk #3 FAILED at 51.
Hunk #4 succeeded at 415 with fuzz 2 (offset -3 lines).
Hunk #5 FAILED at 532.
Hunk #6 FAILED at 540.
Hunk #7 FAILED at 652.
Hunk #8 succeeded at 833 with fuzz 2 (offset 79 lines).
Hunk #9 succeeded at 1434 (offset 79 lines).
6 out of 9 hunks FAILED -- saving rejects to file aclocal.m4.rej
patching file configure
Hunk #1 succeeded at 1902 (offset 25 lines).
Hunk #2 succeeded at 2013 (offset 29 lines).
Hunk #3 succeeded at 2687 (offset 38 lines).
Hunk #4 succeeded at 2707 (offset 38 lines).
Hunk #5 FAILED at 4051.
Hunk #6 succeeded at 4867 with fuzz 2 (offset 279 lines).
Hunk #7 succeeded at 6241 with fuzz 2 (offset 715 lines).
Hunk #8 FAILED at 6005.
Hunk #9 succeeded at 8528 with fuzz 2 (offset 255 lines).
Hunk #10 succeeded at 25805 (offset 321 lines).
Hunk #11 succeeded at 55377 (offset 262 lines).
2 out of 11 hunks FAILED -- saving rejects to file configure.rej
patching file gnulib-tests/Makefile.in
Hunk #1 FAILED at 1.
Hunk #2 succeeded at 507 (offset 7 lines).
Hunk #3 succeeded at 3527 (offset 180 lines).
1 out of 3 hunks FAILED -- saving rejects to file gnulib-tests/Makefile.in.rej
patching file lib/config.hin
Hunk #1 succeeded at 1429 (offset -6 lines).
Hunk #2 succeeded at 1510 (offset -6 lines).
Hunk #3 succeeded at 1627 (offset -6 lines).
Hunk #4 succeeded at 1687 (offset -6 lines).
Hunk #5 succeeded at 1765 (offset -9 lines).
Hunk #6 FAILED at 2803.
1 out of 6 hunks FAILED -- saving rejects to file lib/config.hin.rej
File m4/gnulib-common.m4 is not a regular file -- refusing to patch
1 out of 1 hunk ignored -- saving rejects to file m4/gnulib-common.m4.rej
patching file m4/mutil.m4
patching file src/copy.c
patching file src/copy.h
patching file src/cp.c
patching file src/extent-scan.c
patching file src/extent-scan.h
patching file src/local.mk
patching file src/md5sum.c
patching file src/mutil-q.h
patching file src/mutil.c
patching file src/mutil.h
patching file tests/cp/backup-is-src.sh
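A log like the one above is easier to read when condensed to one line per affected file. The following is a small sketch, assuming the message format shown above; `summarize_patch_log` is a hypothetical name, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: read patch output (from files or stdin) and print,
# for every file with rejected hunks, how many of its hunks failed.
summarize_patch_log() {
    awk '
        # Remember the file named by the most recent "patching file" line.
        /^patching file / { file = $3 }
        # Summary lines look like: "19 out of 31 hunks FAILED -- saving rejects ..."
        /hunks? FAILED -- saving rejects/ {
            print file ": " $1 " of " $4 " hunks failed"
        }
    ' "$@"
}
```

Run as `patch -p1 < ../mutil/patch/coreutils-8.22.patch | summarize_patch_log`. For the log above it reports failures in Makefile.in (19 of 31), aclocal.m4 (6 of 9), configure (2 of 11), gnulib-tests/Makefile.in (1 of 3), and lib/config.hin (1 of 6).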

biocyberman commented 1 year ago

I had to compile the patched version inside an Ubuntu Trusty (14.04) Docker container with CFLAGS="-w -I/usr/include/mpi". I haven't tested running the binaries on Ubuntu 20.04 after the compilation. Anyway, this is certainly not straightforward, so a newer patch version is needed.

After looking a bit deeper into the code, I think it would be easier to sync with upstream coreutils if you:

  1. Create a branch from the coreutils git repo: https://github.com/coreutils/coreutils.
  2. Check out the correct branch and apply the patch.
  3. Although it may still require manual checking, cherry-picking or rebasing onto newer releases will be doable.

In essence, I am suggesting creating a fork of coreutils and keeping it in sync with upstream. Putting it on GitHub will attract people to help maintain the fork. We can then just clone the fork and compile. This has the extra advantage of saving people from patching the code, which involves more work.
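The three steps can be sketched as a small shell function. This is only an illustration under assumptions: `port_patch` and the branch name `mutil` are hypothetical, and the patch is assumed to apply to the git checkout of the old tag. For the real mutil patch that would likely mean passing extra `git apply` options such as `--include='src/*'`, since the patch also touches generated files (configure, Makefile.in) that failed to patch in the git checkout above.

```shell
#!/bin/sh
# Hypothetical helper: commit a patch on top of an old release tag, then
# rebase that commit onto a newer tag.
#   $1 = repository URL or path, e.g. https://github.com/coreutils/coreutils
#   $2 = tag the patch was written for, e.g. v8.22
#   $3 = patch file (absolute path)
#   $4 = newer tag to rebase onto, e.g. v8.30
#   $5 = directory for the clone
#   remaining arguments are passed to "git apply" unchanged
port_patch() {
    repo=$1; old_tag=$2; patchfile=$3; new_tag=$4; work=$5
    shift 5

    # Step 1: clone the repository and branch from the old release.
    git clone --quiet "$repo" "$work" || return 1
    git -C "$work" checkout --quiet -b mutil "$old_tag" || return 1

    # Step 2: apply the patch to the index and working tree, then commit it.
    git -C "$work" apply --index "$@" "$patchfile" || return 1
    git -C "$work" commit --quiet -m "Apply patch on top of $old_tag" || return 1

    # Step 3: replay the commit on the newer release. Conflicts stop the
    # rebase here and have to be resolved by hand.
    git -C "$work" rebase --quiet "$new_tag"
}
```

Keeping the changes as a commit on a branch means later releases only need another rebase (or cherry-pick) rather than a regenerated patch file.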

With the fork, the output binaries can also be renamed to mcp and mmd5sum so that an accidental installation does not overwrite the system cp and md5sum.
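Until the build itself produces renamed binaries, the same effect can be had at install time. A minimal sketch, assuming the build leaves the patched programs at src/cp and src/md5sum in the build tree; `install_renamed` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: install the patched binaries under different names so
# the system cp and md5sum are never touched.
#   $1 = coreutils build directory containing src/cp and src/md5sum
#   $2 = destination directory, e.g. $HOME/bin
install_renamed() {
    builddir=$1
    bindir=$2
    mkdir -p "$bindir" || return 1
    # install copies each file and sets its mode in one step.
    install -m 755 "$builddir/src/cp" "$bindir/mcp" &&
    install -m 755 "$builddir/src/md5sum" "$bindir/mmd5sum"
}
```

For example: `install_renamed ./coreutils-8.22 "$HOME/bin"`.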

biocyberman commented 1 year ago

@pkolano I am trying to bring over the changes you made for v8.22 to v8.30. The biggest challenge is copy.c, specifically the functions sparse_copy and extent_copy. The differences between copy.c in versions 8.22 and 8.30 are too great for me. I don't understand the code yet, so I don't know how to merge. Another approach is to walk up the releases one by one (e.g. bring v8.22 to v8.23, then v8.24, and so on), but it would take too much time to reach v8.30.
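One way to judge which of those release-by-release steps would be expensive is to measure how much copy.c changed between consecutive tags. A rough sketch, assuming a local clone of the coreutils git repository; `churn_per_release` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: for each consecutive pair of tags, print how many
# lines of one file were added and removed.
#   $1 = path to a git clone
#   $2 = file to inspect, e.g. src/copy.c
#   remaining arguments = tags in release order, e.g. v8.22 v8.23 v8.24
churn_per_release() {
    repo=$1
    file=$2
    shift 2
    prev=$1
    shift
    for tag in "$@"; do
        # --numstat prints "<added> <deleted> <path>", or nothing at all
        # when the file did not change between the two tags.
        git -C "$repo" diff --numstat "$prev" "$tag" -- "$file" |
            awk -v range="$prev..$tag" '{ print range ": +" $1 " -" $2 }'
        prev=$tag
    done
}
```

For example: `churn_per_release ~/coreutils src/copy.c v8.22 v8.23 v8.24 v8.25`. Steps with little or no churn should port with little effort, which narrows down where the manual merge work sits.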

pkolano commented 1 year ago

Yes, copy.c is where the bulk of the functionality resides, so it is always the most time-consuming to port. As for the build problems, I will see if I can reproduce them on a newer system; I very rarely build from scratch, so it's certainly possible a dependency became incompatible somehow. As you found, it is always assumed you will patch against the official release tarball, as no other coreutils source has been validated in any way.

I think your suggestion of adding an already patched source tree is reasonable and doable although I would want coreutils as a subdirectory rather than taking over the top level. I don't know, however, if such a structure would complicate/negate any of the benefits you were talking about (I'm not a big git user so not totally familiar with all its features/limitations).

As I said, I'm not opposed to doing the work to create an updated version, but it does take quite a bit of effort, especially on the validation side. A massive quantity of data has been copied/summed with the 8.22 branch (I'd estimate 75+ PB for mcp and 250+ PB for msum just at our facility), so I'd rather try to fix build issues on a highly vetted branch than move to something completely new. As mentioned, however, if there were some kind of significant enhancement/vulnerability fix from 8.22 to 8.32, then it would demand a higher priority. Anyway, I will look at the build.