pete4abw / lrzip-next

Long Range Zip. Updated and Enhanced version of ckolivas' lrzip project. Lots of new features. Better compression. Actively maintained.
https://github.com/pete4abw/lrzip-next
GNU General Public License v2.0

Possible undefined behaviour #42

Closed pete4abw closed 3 years ago

pete4abw commented 3 years ago

Discussed in https://github.com/pete4abw/lrzip-next/discussions/41

Originally posted by **demhademha** August 16, 2021

So, I pipe tar to lrzip-next, like this:

`tar -cf - bootstrap/* | lrzip-next -vv -z -L9 -p1 -w 80 -o bootstrap.tar.lrz`

However, I get the following output:

```
Bad checksum: 0x66adc716 - expected: 0xfff82df8 e, check file directly.Decompressing...
```

Any ideas why this is happening? Regards
pete4abw commented 3 years ago

The CRC may not be written on a Mac.

pete4abw commented 3 years ago

I tested with no MD5 and CRC Checksums compute properly and are displayed properly. Make sure you are pulling the latest from either Master or lzma-21.03beta. I am unable to duplicate the error. Please try and provide more detailed system information. Mac support is very old and I don't use one either!

Compression command:

```
LRZIP=NOCONFIG tar -cf - src/ | src/lrzip-next -vvzw80 -L9 -fo /tmp/test.tar.lrz
```

Test command:

```
LRZIP=NOCONFIG src/lrzip-next -vvt ../test.tar.lrz
```

Thread 2 decompressed 364333 bytes from stream 13:100%  
Thread 1 decompressed 10485760 bytes from stream 1
Taking decompressed data from thread 1
  0%       0.00 /     25.25 MB
Taking decompressed data from thread 2
100%      25.25 /     25.25 MB
Checksum for block: 0xd609e62c
Closing stream at 2073215, want to seek to 2073215

Average DeCompression Speed:  1.087MB/s
[OK] - 26480640 bytes                                
Total time: 00:00:23.55

ps2lrz output

$ ps2lrz -i ../test.tar.lrz
Showing file info only
../test.tar.lrz is an lrzip version 0.7 file
../test.tar.lrz is not encrypted
../test.tar.lrz uncompressed file size is 26480640 bytes
Dumping magic header 24 bytes
Byte Offset      Description/Content
===========      ===================
Magic Bytes 0-3: 4C 52 5A 49 LRZI
Bytes 4-5:       LRZIP Major, Minor version: 00, 07
Bytes 6-13:      LRZIP Uncompressed Size bytes: 00 10 94 01 00 00 00 00 
Bytes 14 and 15: unused
Byte  16:        LRZIP Filter 0 - None
Bytes 17-21:     unused. Not an LZMA compressed archive
Byte  22:        MD5 Sum at EOF: no
Byte  23:        File is encrypted: no
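The 24-byte magic header dumped above can be decoded mechanically. Below is a minimal sketch of that decoding; the field offsets are taken from the ps2lrz output above, the byte values are from this archive, and the uncompressed size is stored little-endian. This is illustrative, not the canonical parser.

```python
import struct

# Bytes reproduced from the ps2lrz dump above.
magic = bytes([0x4C, 0x52, 0x5A, 0x49,              # bytes 0-3: "LRZI"
               0x00, 0x07,                          # bytes 4-5: major, minor version
               0x00, 0x10, 0x94, 0x01,              # bytes 6-13: uncompressed size,
               0x00, 0x00, 0x00, 0x00,              #   little-endian 64-bit
               0x00, 0x00,                          # bytes 14-15: unused
               0x00,                                # byte 16: filter (0 = none)
               0x00, 0x00, 0x00, 0x00, 0x00,        # bytes 17-21: unused (no LZMA properties)
               0x00,                                # byte 22: MD5 sum at EOF (no)
               0x00])                               # byte 23: encrypted (no)

assert magic[0:4] == b"LRZI"
major, minor = magic[4], magic[5]
size = struct.unpack_from("<Q", magic, 6)[0]
print(major, minor, size)   # 0 7 26480640
```

Decoding the size bytes `00 10 94 01 00 00 00 00` little-endian gives 0x01941000 = 26480640, matching the "uncompressed file size" line in the dump.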
pete4abw commented 3 years ago

Out of curiosity, @demhademha, could you please try this with lrzip and its flavors of lrztar and lrzuntar? See if the same error pops up. I just cannot duplicate it here.

demhademha commented 3 years ago

Yes, I am following. I'll try by updating lrzip-next first.

pete4abw commented 3 years ago

> Yes, I am following. I'll try by updating lrzip-next first.

@demhademha How are we doing with this?

demhademha commented 3 years ago

Sorry - have been ill. Trying now as we speak

pete4abw commented 3 years ago

> Sorry - have been ill. Trying now as we speak

That's too bad. I just want to be able to close this one way or another! Take care.

demhademha commented 3 years ago

The latest test was carried out with the latest GitHub tag, 0.8.3 (I used a different tarball, but the same issue still applies):

The following options are in effect for this DECOMPRESSION.
Threading is ENABLED. Number of CPUs detected: 8
Detected 8589934592 bytes ram
Nice Value: 19
Show Progress
Max Verbose
Temporary Directory set as: /var/folders/gn/98nbmc7569b6xx2bdqhm57bh0000gn/T/
Output filename is: bin.tar
Malloced 2863300608 for tmp_outbuf
Detected lrzip version 0.8 file.
Not performing MD5 hash check
CRC32 being used for integrity testing.
Validating file for consistency...
Detected lrzip version 0.8 file.
Decompressing...
Reading chunk_bytes at 18
Expected size: 113203200
Chunk byte width: 4
Reading eof flag at 19
EOF: 1
Reading expected chunksize at 20
Chunk size: 113203200
Reading stream 0 header at 25
Reading stream 1 header at 38
Reading ucomp header at 51
Fill_buffer stream 0 c_len 828343 u_len 2773542 last_head 0
Starting thread 0 to decompress 828343 bytes from stream 0
Thread 0 decompressed 2773542 bytes from stream 0
Taking decompressed data from thread 0
Reading ucomp header at 828407
Fill_buffer stream 1 c_len 20056328 u_len 67475366 last_head 0
Starting thread 1 to decompress 20056328 bytes from stream 1
Thread 1 decompressed 67475366 bytes from stream 1
Taking decompressed data from thread 1
100%     107.96 /    107.96 MB
Closing stream at 20884747, want to seek to 20884747
Bad checksum: 0x9bd117b1 - expected: 0x162e4b9f
Deleting broken file bin.tar
Fatal error - exiting

Aside from this issue, why is the temporary dir /var/folders/gn/? Regards

pete4abw commented 3 years ago

Please include

  1. Output from ./configure. I want to see if assembler modules are included.
  2. Full command line to compress, and its output.
  3. Full command line to decompress, and its output.

Just showing output only gives me the end of the story ;-).

Thank you for your help on this. I need to make a template for issues.

demhademha commented 3 years ago

Here is the config log as well as the output from make: (Will give the rest in a few moments) https://paste.debian.net/1209428/ Regards

demhademha commented 3 years ago

Here is the compression log (I already provided the command I used to do this with my original issue - I just used a different directory here). Also, I provided the decompression log earlier today

The following options are in effect for this COMPRESSION.
Threading is DISABLED. Number of CPUs detected: 1
Detected 8589934592 bytes ram
Nice Value: 19
Show Progress
Max Verbose
Output Filename Specified: bin.tar.lrz
Temporary Directory set as: /var/folders/gn/98nbmc7569b6xx2bdqhm57bh0000gn/T/
Compression mode is: ZPAQ. LZ4 Compressibility testing enabled
Compression level 9
RZIP Compression level 9
ZPAQ Compression Level: 5, ZPAQ initial Block Size: 11
Heuristically Computed Compression Window: 27 = 2700MB
Storage time in seconds 1377922451
Shrinking chunk to 113203200
Succeeded in testing 113203200 sized mmap for rzip pre-processing
Chunk size: 113203200
Byte width: 4
ZPAQ Block Size reduced to 4
Per Thread Memory Overhead is 33554432
Succeeded in testing 146757632 sized malloc for back end compression
Using only 1 thread to compress up to 113203200 bytes
Beginning rzip pre-processing phase
hashsize = 4194304.  bits = 22. 64MB
Total:  1%  Chunk:  1%
Total:  2%  Chunk:  2%
Total:  3%  Chunk:  3%
Total:  4%  Chunk:  4%
Total:  5%  Chunk:  5%
Total:  6%  Chunk:  6%
Total:  7%  Chunk:  7%
Total:  8%  Chunk:  8%
Total:  9%  Chunk:  9%
Total: 10%  Chunk: 10%
Total: 11%  Chunk: 11%
Total: 12%  Chunk: 12%
Starting sweep for mask 1
Total: 13%  Chunk: 13%
Total: 14%  Chunk: 14%
Starting sweep for mask 3
Total: 15%  Chunk: 15%
Total: 16%  Chunk: 16%
Total: 17%  Chunk: 17%
Total: 18%  Chunk: 18%
Total: 19%  Chunk: 19%
Total: 20%  Chunk: 20%
Total: 21%  Chunk: 21%
Total: 22%  Chunk: 22%
Total: 23%  Chunk: 23%
Total: 24%  Chunk: 24%
Total: 25%  Chunk: 25%
Total: 26%  Chunk: 26%
Total: 27%  Chunk: 27%
Total: 28%  Chunk: 28%
Starting sweep for mask 7
Total: 29%  Chunk: 29%
Total: 30%  Chunk: 30%
Total: 31%  Chunk: 31%
Total: 32%  Chunk: 32%
Total: 33%  Chunk: 33%
Total: 34%  Chunk: 34%
Total: 35%  Chunk: 35%
Total: 36%  Chunk: 36%
Total: 37%  Chunk: 37%
Total: 38%  Chunk: 38%
Total: 39%  Chunk: 39%
Total: 40%  Chunk: 40%
Total: 41%  Chunk: 41%
Total: 42%  Chunk: 42%
Total: 43%  Chunk: 43%
Total: 44%  Chunk: 44%
Total: 45%  Chunk: 45%
Total: 46%  Chunk: 46%
Total: 47%  Chunk: 47%
Total: 48%  Chunk: 48%
Total: 49%  Chunk: 49%
Total: 50%  Chunk: 50%
Total: 51%  Chunk: 51%
Total: 52%  Chunk: 52%
Total: 53%  Chunk: 53%
Total: 54%  Chunk: 54%
Total: 55%  Chunk: 55%
Total: 56%  Chunk: 56%
Total: 57%  Chunk: 57%
Total: 58%  Chunk: 58%
Total: 59%  Chunk: 59%
Total: 60%  Chunk: 60%
Total: 61%  Chunk: 61%
Total: 62%  Chunk: 62%
Total: 63%  Chunk: 63%
Starting sweep for mask 15
Total: 64%  Chunk: 64%
Total: 65%  Chunk: 65%
Total: 66%  Chunk: 66%
Total: 67%  Chunk: 67%
Total: 68%  Chunk: 68%
Total: 69%  Chunk: 69%
Total: 70%  Chunk: 70%
Total: 71%  Chunk: 71%
Total: 72%  Chunk: 72%
Total: 73%  Chunk: 73%
Total: 74%  Chunk: 74%
Total: 75%  Chunk: 75%
Total: 76%  Chunk: 76%
Total: 77%  Chunk: 77%
Total: 78%  Chunk: 78%
Total: 79%  Chunk: 79%
Total: 80%  Chunk: 80%
Total: 81%  Chunk: 81%
Total: 82%  Chunk: 82%
Total: 83%  Chunk: 83%
Total: 84%  Chunk: 84%
Total: 85%  Chunk: 85%
Total: 86%  Chunk: 86%
Total: 87%  Chunk: 87%
Total: 88%  Chunk: 88%
Total: 89%  Chunk: 89%
Total: 90%  Chunk: 90%
Total: 91%  Chunk: 91%
Total: 92%  Chunk: 92%
Total: 93%  Chunk: 93%
Total: 94%  Chunk: 94%
Total: 95%  Chunk: 95%
Total: 96%  Chunk: 96%
Total: 97%  Chunk: 97%
Total: 98%  Chunk: 98%
Total: 99%  Chunk: 99%
2796202 total hashes -- 173335 in primary bucket (6.199%)
Malloced 2863300608 for checksum ckbuf
Starting thread 0 to compress 2773542 bytes from stream 0
lz4 testing OK for chunk 2773542. Compressed size = 64.23% of test size 2773542, 1 Passes
Starting zpaq backend compression thread 0...
ZPAQ: Method selected: 54,92,0: level=5, bs=4, easy=92, type=0
Writing initial chunk bytes value 4 at 18
Writing EOF flag as 1
Writing initial header at 24
Compthread 0 seeking to 9 to store length 4
Compthread 0 seeking to 26 to write header
Thread 0 writing 828343 compressed bytes from stream 0
Compthread 0 writing data at 39
Starting thread 0 to compress 67475366 bytes from stream 1
lz4 testing OK for chunk 67475366. Compressed size = 62.74% of test size 10485760, 1 Passes
Starting zpaq backend compression thread 0...
ZPAQ: Method selected: 54,97,0: level=5, bs=4, easy=97, type=0
Compthread 0 seeking to 22 to store length 4
Compthread 0 seeking to 828382 to write header
Thread 0 writing 20056328 compressed bytes from stream 1
Compthread 0 writing data at 828395
matches=281576 match_bytes=45727834
literals=267502 literal_bytes=67475366
true_tag_positives=14930353 false_tag_positives=14903077
inserts=8812639 match 0.678
Compression Ratio: 5.420. Average Compression Speed:  0.552MB/s.
Total time: 00:03:14.62
pete4abw commented 3 years ago

Please include the command line you used. Also, when you run ./autogen.sh you do not have to run ./configure, as it runs automatically unless you set the variable NOCONFIGURE=1. So you have no assembler module, which gives a clue. Get me your compress and decompress command lines and I can review. Probably not today, but tomorrow hopefully.

demhademha commented 3 years ago

Compress command:

`tar -cf - bin/* | lrzip-next -vv -z -L9 -p1 -o bin.tar.lrz`

Decompress command:

`lrzip-next -vv -d bin.tar.lrz`

pete4abw commented 3 years ago

Excellent. Why do you use -p1? It slows things down terribly. Let lrzip-next choose the value.

Now, there is an entry in the ChangeLog that specifically disabled MD5 checking on Apple because it was broken.

APRIL 2011, version 0.602 Con Kolivas ....

  • Disable md5 generation and checking on Apple till it's fixed.

It's 10 years on now. If you don't mind, I want you to edit the file src/include/lrzip_private.h

Change line 175 to `# define MD5_RELIABLE (1)`:

174 #if defined(__APPLE__)
175 # define MD5_RELIABLE (0)
176 #else
177 # define MD5_RELIABLE (1)
178 #endif

ITMT, I think you found a bug where the MD5 hash is getting written instead of CRC regardless of the setting. Thanks!

Let me know about the edit.

pete4abw commented 3 years ago

Also, ITMT, if you undo the previous patch, try this

Change line 1113 and replace the ending zero with a 4

1113         if (ofs >= infile_size - (HAS_MD5 ? MD5_DIGEST_SIZE : 0))
1114                 goto done;
1115         else if (ENCRYPT)
1116                 if (ofs+header_length > infile_size)

Recompile and try. Thank you
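The bounds check above decides how many trailer bytes to leave unread at EOF: 16 for an MD5 digest, 4 for a CRC32. A minimal sketch of that arithmetic follows; the constant names and trailer layout are assumptions taken from this thread, not lrzip-next's actual code.

```python
import hashlib
import zlib

MD5_DIGEST_SIZE = 16   # bytes reserved when an MD5 sum is stored at EOF
CRC32_SIZE = 4         # bytes the proposed patch reserves for a CRC trailer

payload = b"example stream data"
archive_md5 = payload + hashlib.md5(payload).digest()
archive_crc = payload + zlib.crc32(payload).to_bytes(4, "little")

# The reader must stop exactly trailer-size bytes before EOF; subtracting the
# wrong size makes it consume hash bytes as stream data, and a "Bad checksum"
# error like the one reported here would follow.
assert len(archive_md5) - MD5_DIGEST_SIZE == len(payload)
assert len(archive_crc) - CRC32_SIZE == len(payload)
print("trailer sizes consistent")
```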

demhademha commented 3 years ago

Hi, I'm confused, Would you like me to try just the latest patch you provided, or the one before that as well? Regards

pete4abw commented 3 years ago

> Hi, I'm confused, Would you like me to try just the latest patch you provided, or the one before that as well? Regards

You can do both; they are mutually exclusive. The first just tells the Apple Mac to compute an MD5 instead of a CRC. The second reserves space at EOF for the CRC, which is 4 bytes long. If Apple can do an MD5 reliably, then I would not worry about the second patch. If you want to test the CRC, do #2 first.

Also, does this occur with lrzip as well? I'm still struggling with why MD5 was excluded from the ARM version of lrzip.

pete4abw commented 3 years ago

You can stop testing. I believe the error is in the 7zCrc.c file where I stripped it down to ignore Crc methods not used. I may have taken out some code needed by ARM CPUs.

Since so many users are x86 users, I just neglected the non assembler sources. Thanks for your patience while I try and work this out. I may just regress and go back to earlier versions.

Try checking out the master branch and let me know if that works.

pete4abw commented 3 years ago

Appreciate all your help. Before I commit, please just make this edit as I mentioned before. I really think disabling MD5 for Apple is unnecessary.

In file src/include/lrzip_private.h, change line 175 to `# define MD5_RELIABLE (1)`:


174 #if defined(__APPLE__)
175 # define MD5_RELIABLE (0)
176 #else
177 # define MD5_RELIABLE (1)
178 #endif

Then, compile again and run your test compression and decompression commands and be sure to use the `-H` option.

Let me know how this works.
pete4abw commented 3 years ago

@demhademha new branch 21.03Apple_Test is ready for your test. Thank you.

demhademha commented 3 years ago

I'll be sure to test this for you tomorrow

demhademha commented 3 years ago

I'm currently in the process of converting the specified branch into a tarball.

pete4abw commented 3 years ago

To make it easier for you, keep it simple. Don't use the -p option, don't use -L9, and make a tarball of /bin. The routines for CRC and MD5 are independent of compression or decompression options or levels.

demhademha commented 3 years ago

Currently testing, but when running ./autogen.sh, got the following output:

readlink: illegal option -- f
usage: readlink [-n] [file ...]
usage: dirname string [...]
Running autoreconf -if...
sed: 1: "s/^v(.*?-)g(.*)$/\1\2/": RE error: repetition-operator operand invalid
./util/gitdesc.sh: line 58: [: -gt: unary operator expected
sed: 1: "s/^v(.*?-)g(.*)$/\1\2/": RE error: repetition-operator operand invalid
./util/gitdesc.sh: line 58: [: -gt: unary operator expected
sed: 1: "s/^v(.*?-)g(.*)$/\1\2/": RE error: repetition-operator operand invalid
./util/gitdesc.sh: line 58: [: -gt: unary operator expected
#ETC (comment placed by me)

I remember this issue being in lrzip, but then it was fixed - thought I'd let you know. Regards

pete4abw commented 3 years ago

autogen.sh has not been changed in years. gitdesc.sh was just changed by a Mac user to make it compatible! So I am not sure why this is an issue. However, if you downloaded a tarball, that sed command should not even be executed. @lr4d

As for autogen.sh, my guess is that the macOS shell is not compatible with the full set of bash commands.

Try these two file replacements. lrzip-next-replacements.tar.gz

extract and re-run ./autogen.sh

demhademha commented 3 years ago

I can confirm your changes work - the compressed tarball decompresses correctly.

demhademha commented 3 years ago

Another question: what lrzip commands should I use to get the highest compression (regardless of time)? My original compression commands (-z -L9 -p1) compared to just using (-z) produce a difference of only 4 MB.

pete4abw commented 3 years ago

Excellent, @demhademha! Thank you for your patience and help! I will push these changes to master and the main development branch shortly. Now, to answer your question: this Wiki article, while written for LZMA, also applies to ZPAQ and is a good place to start.

Longer description

The principle of lrzip, and therefore lrzip-next, was to leverage multi-processor CPUs and pass smaller blocks of data to lzma, and later zpaq, so they could process them faster AND in parallel. And, because of the rzip preprocessing @ckolivas put in, in a stroke of genius, the net size of the data to be compressed was reduced as well. So smaller net data, plus blocks being compressed in parallel, means better performance. You can measure for yourself the impact of multi-threading using different values for -p.

If you examine the source for the two functions, setup_overhead() in util.c and open_stream_out() in stream.c, you will notice that for each compression level, a dictionary size (or block size, for zpaq) is computed. The higher the level, the larger the memory requirement TIMES the number of threads. So, if you want the largest possible dictionary size and the largest possible block passed to the backend compressor, -L9 -p1 is your ticket. Note that depending on available memory, the actual dictionary/block size and number of threads compressing in parallel may be reduced. Again, see open_stream_out() and observe the output when using -vv.
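The level-times-threads trade-off just described can be sketched numerically. The growth formula below is purely illustrative (the real computation lives in setup_overhead() and open_stream_out()); the function name and the per-level sizes are assumptions.

```python
# Illustrative only: per-thread dictionary memory grows with compression
# level, so the total budget is dictionary_size * threads, capped by RAM.
def usable_dictionary(level, threads, ram_bytes):
    wanted = 1 << (20 + level)        # hypothetical: 2 MB at -L1 up to 512 MB at -L9
    return min(wanted, ram_bytes // threads)

ram = 2 * 1024**3                     # suppose 2 GB of RAM
print(usable_dictionary(9, 1, ram))   # one thread gets the full 536870912
print(usable_dictionary(9, 8, ram))   # eight threads are capped at 268435456
```

This is why -L9 -p1 maximizes the dictionary handed to the backend: with one thread, the whole memory budget goes to a single block.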

However, since lrzip and lrzip-next already hash data using rzip before passing it to the backend, the hard work is done and the backend compressor can focus on compressing the remaining data. That, plus multi-threading, makes the process efficient. When you use -p1, as much data as possible, fitting into one block, is sent sequentially to the backend compressor, and lrzip-next waits for it to return before processing the next block. This is inefficient. Some options, such as -w, -m, and -p, are best used for research, benchmarking, and testing. The -U option will consume all your memory and swap, and for most purposes is useless and SLOW.

Every day, I seem to learn more about lrzip-next and its capabilities even though I have been involved with the project since 2007. Since 2019 when I broke off from the main branch of lrzip I have been trying to continue to optimize and improve it, test out new ideas, add new features like filtering (which is useful when compressing binary directories). Your participation helps lrzip-next get better still.

Thanks again for all your time.

lr4d commented 3 years ago

Sorry for the late response

This

sed: 1: "s/^v(.*?-)g(.*)$/\1\2/": RE error: repetition-operator operand invalid
./util/gitdesc.sh: line 58: [: -gt: unary operator expected

indeed looks like you are using an incompatible version of sed for that command. If the issue re-appears, you could post the output of `sed --version`. We could change that command to one of the alternatives proposed in #43.
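The failing pattern uses the lazy quantifier `.*?`, a Perl/PCRE extension that BSD sed's POSIX ERE engine does not support, hence the "repetition-operator operand invalid" error. A sketch of the intended transformation, along with a greedy pattern that POSIX ERE also accepts; the sample tag string is hypothetical.

```python
import re

# Intent of sed -E 's/^v(.*?-)g(.*)$/\1\2/': drop the leading "v" and the
# "g" prefixing the abbreviated commit hash in git-describe output.
tag = "v0.8.3-5-g1a2b3c4"   # hypothetical `git describe --long` result

# Greedy version: valid POSIX ERE, and gives the same result for this
# input shape because the last "-" precedes the "g<hash>" suffix.
cleaned = re.sub(r'^v(.*-)g(.*)$', r'\1\2', tag)
print(cleaned)   # 0.8.3-5-1a2b3c4
```

The commented-out pure-bash fallback in gitdesc.sh (`${describe_tag/v/}` and `${describe_tag/g/}`) avoids sed entirely by deleting the first "v" and first "g" with parameter expansion.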

pete4abw commented 3 years ago

@lr4d, for @demhademha, I made two changes: 1. added `#!/bin/env bash`, and 2. put back the curly braces and commented out sed. However, after that, I left in `#!/bin/env bash` and put back in `describe_tag=$(git describe $tagopt --long --abbrev=7 | sed -E 's/^v(.*?-)g(.*)$/\1\2/')`.

Waiting to see if anyone finds a problem. So now, gitdesc.sh looks like this.

init() {
    if [ -d '.git' ] ; then
        # Lucas Rademaker
        # If the below does not work using sed, comment that line
        # and uncomment the three describe_tag= lines below it.
        describe_tag=$(git describe $tagopt --long --abbrev=7 | sed -E 's/^v(.*?-)g(.*)$/\1\2/')
#       describe_tag=$(git describe $tagopt --long --abbrev=7)
#       describe_tag=${describe_tag/v/}
#       describe_tag=${describe_tag/g/}