martinellimarco / t2sz

Compress a file into a seekable zstd with special handling for .tar archives
GNU General Public License v3.0

Rounding error #6

Closed codecnotsupported closed 2 years ago

codecnotsupported commented 2 years ago

I was compressing a 250GB tar archive with -s 50K and, oddly enough, it seems to have stopped at 4GB: ERROR: Invalid tar header. If this is not a tar archive use raw mode (-r). Maybe the same problem as with https://github.com/martinellimarco/libzstd-seek/issues/3?

https://github.com/martinellimarco/t2sz/blob/84dab3bbbafa04ea215d9994fee636f01e07cfd6/src/t2sz.c#L74-L75 https://github.com/martinellimarco/t2sz/blob/84dab3bbbafa04ea215d9994fee636f01e07cfd6/src/t2sz.c#L104

martinellimarco commented 2 years ago

Yes, same problem :) I will fix it this Saturday, thank you for reporting it. EDIT: not the same problem. The standard does require these variables to be uint32_t, but that is not related to the error message you are seeing: these are the sizes of individual zstd frames, not of the whole file. There is an improvement I can make here, but it is not related to your problem.

martinellimarco commented 2 years ago

That error means that the header of one of the tar sections has an invalid checksum. Either the tar file is corrupted or there is a bug in the checksum or isTarHeader function.
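
For context, this is roughly how a classic tar (ustar) header checksum is verified; a minimal sketch, not the exact checksum/isTarHeader code from t2sz, and tarChecksumOk is an illustrative name:

    #include <stdlib.h>

    /* Sum all 512 header bytes as unsigned values, counting the 8-byte
       checksum field itself as ASCII spaces, then compare the sum with
       the octal ASCII number stored in that field. */
    static int tarChecksumOk(const unsigned char header[512]) {
        unsigned long sum = 0;
        for (int i = 0; i < 512; i++) {
            /* bytes 148..155 hold the checksum field */
            sum += (i >= 148 && i < 156) ? (unsigned char)' ' : header[i];
        }
        return sum == strtoul((const char *)&header[148], NULL, 8);
    }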

I'm not able to reproduce the problem. I just compressed a 500GB .tar archive of random files with -s 50K without problems.

How was the .tar archive generated? Can you try again with -v and send me the output? It contains file names, so if it's something private, send me just the last ten lines with the names replaced.

You can still compress the file with -r if you want: raw mode doesn't parse the tar headers and compresses any kind of file.

codecnotsupported commented 2 years ago

If I'm correct, it was generated with tar cf tarfile.tar directory. Ok, trying again with -v...

codecnotsupported commented 2 years ago

+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/LIBRARY/images.jpg (14658)
+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/ (0)
+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/SymDepend.cache (125)
+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 2 1515441116.dat (5217361)
# END OF BLOCK (5235200)

+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 2 1516398940.dat (65868)
# END OF BLOCK (66560)

+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 4 1516399059.dat (14674)
+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 3 1516398986.dat (58244)
# END OF BLOCK (74240)

+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/DOMDocument.xml (35973633)
# END OF BLOCK (35974144)

ERROR: Invalid tar header. If this is not a tar archive use raw mode (-r)

codecnotsupported commented 2 years ago

I tested with a different minimum block size. Same output; it ends with:

+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/DOMDocument.xml (35973633)
# END OF BLOCK (35974144)

ERROR: Invalid tar header. If this is not a tar archive use raw mode (-r)

I have a feeling it's related to file name length and/or Unicode. If all else fails I'll just give you a link to the files and see if you can reproduce it. Not that it's very private, but the hosting platform may put a timeout on the download because of bandwidth limitations.

martinellimarco commented 2 years ago

I modified the code to add a verbose error message in case of checksum mismatch. Can you compile from source and try it? I want to see the wrong checksum.

codecnotsupported commented 2 years ago

Ok.

codecnotsupported commented 2 years ago

+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 4 1516399059.dat (14674)
+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 3 1516398986.dat (58244)
# END OF BLOCK (74240)

+ 3 - Backgrounds/901_L04_Ext_Grogar_Lair/DOMDocument.xml (35973633)
# END OF BLOCK (35974144)

ERROR: Mismatching checksum. Expected 0x0000013e but found 0x00000000.
ERROR: Invalid tar header. If this is not a tar archive use raw mode (-r)

codecnotsupported commented 2 years ago

Is DOMDocument.xml the last file of the last block that passed the checksum? Or is it in the block that fails the checksum?

martinellimarco commented 2 years ago

It's the last that succeeded.

martinellimarco commented 2 years ago

Try tar -tvf archive.tar. It will list all the files in the archive in the order they appear. Look at the first file that appears after DOMDocument.xml and try to create an archive with just that file. Does that work?

codecnotsupported commented 2 years ago

This may take a moment...

codecnotsupported commented 2 years ago

-rwxr-xr-x user/docker     65868 2022-07-05 11:42 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 2 1516398940.dat
-rwxr-xr-x user/docker     14674 2022-07-05 11:42 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 4 1516399059.dat
-rwxr-xr-x user/docker     58244 2022-07-05 11:42 3 - Backgrounds/901_L04_Ext_Grogar_Lair/bin/M 3 1516398986.dat
-rwxr-xr-x user/docker  35973633 2022-07-05 11:42 3 - Backgrounds/901_L04_Ext_Grogar_Lair/DOMDocument.xml
hrwxr-xr-x user/docker         0 2022-07-05 11:05 3 - Backgrounds/901_L04_Ext_Grogar_Lair/design.xfl link to 1 - Stills/Props/001_p39_spotlight_with_fireflies/001_p39_spotlight_with_fireflies.xf
-rwxr-xr-x user/docker     15018 2022-07-05 11:42 3 - Backgrounds/901_L04_Ext_Grogar_Lair/PublishSettings.xml
drwxr-xr-x user/docker         0 2022-07-05 11:42 3 - Backgrounds/901_L04_Ext_Grogar_Lair/META-INF/
-rwxr-xr-x user/docker         0 2022-07-05 11:42 3 - Backgrounds/901_L04_Ext_Grogar_Lair/META-INF/metadata.xml

martinellimarco commented 2 years ago

I recreated the same directory structure and a file with the same name, but nothing. I was hoping it had to do with the fact that in the tar header the filename can only be 100 characters long, but that case is handled correctly.
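
For reference, the classic ustar header lays out its fields like this (an illustrative sketch per POSIX, not t2sz's own definition), which is where the 100-character limit comes from:

    /* Classic ustar header: 512 bytes total, all fields ASCII. */
    struct ustar_header {
        char name[100];     /* file name; NUL-terminated only if it fits */
        char mode[8];
        char uid[8];
        char gid[8];
        char size[12];      /* file size, octal ASCII */
        char mtime[12];
        char chksum[8];     /* header checksum, octal ASCII */
        char typeflag;
        char linkname[100];
        char magic[6];      /* "ustar" */
        /* version, uname, gname, devmajor, devminor, prefix, padding... */
    };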

codecnotsupported commented 2 years ago

I'm trying, but I'm running into memory issues while extracting the archive. For now, here's the archive (compressed with regular zstd) split into 3 parts. Use cat xfl* > archive.tar.zst to put them back together. https://drive.google.com/drive/folders/1-46h4j_6nsq0TQtXqhm4jb_DvdVKgJ3G

martinellimarco commented 2 years ago

I'm downloading it, but it will take a while.

codecnotsupported commented 2 years ago

Take your time. I'm debugging with GDB and trying to recreate the bug with a smaller archive.

martinellimarco commented 2 years ago

I successfully reproduced the problem with your archive and reduced it to a 40MB .tar that still shows the problem. It's late at night here; I will resume tomorrow.

codecnotsupported commented 2 years ago

Was this the issue? [image]

codecnotsupported commented 2 years ago

Just saw the fix. Yup.

martinellimarco commented 2 years ago

Yes. The size of DOMDocument.xml inside the tar was 35973633 bytes. The previous code had a rounding error that made the computed end of its block 512 bytes too small, so the next header was read from the wrong position.

The test code

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        size_t size = 35973633;
        printf("%f\n", ceil((double)size/512.0)*512+512);  /* double math: correct */
        printf("%f\n", ceil((float)size/512.0)*512+512);   /* float loses precision */
        printf("%f\n", ceil(size/512.0)*512+512);
        printf("%f\n", ceil((double)size/512.0f)*512+512);
        printf("%f\n", ceil((float)size/512.0f)*512+512);
        printf("%f\n", ceil(size/512.0f)*512+512);
        return 0;
    }

outputs

35974656.000000
35974144.000000
35974656.000000
35974656.000000
35974144.000000
35974144.000000

The culprit is float precision: a float's 24-bit significand cannot represent 35973633 exactly, so it rounds down to 35973632, which is an exact multiple of 512. ceil() then has nothing to round up and the result falls one 512-byte block short. I got rid of the floating point and replaced it with integer math.
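
For comparison, the integer version can be written along these lines; a sketch of the idea, not necessarily the exact committed code:

    #include <stdio.h>

    int main(void) {
        size_t size = 35973633;
        size_t blocks = (size + 511) / 512;   /* ceiling division, no floats */
        printf("%zu\n", blocks * 512 + 512);  /* data blocks + header: 35974656 */
        return 0;
    }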

codecnotsupported commented 2 years ago

https://github.com/martinellimarco/t2sz/blob/main/src/t2sz.c#L311 Should this be strtoul instead of strtol, since size_t is unsigned?
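
Something along these lines, as an illustrative snippet rather than the exact line from t2sz.c (the field being parsed is the 12-byte octal size from the tar header):

    #include <stdlib.h>

    /* Hypothetical helper: parse the octal ASCII size field of a tar header. */
    static size_t parseTarSize(const char field[12]) {
        return (size_t)strtoul(field, NULL, 8);
    }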

martinellimarco commented 2 years ago

Thank you for providing the test archive and for reporting the bug. I would never have found this otherwise. Feel free to remove the link to Drive; I removed the files on my side.

martinellimarco commented 2 years ago

https://github.com/martinellimarco/t2sz/blob/main/src/t2sz.c#L311 Should this be strtoul instead of strtol, since size_t is unsigned?

I think you are right but I'll check it tomorrow. Thanks again.

codecnotsupported commented 2 years ago

Thank you for providing the test archive and for reporting the bug. I would never have found this otherwise. Feel free to remove the link to Drive; I removed the files on my side.

The pleasure was all mine, thanks for the fix and your time.