unknownbrackets / maxcso

Fast cso compressor
ISC License

CSO vs CHD #73

Open crashGG opened 1 year ago

crashGG commented 1 year ago

Test sample: Jeanne d'Arc (Japan).iso, 1,245,249,536 bytes

Compressed to Jeanne d'Arc (Japan).zip using 7-Zip (deflate): 700,926,052 bytes

Compressed to Jeanne d'Arc (Japan).cso using maxcso with default parameters: 752,663,981 bytes. Because seek/index information is added during CSO compression, it is a bit bigger, which is reasonable.

Compressed to CHD using the parameters -c cdzl,cdfl: 486,284,966 bytes. LZMA encoding was deliberately not used; only deflate, to match the conditions of the two results above and ensure a fair comparison.
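
(The exact chdman command line isn't quoted in this thread; given those codec options, it was presumably something along these lines, with the output filename here only illustrative:)

```
chdman createcd -i "Jeanne d'Arc (Japan).iso" -o "Jeanne d'Arc (Japan).chd" -c cdzl,cdfl
```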

[screenshot: 2022-09-13_233153]

CHD also supports seek addressing, yet its size is much smaller than the previous two, even far smaller than directly zip-compressing the file. How can that be?

All three are lossless compression; after decompression, they all match the original file's checksum.

unknownbrackets commented 1 year ago

Given the quality and accuracy of information about CHD (and what it supports or does), it wouldn't surprise me if it was just using lzma after all regardless of what this info tool says.

I'll also note that it's using larger block sizes than you're using with CSO - at least 10x larger, which will lead to better compression. You should try --block=16384 or --block=32768. This will decompress a bit slower than 2048, but not nearly as much slower as CHD would. On any modern phone/laptop/PC, the difference in memory usage is small.
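
For instance, something like this (the output name via -o is optional; maxcso picks one from the input name otherwise):

```
maxcso --block=32768 "Jeanne d'Arc (Japan).iso" -o "Jeanne d'Arc (Japan).cso"
```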

It might be that there are duplicate blocks in the ISO and that CHD is detecting and reusing these blocks, which CSO as a format doesn't support (but could - of course, clients of it would need to change.) A zip file might not handle this situation well if the blocks were far apart.

That said, I've heard (and seen from a couple tests of different ISOs; I don't own the Japanese release of that game) that most of the time, a CSO with larger block sizes compresses about as well.

Anyway, this doesn't really sound like an issue about maxcso.

-[Unknown]

crashGG commented 1 year ago

For CHD, I did not use the default compression parameters, because the defaults involve LZMA encoding. With LZMA involved, the overall file would be smaller, but encoding would be slower. The CHD info also shows that 49.7% of the blocks are compressed with deflate and 50.3% are copied directly. The tool to view CHD info is here: https://github.com/umageddon/namDHC

For CSO, I used the last stable maxcso version with default parameters, which means the block setting is already large; with --block=2048, the final file would be about 9% larger.

I think the reason the CHD is so much smaller is that, for a MODE1/2048 image, the auxiliary data (EDC/ECC) is removed during compression. Of course, this is not an issue with maxcso. The reason I sparked this discussion was to ask whether CSO compression could learn from the ideas and algorithms of CHD compression, so as to greatly improve compression efficiency.

unknownbrackets commented 1 year ago

No, the default block size is 2048 for files smaller than 2GB, such as this one. If you got a file smaller by 9% with default settings, it was some other parameter.

Also, your understanding that PSP ISO files contain error correction bits is incorrect. That's only for PSP CD ISOs, in MODE2, which maxcso does not support anyway. DVD and UMD ISOs do not contain this.

-[Unknown]

crashGG commented 1 year ago

By reading the comments in the code, I finally understood that “Copy from self” does not simply copy a data block without compression. Instead, it checks data blocks during compression, and when a block has the same hash as an earlier one, it logically copies that earlier block, taking up no additional storage. This significantly improves compression efficiency for images containing multiple identical files, or identical parts of files. That is also why, for this sample, CHD's compression efficiency can significantly exceed CSO and 7z deflate. When this condition is not met, CHD's compression efficiency is not much higher than CSO's.

So, could this simple duplicate-block check also be introduced into CSO compression?
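
(A minimal sketch of that idea, illustrative only and not CHD's actual code; DedupBlock and BlockRef are hypothetical names: hash each block during compression, and when the bytes match an earlier block, record a back-reference instead of storing the data again.)

```cpp
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of duplicate-block detection during compression.
// Returns either "duplicate of an earlier block" or "new block, compress normally".
struct BlockRef {
	bool isDuplicate;
	uint32_t blockIndex;  // Earlier block to reuse, or this block's own index.
};

BlockRef DedupBlock(const uint8_t *block, size_t blockSize, uint32_t index,
                    std::unordered_map<uint64_t, std::vector<uint32_t>> &seen,
                    const std::vector<const uint8_t *> &blocks) {
	// FNV-1a as a cheap stand-in for whatever hash a real format would use.
	uint64_t h = 14695981039346656037ULL;
	for (size_t i = 0; i < blockSize; ++i)
		h = (h ^ block[i]) * 1099511628211ULL;

	// Verify the bytes on a hash match, so collisions can't corrupt data.
	for (uint32_t prev : seen[h]) {
		if (std::memcmp(blocks[prev], block, blockSize) == 0)
			return { true, prev };  // Logical copy of an earlier block.
	}
	seen[h].push_back(index);
	return { false, index };  // New data: compress and store as usual.
}
```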

unknownbrackets commented 1 year ago

As a format, the CSO supported by various tools doesn't support that. A new experimental CSO format (i.e. like a CSOv2) could be created to do that though, yes. Software would need to be updated to support it (much like software would need to be updated if new features were added to CHD, or even PNG, or any other format.)

There are more tricks a new format could use. Most compression formats have a minimum overhead of at least a few bytes, so zero-sized and 1-byte-sized blocks could have special meaning. A four-byte block could indicate a reference to another block. Zstd could be used fairly trivially. The trickiest thing is deciding if (and how precisely) a dictionary should be used to improve compression. This could all be done while maintaining decompression speed and keeping blocks and sectors aligned, for efficiency.
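
(As a rough illustration of that block-size trick, purely hypothetical; BlockKind and ClassifyBlock are not part of any real CSO revision. A decoder for such a format could branch on the stored size of each block:)

```cpp
#include <cstdint>

// Hypothetical CSOv2-style classification: stored sizes too small to be
// real compressed data are repurposed as special markers.
enum class BlockKind { ZeroFill, OneByteFill, BlockRef, Compressed };

BlockKind ClassifyBlock(uint32_t storedSize) {
	switch (storedSize) {
	case 0: return BlockKind::ZeroFill;     // Zero-sized: e.g. an all-zero block.
	case 1: return BlockKind::OneByteFill;  // 1 byte: e.g. a block filled with that byte.
	case 4: return BlockKind::BlockRef;     // 4 bytes: index of an earlier block to reuse.
	default: return BlockKind::Compressed;  // Normal compressed data (e.g. deflate/zstd).
	}
}
```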

PPSSPP already uses zstd, for example, so it wouldn't even add much to support such a format.

Anyway, it hasn't been high on my priority list as I'm usually investigating specific behaviors of the PSP and making PPSSPP's emulation more accurate. Adding a new variant of CSO (which wouldn't be supported on PSP or PS2 hardware, likely) and the confusion that might cause makes me more likely to just work on PPSSPP instead with the time I have.

-[Unknown]

crashGG commented 1 year ago

Can you update the compression libs of maxcso? The existing libs are too old. Some newer versions of the libs improve compression efficiency.

unknownbrackets commented 1 year ago

They were updated in 2021 - DEFLATE is a stable format and lz4 hasn't really changed much (I think ARM64 decode has gotten faster, but that matters for the decoder - maxcso would compress the same files either way.)

As noted, switching to a different compression algorithm (i.e. zstd) wouldn't "just work" - it'd create a new version of the format that other tools would have to support.

-[Unknown]