unknownbrackets / maxcso

Fast cso compressor
ISC License
390 stars 23 forks source link

CSOv2 #65

Closed JoseAaronLopezGarcia closed 2 years ago

JoseAaronLopezGarcia commented 2 years ago

EDIT: CSOv2 is now working perfectly with the latest ARK-4 release. Big thanks to unknownbrackets for his insight on the issues.

Ok so I've implemented CSOv2 on Inferno driver, and it seems to work for the most part but does appear to crash often. Just to clarify a few of the important differences between CSOv1 and CSOv2:

For testing you can use the latest ARK-4 build here: https://github.com/PSP-Archive/ARK-4/releases/tag/v1649147466

Conversion is of course done with maxcso using --format=cso2 and everything else default.

unknownbrackets commented 2 years ago

https://github.com/PSP-Archive/ARK-4/blob/126c2a90b011babca5e5891eaa49ad72caf2b49f/core/inferno/isoread.c#L338

In practice, equal should work, but I recommend src_size >= dst_size, because it's perfectly valid for gaps to be placed between blocks. In practice, this should only happen for compressed blocks (which might have a length that is not a power of 2), so that shouldn't be the cause of any crashing.

https://github.com/PSP-Archive/ARK-4/blob/126c2a90b011babca5e5891eaa49ad72caf2b49f/core/inferno/isoread.c#L286

This looks questionable too. If the indexShift is 1, and index0 is 12070, and index1 is 12090, then the true byte positions of the blocks are at 24,140 and 24,180 respectively which means the source size is 40 bytes.

It looks like the current calculations would give:

But I didn't look very hard. Also, this appears (?) to have the GTA bug I mentioned earlier: read_raw_data (and therefore sceIoLseek and sceIoRead) are called per each block inside the CSO, not once for all reads.

The performance problem from before was that sceIoRead has high overhead. It's kinda like the difference on modern PCs between SSDs and HDDs. There's two measures of performance: IOPS (I/O operations per second) and bandwidth (bytes per second.) SDDs win on both now, but initially the huge improvement was in IOPS. This is basically the overhead and cost per operation (regardless of size.) The PSP has poor IOPS (likely just from how reads of the memstick are handled.)

So if iso_read() is called asking for 128 KB, on a PSP or using an ISO this would translate to:

From the code, it looks like this would typically be converted (assuming a 2048 typical block size, meaning 64 blocks being read) to:

Let's assume this is music (it is, in GTA's case), and the compression ratio is poor - for simplicity, we'll just assume it failed to compress at all. In that case, we'll get:

I believe the fix implemented in PRO was to read ALL compressed block data into the target buffer, but towards the end. So if ratio were 50% compression, and we were reading 128 KB, we would read 64 KB (that's the 128 KB, but 50% compressed) into the second half of the buffer, and then decompress that into the correct place, from the start. This wouldn't require any additional allocations, and should work for any size of read.

However, for uncompressed blocks, memmove() would be required because the target and source of a block could overlap. For compressed blocks, it might be necessary to use a temporary buffer if the target and source regions overlap.

That would result in only 2 operations, which would have a much lesser impact on performance. As I mentioned before, this is the implementation bug that causes CSOs to perform poorly within some CFW. It's not because of gzip/zlib. It's because of the drastic inflation of I/O operations. The index cache is not enough to entirely fix this, it has to be fixed for the block data too.

Back to the crashing, CSOv2 is more experimental so it might be possible there's an unnoticed bug in compressing them. I'd recommend running maxcso --crc on the file just to be sure it has the correct content (since this will cause maxcso to decode and validate the file.) Not sure what else would be different, your list looks right.

-[Unknown]

unknownbrackets commented 2 years ago

For reference: https://github.com/MrColdbird/procfw/blob/d9435ff4af6ba60466b46019393f6bbd1fc72ac0/ISODrivers/Inferno/isoread.c#L526

-[Unknown]

JoseAaronLopezGarcia commented 2 years ago

Great information as always Unknown, I'll implement that IO improvement you mentioned. Thanks!

JoseAaronLopezGarcia commented 2 years ago

Alright, latest release contains the fixes mentioned in this thread: https://github.com/PSP-Archive/ARK-4/releases/tag/r160 CSOv2 now confirmed working fine (even up to 8K blocks). GTA LCS IO greatly reduced (given the proper format and a good memory stick, there is zero lag now).

Code is here: https://github.com/PSP-Archive/ARK-4/blob/802d0fa27cea022f375482727c47066d71f9ee1d/core/inferno/isoread.c#L258

JoseAaronLopezGarcia commented 2 years ago

One thing to note: while the implementation of reading all the compressed data at once does have some benefits, it's not as noticeable as the block offset cache. This is because the ms cache already does an excellent job at reducing IO for compressed blocks. However ms cache doesn't work for small amounts of data (like reading a single block offset).

Either way, from my experience, I've found that ZSO and JSO with 8K block size are (by far) the best performing formats, reaching ISO speeds (zero lag) on any decent memory stick.

unknownbrackets commented 2 years ago

Sounds like everything is fine now, closing this.

-[Unknown]