raehik / procfw

Automatically exported from code.google.com/p/procfw

Support LZ4 compression in Virtual ISO mounting #225

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
LZO[1] is a fast compression library that requires no additional memory for 
decompression, and its decompression speed does not change when higher 
compression settings are used.

I think using LZO should allow fast caching, as sectors could probably be 
loaded into memory and remain compressed until needed. It would also probably 
reduce power usage compared with what CSO requires.

[1]http://www.oberhumer.com/opensource/lzo/

Original issue reported on code.google.com by hastur...@gmail.com on 22 Aug 2011 at 10:45

GoogleCodeExporter commented 8 years ago
I guess I'm spoiled too, as I've not had to do printf debugging in ages and 
instead rely upon gdb with breakpoints. But anyway, basically I try ~3-4 games 
of varying types; currently the testing is on the following games.

Capcom Classics Remixed (arcade games emulated), Dungeon Siege Throne of 
Agony (the game that seems to do the most random reads), and Untold Legends 
Brotherhood of the Blade, for right now, as I think that should be enough. 
Mostly it's repeatedly doing the same thing over and over with cso, iso, and 
zso, along with the random sleeps now, since I know that was at least part of 
the issue.

The first game is the only one with a lot of random files, I think. I may try 
a GTA game because they also have a ton of random files, as the other two are 
just big wads with most of the files in them.

I'm hoping that that's enough; mostly I'm doing map changes and other things 
that cause it to read stuff from the memory stick.

Also, my cache is 23MiB with the default 256K. The reason for 23 instead of 
20? I'm not doing online stuff and I'd like to hold as much as possible.

Original comment by 133794...@gmail.com on 8 Jul 2014 at 10:07

GoogleCodeExporter commented 8 years ago
OK, it seems to have fixed those issues. I can't seem to find any way to break 
the thing, and I finally found a game that has a ton of random files all over 
the disk. It's Marvel Ultimate Alliance 1/2; they're both ~1.7GiB with only 
~300MiBish in a solid file and the rest in random directories all over. I think 
the reason why it's so big is because the game has something like 20ish 
characters, all able to speak various lines all through the entire campaign, 
along with audio samples all over the battles.

It's seemingly doing AOK right now, but my god, I can't imagine playing it 
without the isocache, as the thing does a loading screen to open the menus and 
even between them. I know the game was on the ps2, which had ~12MiB more ram 
than the psp available to devs, but even with csos the loading screens aren't 
that bad. I'm going to try the csotest on it though, but thus far I've been 
doing multiple load screens and trying my darndest to trick the thing by 
randomly forcing sleep during loading screens and the like, and it's still 
holding up AOK with iso, zso, and cso.

Original comment by 133794...@gmail.com on 9 Jul 2014 at 3:39

GoogleCodeExporter commented 8 years ago
What's the point in having lz4 compression? I don't see any major size 
difference, so what will the benefit be?

Original comment by shatteredlites on 11 Jul 2014 at 9:53

GoogleCodeExporter commented 8 years ago
Speed. The lz4 algorithm combined with the rewrite of the read functions means 
that the read speed of compressed isos will be almost the same as that of an 
uncompressed one. There are benefits with the patch even if one doesn't use 
lz4. In fact, the size of an lz4 compressed file is bigger than a cso.

Original comment by codestat...@gmail.com on 11 Jul 2014 at 10:45

GoogleCodeExporter commented 8 years ago
For me lz4 isos are not only way faster than csos, but sometimes they're as 
fast as isos, and sometimes they can even end up being very, very slightly 
faster. That, combined with the fact that it's better than gzip, makes it an 
amazing thing to have.

Also, as far as the size of the lz4 compressed ones goes: if you're using 
lz4hc, on average it'll give you ~the same compressed size as level 6 on 
gzip/cso files whilst being dozens of times faster and at the very least 2-3x 
as fast to decompress on the psp.

Also, compressing to a zso makes a smaller file than the eboots from 
popstation. I wish that the popsloader spoke it, but apparently that's using 
its own internal system.

Finally, lz4/lz4hc is being used almost everywhere now. All Frostbite games, 
aka almost all EA games, are using lz4hc for all of the files, and filesystems 
are using it to make it all faster. Originally lzo was filling this 
requirement, but lz4 is way faster; if you're running zfs (bsd) or btrfs you 
can use lz4 compression and actually gain IO speed, even for ssds.

So to just finally say it again: speed, speed and more speed. csos used to take 
forever; with the update they don't take as long, but with zsos you can have 
compression (that's decent most of the time) whilst also not suffering from 
vastly increased load times.

The exceptions are 1k tiny claws (the game uses lz4hc already) and other minis, 
which don't compress that well due to how small they are, and neither do marvel 
ultimate alliance/the warriors; both are games that seem to be originally ps2 
games that were moved over to the psp. But generally I get ~3% worse than cso 
-9 with zsos and way faster load times, so for me this is amazing.

Finally as far as testing goes I haven't found any issues yet with loading any 
of the games that I've tried and I've been trying to find some way to trick the 
system into not running OK by tweaking the iso cache values and randomly 
putting it to sleep and it seems to be holding up perfectly atm.

Original comment by Jimmydea...@gmail.com on 11 Jul 2014 at 11:03

GoogleCodeExporter commented 8 years ago
Hmm, I wonder, would this fix the frame drop in GTA games? I'm not sure if this 
was an issue that was in the game when it was made, or if it's the loading 
times of a compressed cso.

Original comment by shatteredlites on 12 Jul 2014 at 10:05

GoogleCodeExporter commented 8 years ago
>earlier in the thread if you managed to I don't know do a search.

>On my tests i managed to play GTA Liberty city stories from a CSO compressed 
at level 9 (cpu at 333 and ms access speedup enabled) without virtually any lag 
whatsoever. 
>Also tested other 2 compressed games with similar results.

And that's with a CSO; a zso has much, much lower latency/lag times and thus 
works a ton better. Finally, in my tests of this I had one single weird psp 
crash where it just shut itself down after constantly reading data (or 
writing it). I'm currently trying to get it to happen again without sleeping 
it as I did before.

But anyway, testing is still going semi-ok besides that one single weird crash, 
but until I can force it to do it again I'll try to start up psplink.

Original comment by Jimmydea...@gmail.com on 12 Jul 2014 at 10:14

GoogleCodeExporter commented 8 years ago
OK, I've been trying to find a way to get this thing to crash in a way that's 
repeatable, but I'm unable to do so. The Warriors seems to be crashing, but 
that could have just been a bad iso on my part; on games that ran, I can't 
get it to crash repeatedly with a known good iso/zso/cso, so it seems to be AOK 
right now.

Original comment by 133794...@gmail.com on 2 Aug 2014 at 2:47

GoogleCodeExporter commented 8 years ago
This is an interesting change.  I was looking into the cso format and also 
thought lz4 would be better, and found this.

I personally think it would be more interesting to mix deflate and lz4.  This 
would allow, for example, faster decompression of some sectors, and better 
compression of others.  The "plain" flag could easily be used for this purpose, 
as long as the format guaranteed that a block of size 2048 would never be 
compressed (why would it, anyway, except to waste cpu time?)  This would be 
fairly easy in the code.

The read reduction is definitely a good optimization.  A small tweak I might 
recommend is that if the data is compressed, and dst <= src - block_size, a 
memcpy() isn't necessary - can just decompress directly to dst.  This might 
happen with a 32KB read where the compression ratio is decent, for example.  
Though, this might make it harder to cache the block (are you doing that?)

I'd like to note that CISO_DEC_BUFFER_SIZE is way too big.  It's 8KB, but no 
matter what code path, it'd take a 128GB iso to use 2176 of it (with 2048 byte 
blocks.)  That being said, I think allowing for a larger block size would be 
nice.  Largely, this would only mean replacing some instances of 
ISO_SECTOR_SIZE and 2048 with g_CISO_hdr.block_size (which could be changed to 
a shift value if you're into micro-optimization.)
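
The shift-value idea can be sketched as follows. This is only an illustration in Python rather than the driver's C, and `block_shift` and `locate` are hypothetical names, not identifiers from procfw; it assumes block sizes are always powers of two.

```python
def block_shift(block_size):
    """Return log2(block_size); the block size must be a power of two."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block size must be a power of two")
    return block_size.bit_length() - 1


def locate(offset, shift):
    """Split a byte offset into (block number, offset inside that block)."""
    block = offset >> shift                # replaces offset // block_size
    within = offset & ((1 << shift) - 1)   # replaces offset % block_size
    return block, within
```

With the shift stored once in the header, the same byte offset maps to a different block number depending on the block size, and no division is needed on the read path.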

A larger block size (e.g. 4KB) would halve the index size (improving its 
cache), and allow for better compression.  For >= 4KB reads, this might result 
in faster reads.  It might slow down sector-by-sector reads, though.  This 
could be cured by using CISO_DEC_BUFFER_SIZE as a cache in decompress_block() 
(which might also help multiple small < 2KB reads in the same sector?)
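
To put a number on the index-size point, here is a rough sketch. It assumes a CSO v1-style index with one 32-bit entry per block plus one terminating entry; `index_size` is a hypothetical helper, and the image size used in the usage note is the Crisis Core ISO size quoted later in this thread.

```python
INDEX_ENTRY_BYTES = 4  # assumption: 32-bit index entries, as in CSO v1


def index_size(iso_size, block_size):
    """Bytes needed for the block index of an image of iso_size bytes."""
    blocks = -(-iso_size // block_size)       # ceiling division
    return (blocks + 1) * INDEX_ENTRY_BYTES   # one extra entry marks the end
```

Under those assumptions, `index_size(1716715520, 2048)` gives 3352964 bytes and `index_size(1716715520, 4096)` gives 1676484 bytes, so doubling the block size roughly halves the index.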

However this is finalized, I can add support to ppsspp as well.

-[Unknown]

Original comment by unknownb...@gmail.com on 25 Oct 2014 at 6:24

GoogleCodeExporter commented 8 years ago
@Unknown

When I first suggested that we do lz4, it was because the linux kernel was 
proving that it did very well with 4k sectors and above. I also believe that 
there's been talk about trying to do 4k sectors, as it'd much better increase 
the compression ratio for all of the files. I'm sure that this'd greatly help 
games that have a ton of their space taken up with pre-compressed files, as 
you'd get more of them to get their fluff off of them. Now, in terms of mixing 
deflate and lz4, that'd not do much at all if you look at what I've said above.

lz4 hc is about the same compression ratio as deflate level 6, and on psp umds 
the difference between level 6 and level 9 on 2k is minuscule most of the time. 
From what I've seen it's ~3-5% difference at most, which is next to nothing, 
and mixing them, if you ask me, would greatly increase the complexity of the 
decompression algorithm. What's more likely to help is adding the ability to do 
4k sectors and then having the option to store a sector uncompressed if it is 
not compressible.

Original comment by 133794...@gmail.com on 25 Oct 2014 at 7:07

GoogleCodeExporter commented 8 years ago
As mentioned, it would be a small code change to support both compression 
methods simultaneously.  I say this as a developer looking at the patch.  No 
need for you to guess that it might be a complex change.

Supporting alternate block sizes is certainly a more complex change, but 
probably fairly easy to verify by tweaking ciso.py to spit out an iso of 
alternate size.  It should be easy with the "ng" path, as well, more trouble 
with the fallback.

-[Unknown]

Original comment by unknownb...@gmail.com on 25 Oct 2014 at 7:22

GoogleCodeExporter commented 8 years ago
I did all of the testing on the patch(es) previously, and as I said, I don't 
think it'd help much at all. That's what I was talking about; the relative 
complexity isn't a huge thing beyond having to redo all of the build tests by 
repeatedly trying to catch the driver in an unknown state. But since this is 
about trying to make the cisos smaller, I don't see many reasons why 
deflate/lz4 would honestly improve much of anything. On files that don't 
compress well, neither deflate level 9 nor lz4 hc compresses them very well. I 
can't honestly think of many cases where it'd truly be a better decision than 
simply upping the block size.

Original comment by 133794...@gmail.com on 25 Oct 2014 at 7:42

GoogleCodeExporter commented 8 years ago
Well, this can easily be measured.

I've added experimental support for lz4 in my cso compressor I've been toying 
with (mostly playing with libuv), and support for both cso and my proposed 
format.

https://github.com/unknownbrackets/maxcso/releases

It can take existing cso, dax, or zso files as well, so no need to decompress 
your inputs.

Using Crisis Core (ULUS10336) as an example, I've put some data at the end.  
Conclusions are here, scroll down for the data.

At block size 2048, using lz4 definitely comes at a cost.  8 points worse 
compression ratio, which is certainly more than 3%.

For a user who wants to maximize space, mixing lz4/deflate is a win.  They get 
a smaller AND faster loading file.

For block size 8192, the first impression is obviously "huge win."  This varies 
widely by input file, though, and larger block sizes may hurt decompression 
speed in some scenarios.

With this larger block size, you lose only 4 points to lz4, which is clearly 
better.  Again, mixture allows for slightly better compression and lots of lz4 
usage (so better decompression speeds than zlib alone.)

As far as complexity, here's the patch that enabled both lz4 and deflate 
reading (as well as zso) in maxcso:
https://github.com/unknownbrackets/maxcso/commit/cc55a619fb852f91dd8424aef33594eaf79840c0

Very simple.  I've also written up a more formalish spec of the format:
https://github.com/unknownbrackets/maxcso/blob/master/README_CSO.md

Personally I think supporting 2K, 4K, or 8K block sizes would be ideal.  It has 
clear benefits, at least in some games, despite the change cost.  lz4+deflate 
allows everyone to win: same or better compression ratios, faster loading 
times, and people can still use 100% lz4 if they want.  Plus it's easy to 
support if adding lz4 and retaining cso v1 anyway.

-[Unknown]

Data:

Original ISO: 1716715520

Block size 2048 (standard):
ZSO (with lz4hc level 16): 1134591998 (66.09% ratio, 66.42% blocks lz4)
CSO (with default zlib level 9): 1006322266 (58.62% ratio, 71.06% blocks deflate)
CSO (with 7zip + zlib level 9): 996709608 (58.06% ratio, 72.28% blocks deflate)
CSO v2 (7zip + zlib 9 + lz4hc 16): 996393914 (58.04% ratio, 4.65% lz4, 69.83% deflate)
CSO v2 (zlib 9 + lz4hc 16 5% bias): 998737678 (58.18% ratio, 24.02% lz4, 49.94% deflate)

Block size 8192:
ZSO (with lz4hc level 16): 563252647 (32.81% ratio, 100% blocks lz4)
CSO (with default zlib level 9): 499600821 (29.10% ratio, 100% blocks deflate)
CSO (with 7zip + zlib level 9): 493327409 (28.74% ratio, 100% blocks deflate)
CSO v2 (7zip + zlib 9 + lz4hc 16): 491593962 (28.64% ratio, 29.80% lz4, 70.20% deflate)
CSO v2 (zlib 9 + lz4hc 16 5% bias): 497000736 (28.95% ratio, 54.98% lz4, 45.02% deflate)

Block size 4096 (selected):
ZSO (with lz4hc level 16): 575563344 (33.53% ratio, 100% blocks lz4)
CSO (with default zlib level 9): 517909560 (30.17% ratio, 100% blocks deflate)
CSO v2 (7zip + zlib 9 + lz4hc 16): 506935875 (29.53% ratio, 31.25% lz4, 68.75% deflate)

Block size 65536 (selected):
ZSO (with lz4hc level 16): 541828787 (31.56% ratio, 100% blocks lz4)
CSO (with default zlib level 9): 478272771 (27.86% ratio, 100% blocks deflate)

For your reproduction, the arguments in order are (append --block=X for block 
sizes):
maxcso in.cso -o out.zso --format=zso
maxcso in.cso -o out.cso --fast
maxcso in.cso -o out.cso
maxcso in.cso -o out.cso --format=cso2
maxcso in.cso -o out.cso --format=cso2 --lz4-cost=5
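
For anyone checking the "points lost to lz4" figures against the raw byte counts, a small sketch like the following should do. The sizes are copied from the data above; the helper names are made up for illustration.

```python
ISO_SIZE = 1716715520  # original ISO size reported above

# (block size, format) -> output size in bytes, copied from the data above
SIZES = {
    (2048, "zso"): 1134591998,
    (2048, "cso"): 1006322266,
    (8192, "zso"): 563252647,
    (8192, "cso"): 499600821,
}


def ratio_percent(size, original=ISO_SIZE):
    """Compressed size as a percentage of the original size."""
    return 100.0 * size / original


def lz4_cost_points(block_size):
    """Percentage points lost by ZSO (lz4hc) relative to CSO (zlib 9)."""
    zso = ratio_percent(SIZES[(block_size, "zso")])
    cso = ratio_percent(SIZES[(block_size, "cso")])
    return zso - cso
```

By this calculation the gap is roughly 7.5 points at block size 2048 and roughly 3.7 points at 8192, in line with the rounded figures in the conclusions.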

Original comment by unknownb...@gmail.com on 1 Nov 2014 at 11:01

GoogleCodeExporter commented 8 years ago
Have you done some tests with games that load data continuously from the ISO 
(like GoW: GoS and GTA games)? These patches were done with the idea of 
allowing the user to have a fairly compressed game while enjoying a lag-free 
experience like an uncompressed ISO.

About storing uncompressed sectors, this is already supported in the ciso.py in 
this project: one can choose the compression threshold at which the compressed 
block is discarded and the uncompressed block is used in its place.
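
The threshold idea can be sketched roughly like this. It is not the actual ciso.py code: `encode_block` and its `threshold` parameter are hypothetical, and it assumes blocks are compressed as raw deflate streams (negative window bits in zlib).

```python
import zlib


def encode_block(block, level=9, threshold=100):
    """Return (stored_bytes, is_compressed) for one sector-sized block.

    threshold is a percentage of the original size: the compressed form is
    kept only if it is smaller than that fraction of the raw block.
    """
    # Assumption: raw deflate stream (negative wbits means no zlib header).
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    packed = compressor.compress(block) + compressor.flush()
    if len(packed) * 100 >= len(block) * threshold:
        return block, False  # not worth it: store the block uncompressed
    return packed, True
```

A lower threshold stores more blocks raw, which trades file size for less decompression work at load time.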

I agree completely that the cso format needs to work with bigger sector sizes, 
but the problem is that procfw assumes in a lot of places that the sector size 
is 2048. A deduplication of code is needed so this can be maintained more 
easily (right now, Inferno, Galaxy and vshctrl use their own decompression 
code for cso).

Original comment by codestat...@gmail.com on 2 Nov 2014 at 12:09

GoogleCodeExporter commented 8 years ago
I'm not talking about uncompressed sectors.  I'm talking about lz4 vs deflate 
sectors.  My proposed "cso v2" format allows for storing both lz4 and deflate 
in the same file (as well as uncompressed blocks, purely based on the size of 
the compressed block... >= block_size means uncompressed, since it's silly to 
compress it in that case anyway, just wastes cpu time.)

I'm mostly concerned about having a good format that can improve and make sense 
for both cfw and emulators (like ppsspp) to support.  The difference of 5% can 
mean gigabytes in total, but lz4 can nevertheless improve load times on both 
the PSP and on Android devices.

Anyway, the format itself does support larger block sizes, and ppsspp already 
handles them since this pull: https://github.com/hrydgard/ppsspp/pull/7027

But yeah, I saw the trouble with duplication and hardcoded/misused vars, so I 
can see how supporting larger block sizes is a pain.  Still, I'd like to see a 
mixed format (lz4 + zlib) rather than an lz4-only format, considering how easy 
it would be to support.  All it takes is setting lz4_compressed for the 
0x80000000 flag, and reading raw data for when size >= g_CISO_hdr.block_size.
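
A rough sketch of that decision, in Python for readability rather than the driver's C. `classify_block` is a hypothetical helper; it assumes CSO v1-style index entries whose low 31 bits hold the block position shifted right by the header's alignment value, with the size taken from the distance to the next entry.

```python
LZ4_FLAG = 0x80000000  # top bit of an index entry, per the proposal above
POSITION_MASK = 0x7FFFFFFF


def classify_block(entry, next_entry, block_size, align=0):
    """Return (file_offset, stored_size, method) for one index entry pair.

    method is "raw", "lz4" or "deflate".
    """
    start = (entry & POSITION_MASK) << align
    end = (next_entry & POSITION_MASK) << align
    size = end - start
    if size >= block_size:
        return start, size, "raw"  # too big to be compressed: read as-is
    if entry & LZ4_FLAG:
        return start, size, "lz4"
    return start, size, "deflate"
```

The size check comes first, so a full-size block is always read raw regardless of the flag bit.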

As for testing; I have not done much speed testing on a PSP device.  I'm 
assuming that lz4 is faster from your testing (I mean, I know it's faster on 
desktop, obviously.)  If the io read size is approximately the same, and lz4 is 
faster, then a file of roughly the same size but composed of 30% lz4 blocks 
(and 70% deflate) will naturally decompress faster than one of 100% deflate.
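
That reasoning amounts to a weighted sum, which can be sketched as below. This is a deliberately crude model: it ignores I/O time and caching, and the throughput arguments are placeholders the caller must supply, not measured PSP numbers.

```python
def decompress_seconds(total_mb, lz4_fraction, lz4_mb_s, deflate_mb_s):
    """Estimate CPU time to decompress total_mb of output data.

    lz4_fraction of the data is lz4, the rest is deflate. I/O is ignored.
    """
    lz4_mb = total_mb * lz4_fraction
    deflate_mb = total_mb - lz4_mb
    return lz4_mb / lz4_mb_s + deflate_mb / deflate_mb_s
```

As long as lz4 throughput is higher than deflate throughput, a 30% lz4 file comes out ahead of a 100% deflate file in this model, and a 100% lz4 file comes out ahead of both. How large the gap is on real hardware depends entirely on the measured speeds.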

However, I can do some benchmarks on a PSP, maybe next weekend.  Reading the 
above a bit more, I think I misunderstood and thought the csotest was run on a 
PSP not a PC.

-[Unknown]

Original comment by unknownb...@gmail.com on 2 Nov 2014 at 2:05

GoogleCodeExporter commented 8 years ago
Oh, but I actually have neither of the games you mentioned (except the demo of 
GoW: GoS, but not sure how to get hard numbers out of running a game anyway.)

I'd have to do synthetic tests.  I can just read the cso manually and generate 
timings based on lz4 vs zlib and possibly the impact of block sizes.  If you 
have the access patterns, that would help (e.g. does it generally re-read the 
same blocks but after reading other non-sequential blocks?  or does it just 
read different blocks all the time?)

-[Unknown]

Original comment by unknownb...@gmail.com on 2 Nov 2014 at 2:17

GoogleCodeExporter commented 8 years ago
Which levels did you do for lz4? There is _no_ lz4hc level 16, I don't know 
where you're getting it from. The api has no such thing as those levels. It's 
lz4 or lz4hc in terms of the api. So I don't know where you're getting it from. 
Ciso also only does lz4 and lz4hc.

Finally, lz4hc was chosen because it takes way less cpu time in terms of 
decompression, which is a _huge_ thing for the psp as not everyone has millions 
of cpu cycles to waste. With zlib, the higher the level, the more cpu time is 
taken during decompression, whereas lz4hc uses the same decompressor as lz4, so 
it's constant speed in terms of decompression.

Also, the whole point of lz4hc is to compare it with zlib level 6. As I had 
said previously, that's what lz4hc was being compared against: the zlib level 6 
compressor. If you're comparing it with level 9, obviously it's going to lose 
more. I said it's in general ~3% more compressibility in terms of level 6 vs 
level 9 in most of the games.

And I know for a fact that the games lag like balls when you're trying to play 
it with zlib level 9, hell even level 6 most of the time _increases_ my loading 
times for the games that I've tested. It only makes it slower, so why would I 
want to increase them?

LZ4 was made to decompress at maximum speed humanly possible.

The csotest was one to make sure that there were _no_ issues in a series of 500 
reads. It's a basic test of the iso driver to test for memory leaks/obvious 
problems. It still didn't catch everything, as my own testing showed.

And as far as 30% lz4 and 70% deflate, you're still going to end up harming the 
loading speed. There are very few places where lz4hc will be that far behind 
zlib in terms of the sectors that compress better with zlib. I was playing 
marvel ultimate alliance, which is a game with a crapton of loading; it's 
constantly streaming stuff in.

Each of the characters has their own catch lines or whatever, and they're 
always being swapped in and out, and loading is pretty commonplace all 
throughout the thing. It was also a game that didn't compress that well. zlib 
level 6 made my avg load times from my memory stick ~13s worse compared to raw 
iso, whereas lz4hc was about the same and in some cases a bit better. It was 
+/-2s from the default in most of the cases, as a vast majority of the game's 
assets are atrac3 (or what I imagine they are) compressed sound files; it's 
almost like 500MB or something of the entire iso.

So anyway yeah, the whole reason was to add compression support that made the 
games load better whilst also providing more space. And with intermingling them 
I can't imagine it'd be really worth it that much in terms of the format 
itself. I can't see it doing much good as if you did tests with zlib level 6 
intermingled with lz4hc(what is the good compression ratio for both) it'd not 
end up with much of an improvement.

LZ4 is meant to be blazingly fast and is open source, lz4hc is meant to be 
slower to compress but just as blazingly fast to decompress. My tests were on a 
mips cpu of 1ghz with mddr ram which is slow as balls. I don't know how 
slow/fast the psp's ram is but I do know that lz4hc is always faster loading 
versus zlib level 6.

Original comment by 133794...@gmail.com on 2 Nov 2014 at 2:27

GoogleCodeExporter commented 8 years ago
Also, if you read it, it's 3-5% better compression of a cso when using zlib 
level 9 vs level 6. As in, it's ~5% better compression for zlib level 9 versus 
zlib level 6, and lz4hc is similar to zlib level 6 in almost all cases and 
loses a few percent to it, but it greatly makes up for it by not wasting 
precious cpu time.

As for android devices, lz4 will likely have a good result on them like on the 
psp, though probably not as pronounced, as most android phones have plenty of 
cpu and ram, so it's not going to be the difference between waiting forever and 
loading at an OK speed.

I'll probably try to redo God of War and the gta games again and compare the 
loading times, but I know that for the games that I did test, zlib was always 
slower than lz4(hc) and most of the time made the loading times much worse, 
since you're bottlenecked by the cpu trying to decompress the blocks instead of 
being IO bound (as it normally is with the uncompressed games). lz4 is always 
going to have some effect on the games depending on how well it compresses, but 
it hasn't made anything that I've played lag worse than it did with a plain iso.

Original comment by 133794...@gmail.com on 2 Nov 2014 at 2:34

GoogleCodeExporter commented 8 years ago
Yes, the csotest was run from a PC; debugging on the psp is a pain and I needed 
to use valgrind to proof-test the read rewrite that I was doing.

I suppose that I can generate the read patterns of GTA and make a log with the 
type of blocks that it tries to read. The only thing that I can remember off 
the top of my head right now is that GTA reads in blocks of ~80KB and the lag 
happened because the next read was not ready at the time it needed to continue 
loading the city.

Anyway, the real bottleneck of the cso format was in the read method (2k block 
reads at a time totally killed any performance gains), and the deflate/lz4 
compression doesn't have that much impact.

With my limited time, the most that I can do is to work on adding support for 
the cso2 format and variable block size for the Inferno driver and vshctrl (so 
the games can be displayed on the XMB).

BTW, I haven't done many tests, but the psp memory in kernel mode is really 
scarce. With more than 64k of block size there is a chance that the game won't 
run at all if the user has enabled many heavy plugins. Not sure if this limit 
should be enforced by the format or not, since it only affects the psp (and the 
pspemu on the vita).

Original comment by codestat...@gmail.com on 2 Nov 2014 at 2:39

GoogleCodeExporter commented 8 years ago
One last final post for today, as far as the tests go.

Here's how I did it.

I got the game, I did the same thing over and over ~10 times in a row timing it.

After I did the test, I did a cold boot again(to make sure there's nothing 
lying around in the iso cache) and then did it over and over.

For some of the games, I simply loaded my save file and went to some place 
where I knew there'd be another loading screen. In others I went through the 
same 3-4 levels, measuring the timing for them. That's the easiest way to get 
the same results (hopefully) from your tests. You do the same thing over and 
over and over again and time it. Since you're doing the same thing, the game's 
likely to want to do the same things; there'll probably be some variance on gta 
since people move and such.

But for the most part, that's how I measured the loading time differences on my 
psp. I kept them all at the same settings, no plugins, everything at the same 
level, read the same cso/zso over and over and over again and went from there. 
And synthetic benchmarks are pretty much useless. As they say, figures don't 
lie but liars can figure; go look at amd/intel/nvidia, they all use the same 
synthetic benchmarks and can game the system.

Look at the estimated MPG here in the US: all of the car companies know what 
the testing track is, and the US only changes it up every 5-10 years. So each 
car company can modify their engines so the figure comes out better than it 
actually is. It's the same thing with android: phone makers knew what the 
benchmarks were and were able to make their phones score higher than they 
really should. The only real way to figure out the performance characteristics 
is to make a real test and do it in the real world.

I'm sure there's some way to make the system log out all of the lba reads and 
you could record those and then use that for your synthetic test by reading 
those certain blocks in the same order that the psp did by sticking that memory 
stick into your computer and running the tests on it.  I've been way too far 
behind on trying to put debugging into the cfw's iso reader mainly due to 
health issues/life but that could help to give you some information about how 
the games do their reads.
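
A replay test along those lines could look roughly like this. It is only a sketch: the log format (a list of `(lba, sector_count)` pairs) is an assumption, since no such logging exists in the cfw yet, and it reads a plain ISO rather than going through any cso/zso decompression.

```python
import time

SECTOR_SIZE = 2048  # ISO sector size used throughout this thread


def replay_reads(image_path, reads):
    """Replay (lba, sector_count) reads against an image and time them.

    Returns (elapsed_seconds, bytes_read).
    """
    total = 0
    start = time.perf_counter()
    with open(image_path, "rb", buffering=0) as image:
        for lba, count in reads:
            image.seek(lba * SECTOR_SIZE)
            total += len(image.read(count * SECTOR_SIZE))
    return time.perf_counter() - start, total
```

On a PC the operating system's page cache will hide most of the memory stick latency after the first pass, so repeated runs would need the cache dropped (or the card re-inserted) to mean much.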

Original comment by 133794...@gmail.com on 2 Nov 2014 at 2:44

GoogleCodeExporter commented 8 years ago
https://code.google.com/p/lz4/source/browse/trunk/lz4hc.c#611
https://code.google.com/p/lz4/source/browse/trunk/lz4hc.h#70

I'm using a value outside the "recommended values", and actually, I'm trying 
multiple levels because sometimes a lower level actually saves a couple bytes 
vs a higher level.

I realize that python-lz4 doesn't expose it, but that doesn't make it not exist.

You are *absolutely* wrong about zlib.  Higher levels DO NOT take more 
decompression time.  In fact, due to reduced io time, they can take LESS time.  
The only reason why a more heavily compressed cso file could read slower is 
because of blocks that were uncompressed before and are compressed now.

As an example, look at Google's Zopfli.  There's a reason why everyone is 
trying to max out the compression level of deflate for the web, and it's 
exactly because it DOESN'T increase the decompression time on the other end.
http://techcrunch.com/2013/02/28/google-launches-zopfli-compression/

You can see decompression times here as well (and surely on many other 
benchmarks published in the last few decades):
http://tukaani.org/lzma/benchmarks.html

I realize this misinformation is well spread among cso users, but it's just 
wrong.  There's no evidence to back it up and plenty of evidence to the 
contrary.

Nevertheless, lz4 decompresses faster than zlib (at least on most 
architectures, and I thought on MIPS as well.)  That being said, it's not clear 
to me if you are conflating the results of the io optimization codestation made 
(which is not directly related to using lz4, but was done with the patch) with 
the results of lz4 decompression itself.

Regardless, this doesn't mean the format should not support the use case of 
people wanting to save space while at the same time supporting the use case of 
people wanting maximum speed.  Even if the file being 100% lz4 results in the 
absolute best performance, that doesn't mean that should be the only thing the 
file format supports.  Why not let everyone have increased performance, and 
people who care about disk space have some of both?

-[Unknown]

Original comment by unknownb...@gmail.com on 2 Nov 2014 at 2:50

GoogleCodeExporter commented 8 years ago
You're comparing web browsers vs a low-powered mips cpu. And I already know of 
zopfli it does little more than running 7zip's zip compressor with 15 
iterations when I was doing it's one on 100 iterations. Also with deflopt.exe 
zopfli compresses worse than the previous program+advzip's compressor on level 
4.

Finally about the "amazing speed of zlib level 9 being so fast."

Look at lz4; I've run the benchmarks and from ram it's very similar for me.

Compressor      Ratio   Compression (MB/s)   Decompression (MB/s)
LZ4 HC (r101)   2.720   25                   2080
zlib 1.2.8 -6   3.099   21                   300

lz4hc decompression is 2GB/s; zlib? 300MB/s. That's an insanely _massive_ 
difference in terms of decompression time. I can't imagine that you'd get a 
better result in terms of decompression time when mixing the two, considering 
that lz4hc is almost an order of magnitude faster than zlib. And you're also 
wrong about the levels not being exposed: if you use the cli program for lz4 
you have lz4/hc and you can increase the block size; that's the only "level" 
that it has beyond the two different modes. That's it: it'll increase the block 
size, and the only other thing that also makes it compress better is making 
each block depend on the previous block, which is the default mode of zlib.

In your tests, were you making sure that lz4hc was using inter-block 
dependency, as is the default mode for zlib? Also, you're throwing out the 
blocks that weren't compressible to a certain percent, it seems. I didn't do 
that with my ratios; I did the full thing, so it was compressed even if it 
ended up making the files bigger. Also, ff7 cc is basically just one huge 
archive file. I don't know if it's using any compression or if it's just the 
raw data. I believe it's just the raw data encrypted somehow.

And zopfli and efforts like it are nice to see, but most sites don't do gzip 
compression above level 1 by default. Almost every company out there that's 
doing gzip compression has kept it on level 1, even for static assets.

Original comment by 133794...@gmail.com on 2 Nov 2014 at 3:21

GoogleCodeExporter commented 8 years ago
I forgot to say: right here are the results of ciso with zlib -9 and lz4 hc. 
And what do you know, ~5% difference in terms of compression ratio. Hmmm.... I 
wonder why that matches up _entirely_ with my internal results that it's ~3-5% 
difference in terms of compression ratio in total for them.

$ ./ciso.py -m -a 0 -c 9 ff7_cc.iso ff7_cc.cso
ciso-python 2.0 by Virtuous Flame
Compress 'ff7_cc.iso' to 'ff7_cc.cso'
Compression type: gzip
Total File Size 1716713472 bytes
block size    2048  bytes
index align  1
compress level  9
multiprocessing True
ciso compress completed , total size = 1173993434 bytes , rate 68%

$ ciso ff7_cc.iso 
ciso-python 2.0 by Virtuous Flame
Compress 'ff7_cc.iso' to 'ff7_cc.zso'
Compression type: LZ4
Total File Size 1716713472 bytes
block size    2048  bytes
index align  1
compress level  9
multiprocessing True
ciso compress completed , total size = 1267014960 bytes , rate 73%

What do you know, it says _exactly_ what I said, so why are you being so insane, 
trying to act like I'm pulling numbers out of my ass? I said it wasn't a 
huge difference. This is with the default settings for both, with no minimum 
compression threshold, and it's _identical_ to what I had said above. So I still 
stand behind my comments and it seems my memory wasn't failing me at all.

If you're doing something that's _not default at all_, as you were doing, sure, 
you may come up with the figures that you've said, but as I've said, a difference 
of 5% seems to be the maximum of the scale. So one final time: my figures were 
correct, and your hand-crafted benchmark using non-default situations is 
nowhere near what mine was.

And finally/for the record it seems like UMDgen's ciso compressor shaves off at 
most ~1-2% in the best cases compared with ciso.

Original comment by 133794...@gmail.com on 2 Nov 2014 at 3:33

GoogleCodeExporter commented 8 years ago
codestation: indeed, debugging on the psp is a pain.  Sorry, didn't see your 
post before, guess I started typing right before you posted.

Well, 80kb blocks sound like they would be most influenced by the io read speed 
and the number of io operations.  So that is most likely influenced by the 
optimization you made.  It could even be a simple timing/scheduling thing (i.e. 
the actual io may not be taking longer in sum, but it may be scheduling to 
other threads... not sure if the sceIoRead() schedules based on the priority of 
the calling thread, or the kernel?)

Yeah, I realize memory is scarce.  As mentioned, CISO_DEC_BUFFER_SIZE is 
already too large.  I don't think block sizes larger than 8KB make any sense, 
the point of those numbers was to show that the gains got small.

133794: not really sure why you're angry that I get better compression ratios 
than you.  I'm sorry but I don't really have the time to respond to all the 
things you've said, but "everyone uses level 1", etc... no offense intended, 
but I think you need to drink a cold glass of water and actually research these 
things before you say them.

Also, I'm not trying to attack you somehow by providing data about compression 
ratios and mixtures of formats.  I'm sorry if you've somehow gotten that 
impression.

As far as "non-default", that's the whole point of programming.  If everyone 
just said "default is good", Yann would've never created lz4.  Not sure why 
that makes you angry either.  I've provided full source code for everything, 
not some sort of "hand crafted" benchmark.

-[Unknown]

Original comment by unknownb...@gmail.com on 2 Nov 2014 at 4:07

GoogleCodeExporter commented 8 years ago
Oops, I had a bug in the cso reading code introduced when messing with the new 
stuff, it's fixed now.  The ratios above aren't right, but they are right 
relative to each other for 2048, which was the entire point anyway.

Block size 8192 was however majorly affected (the bug caused the file to appear 
more compressible, especially to lz4):

deflate only: 1113812184 (64.88%)
lz4 only: 1263270481 (73.59%)
deflate+lz4: 1113560557 (64.87%)
deflate+lz4+5% lz4 bias: (65.00%)

The 5% bias means that if deflate would get a block down to 80% of its size, 
then it will use lz4 as long as lz4 gets it to <= 85% of the size.  If deflate 
were 10%, it would only use lz4 if lz4 were <= 15%.
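
As a rough sketch of that rule (the function name `pick_format` and its parameters are made up for illustration, and this is only one way to read the description, not the actual code used for these numbers):

```python
def pick_format(raw_size, deflate_size, lz4_size, bias_percent=5):
    """Pick the format for one block from its two compressed sizes.

    lz4 is chosen whenever its result is no more than `bias_percent`
    percentage points (of the raw block size) worse than deflate's.
    Integer arithmetic keeps the comparison exact.
    """
    if lz4_size * 100 <= deflate_size * 100 + raw_size * bias_percent:
        return "lz4"
    return "deflate"


# deflate at 80% of a 2000-byte block: lz4 wins up to 85%
print(pick_format(2000, 1600, 1700))  # lz4
print(pick_format(2000, 1600, 1701))  # deflate

# deflate at 10%: lz4 wins only up to 15%
print(pick_format(2000, 200, 300))    # lz4
print(pick_format(2000, 200, 320))    # deflate
```

With `bias_percent=0` this degenerates to "use whichever is smaller, lz4 on a tie".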

Anyway, the trend from 2048 only shows up more strongly: lz4-only loses just 
short of 9 points (about 9 GB on 100 GB of uncompressed isos).  A combination, 
even with some bias, loses very little (0.13 points, so 130 MB on 100 GB of isos).
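
For anyone wanting to redo the arithmetic, here is a small sketch. It assumes the percentages are relative to the 1716713472-byte ff7_cc.iso from the ciso runs earlier in the thread (that size does reproduce the quoted percentages). The biased variant is left out because no byte count was given for it.

```python
ISO_SIZE = 1716713472  # "Total File Size" from the ciso output (assumed base)

# Compressed sizes in bytes at block size 8192, as quoted above.
sizes = {
    "deflate only": 1113812184,
    "lz4 only": 1263270481,
    "deflate+lz4": 1113560557,
}


def percent(size):
    """Compressed size as a percentage of the original iso."""
    return 100 * size / ISO_SIZE


baseline = percent(sizes["deflate only"])
for name, size in sizes.items():
    pct = percent(size)
    points = pct - baseline
    # One percentage point on 100 GB of uncompressed isos is 1 GB.
    print(f"{name}: {pct:.2f}% ({points:+.2f} points, "
          f"{points:+.2f} GB per 100 GB)")
```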

I should've known something was funny with that strong decrease, oops.

-[Unknown]

Original comment by unknownb...@gmail.com on 2 Nov 2014 at 6:12

GoogleCodeExporter commented 8 years ago
Based on some basic tests, performance of lz4 vs deflate seems mostly as 
expected.

LZ4 version: https://github.com/Cyan4973/lz4/commit/c0054caa
Deflate version: 6.60 sceKernelDeflateDecompress

I sampled a small assortment of blocks, which of course are not exactly the 
same size but were reasonably close.

Timings are reasonably stable, and lz4 takes about 20% as long as deflate to 
decompress at a block size of 2048.  Even where deflate did best, lz4 took 35% 
of its time.

That being said, we're talking about ~60us (lz4) vs ~300us (deflate) per block.  
I pulled out my "Mark2" Sony brand Memstick, and I get ~2800us per 8KB, ~2100us 
for 2KB, per read.  I get faster reads (~1000us for 8KB, ~700us for 2KB) even 
over usbhostfs.  My class 10 performs only a little better than the Mark2, at 
~2200us and ~1600us respectively.

So, for a 2KB read, lz4 can improve at most 15%.
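
A back-of-the-envelope version of that estimate, as a sketch. It assumes each 2KB block costs one memory stick read followed by one decompression, with no overlap or caching. That model is an assumption for illustration and not necessarily how the 15% figure was derived, and the names are made up.

```python
DEFLATE_US = 300  # ~300us to decompress one block with deflate
LZ4_US = 60       # ~60us to decompress one block with lz4

# Measured 2KB read times from above, in microseconds.
READ_2KB_US = {"Mark2 Memstick": 2100, "class 10": 1600}


def saving_percent(read_us):
    """Percent of total per-block time saved by switching deflate -> lz4."""
    before = read_us + DEFLATE_US
    after = read_us + LZ4_US
    return 100 * (before - after) / before


for card, read_us in READ_2KB_US.items():
    print(f"{card}: {saving_percent(read_us):.1f}% faster per 2KB block")
```

Under those assumptions both memory sticks come out below the 15% ceiling, since the read itself dominates the per-block time.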

In comparison, reading from the umd is of course slow.  About ~3000us per 
random 2kb read after spin up (haven't tested 8kb.)  Cached reads are hard to 
measure due to the dcache, but they appear to be in the ~300us range (which is 
notably faster than my ms even for repeated reads.)

I'm not sure if there's a cache that cso reads hit before they hit the cso 
file; I assume so.

So, as expected, reducing the read count is sure to have helped much more 
significantly than the compression format, which is great.

That being said, lz4 is faster so it's not a bad thing.  For larger reads, it 
can gain more performance.  Not sure what the codesize cost is (memory.)

-[Unknown]

Original comment by unknownb...@gmail.com on 9 Nov 2014 at 8:48

GoogleCodeExporter commented 8 years ago
https://code.google.com/p/procfw/source/detail?r=4bbef137299d2927ce96f7900a2b001e2ccabdff

Original comment by devnonam...@gmail.com on 16 Dec 2014 at 10:25