itkach / slob

Data store for Aard 2
GNU General Public License v3.0

fix splitting #10

Closed · goblin closed this 8 years ago

goblin commented 8 years ago

I've just tried to convert enwiki to use a 128MB bin size (which seems more reasonable than the tiny default 384k), and after waiting about 8 hours for the "Mapping blob to keys" step I noticed the script started creating tens of thousands of small, few-kB .slob files. This defeated the purpose of having a larger bin size.

I believe this PR might fix the issue. I didn't pass any -s argument since I wanted one large output file, but it seems that by default the variable split gets initialized to -1 (on line 1968). If I'm reading it correctly, line 2037 is meant to test whether the user asked to split the file. Since split == -1 by default, that test pretty much always matches and the output always gets split (-1 is a truthy value in Python).
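To illustrate what I mean (a minimal sketch; the names are made up and this isn't the actual slob.py code):

```python
# Illustrative sketch of the suspected failure mode; variable names are
# invented, this is not the actual slob.py code.
split = -1  # default when no -s argument is given ("don't split")

current_size = 10 * 1024  # any positive output size

# A truthiness test treats -1 as "splitting requested":
if split:                   # bool(-1) is True in Python
    print("split requested?")   # fires even though -s was never passed

# A size comparison against -1 is always true as well:
if current_size > split:    # anything >= 0 is greater than -1
    print("start a new output file")   # so every flush starts a new .slob
```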

itkach commented 8 years ago

I've just tried to convert enwiki to use a 128MB bin size (which seems more reasonable than the tiny default 384k),

I'm not sure why you say 128MB makes sense at all. A bin is several content blobs compressed together (a single blob, such as one Wikipedia article, is typically not large enough to achieve a good compression ratio). Whenever a blob needs to be extracted, its entire bin is decompressed. Having a bin size of 128MB means you'd be decompressing a pretty chunky archive, inflating it potentially 8-10x (in your phone's memory, for example) just to access one item in it. Now THAT is defeating the purpose of this whole thing for sure.
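Roughly speaking, the model is something like this (just a sketch to illustrate the point, not slob's actual format or code):

```python
import lzma

# Rough sketch of the bin/blob idea described above; not slob's actual
# on-disk format, just an illustration of why bin size matters.
articles = [b"article one ...", b"article two ...", b"article three ..."]

# A "bin" is several blobs concatenated and compressed as one unit.
offsets, buf = [], bytearray()
for blob in articles:
    offsets.append((len(buf), len(blob)))
    buf.extend(blob)
bin_compressed = lzma.compress(bytes(buf))

# To read a single blob, the whole bin has to be decompressed first,
# so a 128MB bin could inflate to several hundred MB in memory.
def read_blob(index):
    data = lzma.decompress(bin_compressed)  # whole bin, not just one blob
    start, length = offsets[index]
    return data[start:start + length]

print(read_blob(1))  # b"article two ..."
```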

goblin commented 8 years ago

OK, sorry, I should've specified it's perhaps a matter of taste.

The reasoning behind a much larger bin size is reducing the size of the slob at the cost of decompression time. I'd be happy to wait 3-5 seconds for an article to be decompressed on my phone if I could cut the archive down from 12GB to 9GB (I haven't checked the exact numbers yet, but it certainly seems doable).

384kB is much smaller than the default LZMA dictionary size (8MiB AFAIK), and it feels like LZMA doesn't quite get enough data to really shine. The ideal bin size is probably somewhere between the two extremes.
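For reference, the dictionary size can be set explicitly through Python's lzma filter chain (just an illustrative snippet, not how slob configures its compressor):

```python
import lzma

# With a 384kB bin, most of an 8MiB dictionary goes unused; a larger bin
# lets the dictionary exploit redundancy across many articles.
filters = [{"id": lzma.FILTER_LZMA2, "dict_size": 8 * 1024 * 1024}]
data = b"some bin contents " * 10000
compressed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
print(len(data), "->", len(compressed))
```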

I don't agree that you'd have to decompress the entire bin into memory - you could just discard data as you go, only keeping the required article. Sure, it'd cost extra CPU, but the memory requirement shouldn't grow with bin size.
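Something along these lines is what I have in mind -- a sketch that assumes the blob's offset and length inside the uncompressed bin are already known, not slob's actual API:

```python
import lzma

def extract_blob(path, blob_offset, blob_length, chunk_size=64 * 1024):
    """Stream-decompress a bin, keeping only the bytes of one blob.

    Decompressed data outside [blob_offset, blob_offset + blob_length)
    is discarded immediately, so memory use stays bounded by chunk_size
    plus the size of the single blob, regardless of bin size.
    """
    decomp = lzma.LZMADecompressor()
    out, pos = bytearray(), 0
    end = blob_offset + blob_length
    with open(path, "rb") as f:
        while not decomp.eof and pos < end:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = decomp.decompress(chunk)
            lo = max(blob_offset - pos, 0)
            hi = min(end - pos, len(data))
            if hi > lo:
                out.extend(data[lo:hi])
            pos += len(data)
    return bytes(out)
```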

It's an interesting discussion and I'll certainly be running more tests to verify my claims :-)

However the original PR is just related to the unexpected splitting of the resulting slob.

itkach commented 8 years ago

I don't agree that you'd have to decompress the entire bin into memory

Maybe you don't have to, but this is how the current implementation works, so a sufficiently large bin (way smaller than 128MB) will result in running out of memory.

Even if memory was not an issue, time still is. You may be willing to wait many seconds, but to me this is unacceptable and unusable.

goblin commented 8 years ago

Even if memory was not an issue, time still is. You may be willing to wait many seconds, but to me this is unacceptable and unusable.

Sure, fair enough -- I really didn't mean to advocate changing the default bin size; it was just a comment saying it's a bit too small for my taste.

itkach commented 8 years ago

I see the issue with split value handling. I think I'd like to fix it in a slightly different way though - just get rid of -1 when converting split size to bytes.
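Roughly this shape (illustrative only, not the actual change -- it assumes -s is given in MB):

```python
# Illustrative only; this is not the actual change from the commit.
# Instead of keeping -1 as a "don't split" sentinel when the size is
# converted to bytes, treat "no split" as an unbounded limit so a check
# like `current_size > max_size` never triggers by accident.
def split_to_bytes(split_mb):
    if split_mb is None or split_mb <= 0:
        return float('inf')           # no -s given: never split
    return split_mb * 1024 * 1024     # assumes -s is given in MB
```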

goblin commented 8 years ago

No problem, feel free to fix whichever way you like :-) The PR was only meant as a suggestion of the fix.

itkach commented 8 years ago

Fixed in 504f280ab8c5d5397d74928becdc2c8c97937977