Antar1011 opened this issue 8 years ago

@Zarel, I know you're concerned about both disk space and I/O bottlenecks. So why not gzip (or bzip2 or xz) each log before writing? It's not going to give you the best compression (compared to batch-compressing a bunch of logs at once), but I just tested gzipping a random log, and it reduced the file size from 5211 B to 1649 B (so about 3x).

I'd rather batch them all at once.
The problem is, unless you're caching them in memory, batch-compressing doesn't save you disk I/O. You might want to do both: gzip on generation, then have a nightly job that reads the files into memory and compresses them en masse.
I know, but it saves disk space, which is way more important. 90% compression is a lot better than 60% compression.
You should be able to do both, and my guess is that if you do it right, it'd be more performant than the nightly compression alone: I started gzipping the intermediate files for the Smogon Usage Stats scripts not to save space, but because compressing and writing shorter files turned out to be faster than just writing the uncompressed files.
Obviously you don't want to be compressing already compressed files (although I wouldn't be surprised if one of the compression libraries handles that intelligently if it's used for both file-level and tar-level compression), but you shouldn't have to uncompress them to disk before recompressing.
Not sure what libraries are out there for Node, but in Python you can `gzip.open` the files and read them into a string buffer, then add them as files to a compressed tar archive using the `tarfile` library.
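Roughly along these lines (an untested sketch; the paths are placeholders):

```python
# Untested sketch: repack per-file gzipped logs into one .tar.gz
# without writing uncompressed copies back to disk.
import gzip
import io
import tarfile
from pathlib import Path

logs_dir = Path("logs/2017-02/ou")             # placeholder layout
archive_path = Path("logs/2017-02/ou.tar.gz")

with tarfile.open(archive_path, "w:gz") as tar:
    for gz_path in sorted(logs_dir.glob("*.log.gz")):
        with gzip.open(gz_path, "rb") as f:    # decompress in memory only
            data = f.read()
        info = tarfile.TarInfo(name=gz_path.stem)  # e.g. "battle-ou-1.log"
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
```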
Let me know if you want me to put together a POC demonstration.
That sounds like it'd be significantly slower than the command-line utility...
You can try it if you'd like.
On second thought, this sounds like a good idea.
Update on this. I've long had the idea of using preset dictionaries to get the best of both worlds: a high compression ratio for ~16 KB files and immediate compression on write. Now I've gotten around to doing some testing.
I took my February battles dataset and used it for two different sets of tests: one dealing with the whole month's data, and the other only with the OU tier. For preset dictionaries, all the training was done with the corresponding January data.
For the baseline and top compression-ratio targets, I found the following size reductions:
Ratio | Method | Source data |
---|---|---|
-67.5% | GZip per file at Level 6 | All tiers |
-67.8% | GZip per file at Level 7+ | All tiers |
-87.4% | Tar+GZip at default Level (9) | All tiers |
-97.8% | Tar+LZMA at default Level (probably max, slow as fuck) | All tiers |
-86.5% | Tar+GZip at default Level (9) | OU |
-93.0% | Tar+LZMA at default Level (probably max, slow as fuck) | OU |
Now, enter preset dictionaries. The most viable implementation is rolling a different dictionary on a month-by-month basis, trained on the previous month's data, to keep up with tier shifts, which change the metagame state and the common strings. There are two ergonomic tools for that, as far as I can tell.
One is CloudFlare's Dictator, which generates Deflate dictionaries.
The other is Facebook's built-in training mode for the Zstandard compression format, for which I have verified there is an appropriate Python package (compatible with Antar's side).
Ratio | Method | Dictionary | Data | Requirements |
---|---|---|---|---|
-69.6% | GZip | 16 KB (Level 4) | All tiers | Go (Dictator)
-68.7% | zstd | 112 KB (Level 3?) | All tiers, Dictionary sample=100 | zstd CLI
-76.6% | GZip | 24 KB (Level 4) | OU | Go (Dictator)
-73.9% | zstd | 72 KB (Level 3?) | OU, Dictionary sample=100 | zstd CLI
So, I propose using per-tier monthly dictionaries. This proposal turns out to land exactly in between the 60% and 90% compression rates mentioned above, sitting at a ~75% rate, while still compressing every file on write.
Accounting for the log saving changes that would be required for speed and resilience, I have found Zstandard far easier to work with than Dictator+GZip. However, the project might still want to go with Dictator if Go is still on the roadmap.
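For reference, here is a rough, untested sketch of what the zstd side could look like with the `zstandard` Python package (the paths, sizes, and levels are placeholders):

```python
# Untested sketch of the monthly per-tier dictionary workflow with the
# `zstandard` package; paths, sizes, and levels are placeholders.
import zstandard
from pathlib import Path

# 1. Train a dictionary for a tier on the previous month's logs.
samples = [p.read_bytes() for p in Path("logs/2017-01/ou").glob("*.log")]
dict_data = zstandard.train_dictionary(112 * 1024, samples)
Path("ou-2017-02.dict").write_bytes(dict_data.as_bytes())

# 2. Compress each new log on write with that dictionary.
cctx = zstandard.ZstdCompressor(level=3, dict_data=dict_data)
log = Path("logs/2017-02/ou/battle-ou-12345.log")
compressed = log.parent / (log.name + ".zst")
compressed.write_bytes(cctx.compress(log.read_bytes()))

# 3. Reading a log back requires the same dictionary.
dctx = zstandard.ZstdDecompressor(dict_data=dict_data)
original = dctx.decompress(compressed.read_bytes())
```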
PS. Fuck markdown, why won't the tables render properly?
Worth noting that ext4 is often configured with a block size of 4KB, which means compressing files that are already smaller than 4KB doesn't save any disk space. On the other hand, compressing 5KB files saves 4KB of data.
Also, zstd compression is rather cheap as far as the CPU cost goes. This is likely worth it if we don't have too many files smaller than 4KB.
By the way, @Slayer95, you are missing `-|-|-|-|-` between the headers and the rows, which is why the table doesn't render. Yes, this `|` character is rather important.
> too many files smaller than 4KB

I have found about 3.5% of files are smaller than that.
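That kind of figure can be checked with a short script along these lines (a sketch with placeholder paths; `st_blocks` is in 512-byte units, which is what `du` counts):

```python
# Sketch with placeholder paths: fraction of logs under one 4 KB block,
# plus real on-disk usage (st_blocks is in 512-byte units, like du).
from pathlib import Path

paths = [p for p in Path("logs/2017-02").rglob("*.log") if p.is_file()]
small = sum(1 for p in paths if p.stat().st_size < 4096)
on_disk = sum(p.stat().st_blocks * 512 for p in paths)
print(f"{small / len(paths):.1%} under 4 KB; {on_disk / 2**30:.2f} GiB on disk")
```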
> log saving changes

The plan is to keep a `/$month/$tier/pending/` folder for the time window while the new dictionary is being generated, so we may keep that 3.5% of incompressible files there forever, or create a `/$month/$tier/raw/` folder for them if that approach causes issues.
PS1. Also, I have calculated file size reductions using `du -sh`, which as far as I can tell reports the real on-disk storage size.
> On the other hand, compressing 5KB files saves 4KB of data.

PS2. I don't think this is kosher... It could at most be saved in a single 4KB block, so it would be a save of at most 1KB. Really, on a per-file basis, the ~75%-plus file size reduction would only be accurate for 16KB+ files (which, as it turns out, is also their average size; it seems the log size distribution has a long tail or something).
> By the way, @Slayer95, you are missing `-|-|-|-|-` between the headers and the rows, which is why the table doesn't render. Yes, this `|` character is rather important.

PS3. Thanks, @xfix!
I'm not sure the compression improvement over Gzip is necessarily worth the complication of using dictionaries?
Can we revisit this? From:

> On second thought, this sounds like a good idea.

Originally posted by @Zarel in https://github.com/Zarel/Pokemon-Showdown/issues/2733#issuecomment-299694065
It seems like you were open to gzipping on write (per file), which is fairly trivial to implement, and the new stats processing framework already handles this. I can do it as soon as the new stats processing framework has been verified, if that's the route you're OK with us taking.
If we were going to compress at a higher level (directories per day or per month), I'd like us to consider a compression format that is amenable to reading individual files without decompressing the entire thing (i.e., I'm pretty sure you need to decompress the entire `.tar.gz` before you can read a file from it, as opposed to `.zip`, where you can read the listing and decompress specific files while leaving the rest compressed).
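To illustrate, a minimal Python sketch of that sort of random access into a `.zip` (the archive and member names are made up):

```python
# Sketch: read one log out of a .zip without extracting everything
# (archive and member names are made up).
import zipfile

with zipfile.ZipFile("logs/2017-02-ou.zip") as archive:
    print(archive.namelist()[:5])      # listing from the central directory
    with archive.open("battle-ou-12345.log") as member:
        log_text = member.read().decode("utf-8")
```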
As an aside, I still maintain that we should be paying for a cheap 'archival' file storage server (< $20/month on S3, but there are almost certainly better deals) just to serve as file storage, so space doesn't have to be a concern. In the past you mentioned "the problem is more, like, getting the logs from the server onto the data storage", but this should be fairly simple to solve with just a file watcher?
Yes, I'm open to gzipping on write; it was blocked on me not wanting to rewrite the stats processor, and also on "honestly, we should be using TokuDB or MyRocks or some other compressed write-optimized database instead".
The only other issue is that we have a battle log search feature that depends on the data being uncompressed. I think the ideal solution is still the write-optimized database.
If we can merge the battle log and replay databases, that might be ideal. I've been considering that for a long time. Then uploading a replay could just be a matter of setting a flag to "visible".
> Yes, I'm open to gzipping on write; it was blocked on me not wanting to rewrite the stats processor,

This wouldn't require a rewrite (though I did one anyway); it literally only requires a 1-3 line change in one of the files of the Python scripts (though I can definitely understand not wanting to dive into that codebase).
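To give a sense of the scale of that change, a sketch only, not the actual Smogon-Usage-Stats code:

```python
# Sketch only -- not the actual Smogon-Usage-Stats code. Reading a
# per-file-gzipped log is roughly a one-line change in the reader:
import gzip
import json

def read_log(path):
    # before: with open(path) as f:
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```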
> If we can merge the battle log and replay databases, that might be ideal. I've been considering that for a long time. Then uploading a replay could just be a matter of setting a flag to "visible".

Having the source of truth for stats be a unified log + replay database is very straightforward (or at least, I will refactor my design a little bit to abstract log listing/reading behind a `Storage` interface, so we can just write an adapter to switch our input source more easily going forward).
Less clear is whether we'd want to use the same database for processed logs and analysis, but it definitely shouldn't be an issue to dedupe storage so that we no longer store the raw JSON logs both in text files and in the replay database.
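As a sketch of the kind of `Storage` abstraction I have in mind (the names and methods below are illustrative, not the actual framework's API):

```python
# Illustrative sketch of a Storage abstraction for raw battle logs;
# names and methods are made up, not the actual framework's API.
import gzip
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterable


class Storage(ABC):
    """Source of raw battle logs, independent of where they live."""

    @abstractmethod
    def list_logs(self, month: str, tier: str) -> Iterable[str]: ...

    @abstractmethod
    def read_log(self, key: str) -> str: ...


class FlatFileStorage(Storage):
    """Reads (optionally gzipped) logs from the existing directory layout."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def list_logs(self, month: str, tier: str) -> Iterable[str]:
        yield from (str(p) for p in (self.root / month / tier).glob("*.log*"))

    def read_log(self, key: str) -> str:
        opener = gzip.open if key.endswith(".gz") else open
        with opener(key, "rt", encoding="utf-8") as f:
            return f.read()

# A ReplayDatabaseStorage adapter could implement the same interface on
# top of the merged log/replay database.
```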
As for switching the source database to a 'compressed, write-optimized database (that also has support for search?)', I'll punt on that discussion for now as well.
Anyway, once my stats processing is done, we can decide whether we should continue writing flat files (in which case I'll add compression on write per this bug), or whether we want to just write directly to the replay database (in which case I'll write a `Storage` adapter for processing stats from the database instead of from files).
If we hook up to a replay database, side servers will have problems logging. So the logging code would have to support both.