kensanata / mastodon-archive

Archive your statuses, favorites and media using the Mastodon API (i.e. login required)
https://alexschroeder.ch/software/Mastodon_Archive
GNU General Public License v3.0

Compress JSON #93

Open marc-fouquet opened 1 year ago

marc-fouquet commented 1 year ago

I have just tested this script for the first time, and the JSON file with the statuses is astonishingly huge, given that I only have a handful of toots. There is a lot of redundancy in there; even simple ZIP compression reduces the size by about 90%.

There could be an option --compress that transparently adds compression when loading and saving these files, using one of Python's built-in compression modules.
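For illustration, transparent save/load could be built on the standard library's gzip module. A minimal sketch (the save_data/load_data names are hypothetical, not the script's actual API):

```python
import gzip
import json

def save_data(path, data, compress=False):
    """Write JSON, optionally gzip-compressed (a ``.gz`` suffix is added)."""
    if compress:
        with gzip.open(path + ".gz", "wt", encoding="utf-8") as fp:
            json.dump(data, fp)
    else:
        with open(path, "w", encoding="utf-8") as fp:
            json.dump(data, fp, indent=2)

def load_data(path):
    """Read JSON from a plain or gzip-compressed file, chosen by extension."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fp:
        return json.load(fp)
```

Since gzip.open accepts text mode, the JSON round-trip is otherwise identical to the uncompressed path.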

kensanata commented 1 year ago

I agree.

lapineige commented 1 year ago

It doesn't matter much given the small size of the resulting text file, but may I advocate for a better compression algorithm than ZIP? I'm thinking of Zstandard: it's widely supported now (though perhaps less so outside Linux?), with very fast compression/decompression and a very good compression ratio. But anything else is fine :)
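For what it's worth, Python's standard library already ships gzip, bz2, and lzma, so a quick ratio comparison needs no extra dependency (Zstandard itself would require the third-party zstandard package). A rough sketch on synthetic, repetitive JSON of roughly the shape API responses have:

```python
import bz2
import gzip
import json
import lzma

# Repetitive pretty-printed JSON, loosely mimicking a list of API responses
# that each embed the full account record of the author.
data = json.dumps(
    [{"account": {"id": 1, "acct": "alice"}, "content": "hello"}] * 500,
    indent=2,
).encode()

for name, compress in [("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    print(f"{name}: {len(data)} -> {len(compress(data))} bytes")
```

On redundant data like this, even gzip easily exceeds the ~90% reduction mentioned above; the exact ranking between algorithms depends on the data and the compression level chosen.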

kensanata commented 1 year ago

Whoever implements it gets to decide. 😄

lapineige commented 1 year ago

Oh ok, for some reason I thought you were going to do it :sweat_smile:

I actually don't care much, I store it compressed (filesystem compression using btrfs) anyway.

For the record, a ~1GB archive is compressed to around 100MB (using zstd), which makes quite a big difference :slightly_smiling_face:

kensanata commented 1 year ago

It sure does! I'm basically just storing the results of the Mastodon client calls, so every response contains all the account info of the author, if I remember correctly. And it's all pretty-printed. So compression definitely helps!

As for myself, I’m just not courageous enough to run a non-standard file system. Ext4 forever, I guess. 😂

lapineige commented 1 year ago

That shouldn't matter here anyway; it would be great if the JSON were stored compressed regardless. I will see if I have time to implement this… Don't be too hopeful :sweat_smile:

kensanata commented 1 year ago

I wonder whether this should be optional (or automatic: detect if a .gz variant already exists, and if it does, use that). I don't have a compressed filesystem, but if I had, I'm assuming I wouldn't want to have the data recompressed?
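The automatic detection described above could be as simple as checking for the .gz variant before falling back to the plain file. A sketch (load_statuses is a hypothetical name):

```python
import gzip
import json
import os

def load_statuses(path):
    """Prefer an existing ``.gz`` variant over the plain JSON file."""
    if os.path.exists(path + ".gz"):
        # A compressed copy already exists, so keep using it.
        with gzip.open(path + ".gz", "rt", encoding="utf-8") as fp:
            return json.load(fp)
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)
```

With this scheme, users who never opt in to compression see no change, while anyone who gzips the archive once keeps getting the compressed variant on subsequent runs.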

lapineige commented 1 year ago

I'm not sure it's a big deal. And most of the time the filesystem detects that the data is already compressed (in fact, not compressible further) and skips it.

Also, it's quite a rare use case.