sqlitebrowser / dbhub.io

A "Cloud" for SQLite databases. Collaborative development for your data. 😊
https://dbhub.io
GNU Affero General Public License v3.0

[Suggestion] Upload / Download Compressed Databases #92

Open chrisjlocke opened 7 years ago

chrisjlocke commented 7 years ago

Just a thought, but if I wanted a 68 MB database, that could be a long transfer. Uploading a 68 MB database is even slower - uploads are (normally) slower than downloads. SQLite databases, by their nature, are relatively text-heavy. Is there any way a database can be zipped (or a Linux equivalent) prior to uploading/downloading? Uploading an 18 MB .zip would be easier than a bloaty 68 MB database... but there is a barrowload of work to do 'server side' to accept ZIPs... plus the potential for trouble... Anyway. Thought I'd throw it out there.

justinclift commented 7 years ago

Ahhh, yeah. That's a good idea too.

With DB4S, it shouldn't be too hard to get that happening. I just tried compressing the DB4S download stats database (68MB), using bzip2 -9, and that compressed down to 12MB.

With browser uploads though, that might take some more investigation. It'll likely have something to do with negotiation between the browser and server, such that the browser knows it can compress uploads on the fly. Do you have any interest in investigating that? :smile:
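
For reference, the server side of that negotiation could look roughly like the Go sketch below. It's illustrative only, not actual DBHub.io code - the handler name, route, and size limit are all made up:

package main

import (
    "compress/gzip"
    "io"
    "net/http"
)

// uploadHandler is a hypothetical upload endpoint which accepts either a
// plain database body or one the client compressed with gzip and flagged
// via the Content-Encoding header.
func uploadHandler(w http.ResponseWriter, r *http.Request) {
    var body io.Reader = r.Body
    if r.Header.Get("Content-Encoding") == "gzip" {
        gz, err := gzip.NewReader(r.Body)
        if err != nil {
            http.Error(w, "invalid gzip stream", http.StatusBadRequest)
            return
        }
        defer gz.Close()
        body = gz
    }

    // Cap the amount of uncompressed data accepted (example limit only)
    limited := io.LimitReader(body, 100*1024*1024)

    // ... hand 'limited' to the normal database upload path ...
    _ = limited
    w.WriteHeader(http.StatusOK)
}

func main() {
    // Placeholder wiring
    http.HandleFunc("/upload", uploadHandler)
    _ = http.ListenAndServe(":8080", nil)
}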

chrisjlocke commented 7 years ago

> such that the browser knows it can compress uploads on the fly.

I didn't know if it would be easier allowing users to upload .zip files, rather than adding an extra layer of browser incompatibility... someone might be using an older browser, causing weird freaky errors. But if they just threw up a .zip, dbhub.io could uncompress that.

I'll have an investigation though. Will rummage around...

justinclift commented 7 years ago

Oops. Oh yeah, uploading .zips should be easy to implement. We'd want to be careful so that the unzip code checks for zip bombs first, but apart from that it should be workable. :smile:

justinclift commented 7 years ago

From a quick look, protecting against zip bombs shouldn't be too hard.

Something along the lines of:

  1. Read the entries in the .zip file
  2. Ensure there's only one database file in there
    • Hmmm, if there's a README or similar that'd make for an easy way to get meta-data filled out
  3. Ensure the sizes of the entries aren't "over size" according to whatever limits are set for the server at the time (rough sketch below)
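
A very rough Go sketch of those first-pass checks, using the standard archive/zip package. The function name and the 100MB limit are placeholders, and (as the later comments point out) the declared sizes can't be fully trusted, so this is only a pre-check:

package main

import (
    "archive/zip"
    "fmt"
    "log"
    "os"
)

// checkUploadedZip does a first-pass sanity check on an uploaded .zip:
// it should contain at most a database file plus an optional README, and
// the declared uncompressed sizes must stay under the server's limit.
// Declared sizes can be forged, so this is only a pre-check.
func checkUploadedZip(path string, maxBytes uint64) error {
    r, err := zip.OpenReader(path)
    if err != nil {
        return err
    }
    defer r.Close()

    if len(r.File) > 2 { // one database + optional README
        return fmt.Errorf("too many entries in archive: %d", len(r.File))
    }

    var total uint64
    for _, f := range r.File {
        total += f.UncompressedSize64
        if total > maxBytes {
            return fmt.Errorf("archive claims to expand past %d bytes", maxBytes)
        }
    }
    return nil
}

func main() {
    // Usage: go run precheck.go some-upload.zip
    if err := checkUploadedZip(os.Args[1], 100*1024*1024); err != nil {
        log.Fatal(err)
    }
    fmt.Println("Pre-check passed")
}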

justinclift commented 7 years ago

Some variation of this looks like it could be useful for generating test data to make sure things are ok:

    https://github.com/abdulfatir/ZipBomb

justinclift commented 7 years ago

Hmmm, apparently the size headers for entries can be mucked around with to give incorrect sizes. 😦

chrisjlocke commented 7 years ago

You have to admire people who 'break things' so well. The makers of GTA were always impressed people had done things with the game they'd never thought of. Creating a 30 KB file that uncompresses to such a huge size.... argh! You have to code everything so tightly to ensure the cheeky gits can't worm their way in.

justinclift commented 7 years ago

Yep. :wink:

For a naive first implementation, streaming the uncompressed data to nowhere on a first pass and just counting the output size would probably work, with a limiter that stops somewhere useful (100MB is the current max per upload on the server). If it finishes ok without hitting the limit, then do the uncompress for real.
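
A minimal Go sketch of that counting pass, assuming the upload is a .zip handled with the standard archive/zip package. Names and the limit are placeholders, not the real implementation:

package main

import (
    "archive/zip"
    "fmt"
    "io"
    "log"
    "os"
)

// entryFitsLimit decompresses a single zip entry to nowhere, counting the
// bytes as it goes, and reports whether the real uncompressed size stays
// within maxBytes.  The declared size in the header is never trusted, and
// reading stops as soon as the limit is crossed.
func entryFitsLimit(f *zip.File, maxBytes int64) (bool, error) {
    rc, err := f.Open()
    if err != nil {
        return false, err
    }
    defer rc.Close()

    // Read at most maxBytes+1 so "exactly at the limit" can be told apart
    // from "over the limit" without decompressing the whole bomb.
    n, err := io.Copy(io.Discard, io.LimitReader(rc, maxBytes+1))
    if err != nil {
        return false, err
    }
    return n <= maxBytes, nil
}

func main() {
    // Usage: go run count.go some-upload.zip
    r, err := zip.OpenReader(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer r.Close()

    for _, f := range r.File {
        ok, err := entryFitsLimit(f, 100*1024*1024) // example 100MB limit
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%s within limit: %v\n", f.Name, ok)
    }
}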

chrisjlocke commented 7 years ago

You'd still have to be careful (or delete files after). If it stopped after 100 MB, Mr Malicious Git could perform 20 simultaneous uploads, wiping out disk space, etc.

I tried downloading the ZIP file on that GitHub site, but it uncompressed to 10000000000 GB... ;)

justinclift commented 7 years ago

Yep. We'll figure it out. :smile:

justinclift commented 7 years ago

Just tried out a few compression formats, as potentials for providing to people via the download button.

The source data is the ~710MB Backblaze drive stats database (2013 data only):

gzip

$ time gzip -9k backblaze-drive_stats.db

real    0m28.655s
user    0m26.870s
sys 0m0.799s

$ ls -lh backblaze-drive_stats.db.gz
-rw-r--r--. 1 jc jc 93M Aug 31 15:34 backblaze-drive_stats.db.gz

zip

$ time zip -9 backblaze-drive_stats.db.zip backblaze-drive_stats.db
  adding: backblaze-drive_stats.db (deflated 87%)

real    0m24.254s
user    0m23.962s
sys 0m0.244s

$ ls -lh backblaze-drive_stats.db.zip 
-rw-rw-r--. 1 jc jc 93M Sep  4 14:25 backblaze-drive_stats.db.zip

bzip2

$ time bzip2 -9k backblaze-drive_stats.db 

real    1m40.247s
user    1m38.432s
sys 0m0.594s

$ ls -lh backblaze-drive_stats.db.bz2 
-rw-r--r--. 1 jc jc 65M Aug 31 15:34 backblaze-drive_stats.db.bz2

xz (multi-threaded mode on an old quad core i5-750)

$ time xz -k -9e -T 0 -vv backblaze-drive_stats.db
xz: Filter chain: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=normal,nice=273,mf=bt4,depth=512
xz: Using up to 4 threads.
xz: 4,997 MiB of memory is required. The limiter is disabled.
xz: Decompression will need 65 MiB of memory.
backblaze-drive_stats.db (1/1)
  100 %        41.5 MiB / 713.0 MiB = 0.058   3.1 MiB/s       3:51             

real    3m51.213s
user    13m19.875s
sys 0m4.855s

$ ls -lh backblaze-drive_stats.db.xz
-rw-r--r--. 1 jc jc 42M Aug 31 15:34 backblaze-drive_stats.db.xz

xz (single threaded mode on the same old quad core i5-750)

$ time xz -k -9e -vv backblaze-drive_stats.db
xz: Filter chain: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=normal,nice=273,mf=bt4,depth=512
xz: 674 MiB of memory is required. The limiter is disabled.
xz: Decompression will need 65 MiB of memory.
backblaze-drive_stats.db (1/1)
  100 %        41.3 MiB / 713.0 MiB = 0.058   909 KiB/s      13:23             

real    13m23.504s
user    13m21.292s
sys 0m0.761s

$ ls -lh backblaze-drive_stats.db.xz
-rw-r--r--. 1 jc jc 42M Aug 31 15:34 backblaze-drive_stats.db.xz

p7zip (*nix version of 7zip)

$ time 7za a -t7z -m0=lzma2 -mx=9 -mmt=2 -mfb=64 -md=64m -ms=on backblaze-drive_stats.db.7z backblaze-drive_stats.db

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Core(TM) i5 CPU         750  @ 2.67GHz (106E5),ASM)

Scanning the drive:
1 file, 747593728 bytes (713 MiB)

Creating archive: backblaze-drive_stats.db.7z

Items to compress: 1

Files read from disk: 1
Archive size: 50809124 bytes (49 MiB)
Everything is Ok

real    4m52.235s
user    6m50.903s
sys 0m10.461s

$ ls -lh backblaze-drive_stats.db.7z
-rw-rw-r--. 1 jc jc 49M Sep  4 15:46 backblaze-drive_stats.db.7z

Compression-wise, xz in multi-threaded mode looks like the winner.

chrisjlocke commented 7 years ago

Never heard of xz (but I'm on Windows, so .ZIP is god!). 7zip is marching ahead, but .rar is semi-god. I presume this depends on the OS... someone isn't going to install something just to upload a database.

justinclift commented 7 years ago

Yeah. In the *nix world "gzip" (not compatible with .zip) and "bzip2" are very widespread. "xz" uses the same compression family as 7zip (LZMA2), but a different on-disk format.

For upload, I guess we should support .zip at the very least. For downloading, I'm thinking ".zip" (widespread) and "xz" (for maximum compression option).

iKlsR commented 7 years ago

See https://www.blender.org/download/ as an example of how you could provide formats. I usually go for a good .tar.bz2 when available and fall back to zip, which should always be available. As Chris says, nobody is going to install 7zip just to get a 1 MB super-compressed file faster when the 10 MB zip is there.

justinclift commented 7 years ago

@iKlsR Ahhh, that seems like a reasonable approach. Thanks. :smile:

justinclift commented 4 years ago

@MKleusberg @chrisjlocke As a data point, I just added gzip compression support to the DB4S connection handler code running on DBHub.io.

It seems to be working fine, and it also seems to speed transfers up a lot. :smile:
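
For anyone following along, transparent gzip on the response side generally has this kind of shape. This is a hand-rolled sketch with placeholder routes, not the actual DBHub.io handler code:

package main

import (
    "compress/gzip"
    "net/http"
    "strings"
)

// gzipResponseWriter sends everything written to it through a gzip.Writer.
type gzipResponseWriter struct {
    http.ResponseWriter
    gz *gzip.Writer
}

func (w gzipResponseWriter) Write(b []byte) (int, error) {
    return w.gz.Write(b)
}

// withGzip wraps a handler so the response body is gzip compressed whenever
// the client advertises support.  Content-Length is dropped, because the
// compressed size isn't known until the stream finishes.
func withGzip(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
            next.ServeHTTP(w, r)
            return
        }
        w.Header().Set("Content-Encoding", "gzip")
        w.Header().Del("Content-Length")
        gz := gzip.NewWriter(w)
        defer gz.Close()
        next.ServeHTTP(gzipResponseWriter{ResponseWriter: w, gz: gz}, r)
    })
}

func main() {
    // Placeholder handler that just streams a fixed string
    db := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("pretend this is a database download"))
    })
    http.Handle("/download", withGzip(db))
    _ = http.ListenAndServe(":8080", nil)
}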

MKleusberg commented 4 years ago

It does, however, look like the progress bar in DB4S doesn't work anymore. Probably because the download size isn't known in advance.

justinclift commented 4 years ago

Oh. Damn. We might need to use curl or something to capture the response headers and see what's being transferred. I was under the impression it'd be "transparent" so wouldn't change anything. :frowning:
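
Something like the Go sketch below would show what's actually coming back over the wire (the URL is a placeholder; curl -v against the real endpoint would give the same information):

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Placeholder URL - point this at a real download endpoint to test
    req, err := http.NewRequest("GET", "https://dbhub.io/REPLACE/with/a/real/download/url", nil)
    if err != nil {
        log.Fatal(err)
    }
    // Setting Accept-Encoding ourselves stops Go's transport from
    // transparently decompressing, so the headers arrive as sent.
    req.Header.Set("Accept-Encoding", "gzip")

    resp, err := http.DefaultTransport.RoundTrip(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    fmt.Println("Content-Encoding: ", resp.Header.Get("Content-Encoding"))
    fmt.Println("Content-Length:   ", resp.ContentLength) // -1 means unknown
    fmt.Println("Transfer-Encoding:", resp.TransferEncoding)
}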

justinclift commented 4 years ago

As an alternative thought, instead of doing streaming compression it might be feasible to gzip compress the requested database completely first, then transfer the compressed file.

That might allow the transfer to give an accurate count of the # of bytes expected to be sent over the wire.

For small databases (eg most of them), the difference in start time for the download wouldn't be noticeable. Larger ones though... it probably would be.

We should probably investigate what info is actually being sent already first, and in what order, just in case there's a way for the streaming approach to work. :smile:
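
If the compress-first approach wins out, the rough shape might be something like this sketch - placeholder paths and routes only. It gzips to a temporary file first, so the exact compressed size is known before the transfer starts:

package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "net/http"
    "os"
)

// sendCompressed gzips the requested database into a temporary file first,
// then serves that file with an accurate Content-Length, so the client has
// a real total (in on-the-wire bytes) for its progress bar.
func sendCompressed(w http.ResponseWriter, dbPath string) error {
    src, err := os.Open(dbPath)
    if err != nil {
        return err
    }
    defer src.Close()

    tmp, err := os.CreateTemp("", "dbhub-gzip-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name())
    defer tmp.Close()

    // First pass: compress the whole database to disk
    gz := gzip.NewWriter(tmp)
    if _, err := io.Copy(gz, src); err != nil {
        return err
    }
    if err := gz.Close(); err != nil {
        return err
    }

    // Now the compressed size is known, so it can be sent up front
    info, err := tmp.Stat()
    if err != nil {
        return err
    }
    if _, err := tmp.Seek(0, io.SeekStart); err != nil {
        return err
    }
    w.Header().Set("Content-Encoding", "gzip")
    w.Header().Set("Content-Length", fmt.Sprint(info.Size()))
    _, err = io.Copy(w, tmp)
    return err
}

func main() {
    // Placeholder wiring: serve a hypothetical test.db on /download
    http.HandleFunc("/download", func(w http.ResponseWriter, r *http.Request) {
        if err := sendCompressed(w, "test.db"); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
        }
    })
    _ = http.ListenAndServe(":8080", nil)
}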