itchio / butler

🎩 Command-line itch.io helper

How about signature checks after "butler cp" is done? #123

Closed. LennardF1989 closed this issue 7 years ago

LennardF1989 commented 7 years ago

My current game has a "roll-your-own" type of game update download system (also with patches based on the rsync algorithm), and while this works, I really want to phase it out and simply not support it anymore. So I'm browsing around the code a little, trying to see if I can wrap my own launcher around butler. Ultimately I want to have my game up in the itch.io client, but also in my own launcher just for this particular game, basically wrapping some of the download/patch functionality the itch.io client would otherwise provide.

That said, from my experience with thousands of players having downloaded my games over multiple iterations of the update system, I'm curious why butler doesn't perform any integrity checks post-download. I actually tested this by running "butler cp --resume", cancelling it halfway through, purposely messing up the file, then continuing the download. As silly as this may seem, this is an actual problem that occurs a lot!

I think butler would benefit from having a sort of signature check post-download (like during verify/heal), and then, when applicable, re-downloading the parts that got corrupted. Especially in my case, where the game comes close to 700MB in zipped form, having a corrupted zip on your hands and having to redownload the whole thing and hope for the best is problematic for a lot of people.

While this is probably easily added to butler itself, the itch.io API would obviously need to be updated to support this.

fasterthanlime commented 7 years ago

@LennardF1989 butler cp must work with all dumb HTTP servers. Some servers include MD5 hashes in response headers (like Google Cloud Storage), but they only tell you whether the whole zip is ok or the whole zip is bad (assuming you're downloading a .zip).

butler can do way better actually :)

if you do butler verify ${signature} ${dir} --heal ${archive}, butler will use the signature (hashes of all the blocks of all the files in the zip) to verify that the contents are ok, and if they're not, it'll heal them using the zip (downloading only what it needs).

Although, until we index zips, if one file in the zip is corrupted, it'll download that entire file from the zip (not the entire zip though).

Here's a demo: heal-demo
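To make the "hashes of all the blocks" idea a bit more concrete, here's a toy Go sketch of per-block hashing. It's purely illustrative: wharf's real signature format (what butler actually uses) is different, and the file name and block size are made up.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// blockHashes splits a file into fixed-size blocks and hashes each block.
// A verifier can later recompute these hashes on disk and re-fetch only
// the blocks that don't match.
func blockHashes(path string, blockSize int) ([][sha256.Size]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var hashes [][sha256.Size]byte
	buf := make([]byte, blockSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			hashes = append(hashes, sha256.Sum256(buf[:n]))
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return hashes, nil
}

func main() {
	// Hypothetical file and block size, for illustration only.
	hashes, err := blockHashes("game/data.bin", 64*1024)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d blocks hashed\n", len(hashes))
}
```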

LennardF1989 commented 7 years ago

Yes, that is exactly the process I imagined would happen after downloading :) However, instead of trying to validate the insides of the ZIP, I would rather make a signature of the ZIP itself too (on top of the usual one)! That way, even if the ZIP-file header is messed up enough to not load, we can still repair it.

Unfortunately, "butler sign" does not accept a ZIP as a file yet, because it will always treat it as an archive and signature the contents instead. If we modify the "butler sign" to signature ZIP as-is by means of a flag, we already solved part of the "issue" :)

EDIT: Basically the flow would be (pseudo-ish):

butler sign <zip-file>
//Upload <zip-file> to <zip-file-signature-url>
butler push <zip-file>
butler cp <zip-file-url> <output>
butler fetch <zip-file-signature-url>
butler verify <output> <zip-file-signature> --heal=<zip-file-url>
//Extract <output> to <extracted-folder>
butler verify <extracted-folder> <extracted-signature>

This is probably a thing that has to happen at the level of the itch.io client (download-ended?), but I'm posting it on butler as the two are obviously related.

fasterthanlime commented 7 years ago

> Yes, that is exactly the process I imagined would happen after downloading :) However, instead of trying to validate the insides of the ZIP, I would rather make a signature of the ZIP itself too (on top of the usual one)! That way, even if the ZIP-file header is messed up enough to not load, we can still repair it.

You misunderstand - in the verify command I issued above, the archive argument is a URL, not a downloaded zip file. It's only used if the signature (of the insides - also a URL) doesn't match what's on disk.

It works when there's nothing (first run in the demo), when everything's already there (second run), and when parts of it are corrupted (third run in the demo).

Signing the entire zip file is useless - in that case we don't even want to write the zip file to disk, we just access it remotely and directly extract what we need. If you're wondering how butler does that, here's the relevant part of the code: https://github.com/itchio/httpkit/blob/master/httpfile/httpfile.go
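If you're curious, the rough idea is something like the sketch below - not httpkit's actual code, just an illustration with a hypothetical URL and minimal error handling: wrap the remote resource in an io.ReaderAt backed by HTTP Range requests, and hand it to Go's archive/zip, which only reads the central directory and the entries you ask for.

```go
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"net/http"
)

// httpReaderAt exposes a remote file as an io.ReaderAt by issuing
// HTTP Range requests for exactly the bytes that are asked for.
type httpReaderAt struct {
	url  string
	size int64
}

func newHTTPReaderAt(url string) (*httpReaderAt, error) {
	// A HEAD request tells us the total size (assuming the server
	// reports Content-Length and honours Range requests).
	resp, err := http.Head(url)
	if err != nil {
		return nil, err
	}
	resp.Body.Close()
	return &httpReaderAt{url: url, size: resp.ContentLength}, nil
}

func (r *httpReaderAt) ReadAt(p []byte, off int64) (int, error) {
	req, err := http.NewRequest("GET", r.url, nil)
	if err != nil {
		return 0, err
	}
	// Ask only for the byte range we actually need.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+int64(len(p))-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return io.ReadFull(resp.Body, p)
}

func main() {
	// Hypothetical URL, for illustration only.
	r, err := newHTTPReaderAt("https://example.com/game.zip")
	if err != nil {
		panic(err)
	}
	// archive/zip parses the central directory at the end of the file,
	// after which individual entries can be opened without fetching
	// the rest of the archive.
	zr, err := zip.NewReader(r, r.size)
	if err != nil {
		panic(err)
	}
	for _, entry := range zr.File {
		fmt.Println(entry.Name, entry.UncompressedSize64)
	}
}
```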

LennardF1989 commented 7 years ago

Sorry, I'm mixing things up here and/or just explaining it very poorly. The commands you describe I had figured out myself prior to posting, and they all work fine and are pretty darn nifty :) I was actually referring to how the itch.io client handles a download.

Currently, the itch.io client executes a butler cp to download the large ZIP, before handing it to the download-ended.ts reactor and queuing it up for installation: https://github.com/itchio/itch/blob/master/src/reactors/downloads/perform-download.ts#L88 https://github.com/itchio/itch/blob/master/src/reactors/downloads/download-ended.ts#L57

My point in the initial post was that a large download, in my experience, is very sensitive to corruption. So I was wondering why you are not validating the ZIP itself using the same techniques as you would for any file inside the ZIP :) In theory that would mean you need 2 signature files: one to repair the downloaded ZIP when corruption occurs, and another one to repair things post-installation. That solves the issue where people would have to redownload the whole file if the corruption happens to be inside the ZIP header (preventing it from being opened at all, not to mention potential extraction issues).

The reason I ended up posting here, instead of on the itch.io client, is that butler currently does not facilitate signing and verifying a single file; it treats everything as an archive. I'm currently playing with WalkSingle on this line to also support that situation: https://github.com/itchio/butler/blob/master/sign.go#L26

fasterthanlime commented 7 years ago

So, the itch.io client does not utilize all of butler's smarts (except when explicitly asking for a repair), but ideally, instead of downloading a full zip to disk and then extracting it, it would just do a verify+repair, like I've shown you above (and then, as I mentioned, signing the zip becomes useless).

The reason the itch.io client is not doing that yet is that if you have a zip file that contains, for example, a few small files and one 2GiB file, and the repair stops in the middle of the 2GiB file, then butler's verify command will see that maybe only half of the 2GiB file is missing - but it will still have to fetch that entire 2GiB file from the remote zip (wasting a bunch of bandwidth).

The fix to that is to index zips (see https://github.com/itchio/butler/blob/master/index_zip.go ), so that repair can grab only parts of a file inside a remote zip.

This is all much more complex than what you originally suggested btw :) If all you want is your zip file to have a hash, you can easily generate it yourself, and check it after you run butler cp.

I'm also explaining poorly because I'm quite tired and it's a complicated subject, but hopefully you see what I mean!


To summarize:

You're talking about the scenario "download zip to disk, extract it to disk". You raise the issue "what if we resume the download and the beginning of the zip on disk is corrupted?" - which is a very valid question and is an issue for the itch.io client at the moment. Actually, butler does check for integrity (here) and redownloads if the zip is corrupted.

But this is a pretty poor solution. If it's a big zip, you have to redownload it all.

I'm talking about the scenario "use a signature to figure out what we need to extract from a remote zip". In an ideal world, we could treat the remote zip like a local folder, grabbing just the blocks we need from any file. In reality (and thanks to httpfile), we can access any file in the zip we want, but we can't skip the beginning of a file: if we want a file in the zip, we have to download it entirely. That's due to how DEFLATE streams work - most decoders aren't concerned with pausing/resuming decompression mid-stream.

The zip_index code I've linked you is a fork of golang's inflater (that decompresses DEFLATE streams) that allows resuming at "savepoints", so that we can extract just part of a file in a zip (whether it's remote or not).
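To illustrate why that's awkward, here's a small, self-contained example using Go's standard compress/flate (not butler's code): to reach an offset inside a DEFLATE stream, a regular inflater has to decompress and throw away everything before it.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

func main() {
	// Compress some sample data into a single DEFLATE stream.
	var compressed bytes.Buffer
	w, _ := flate.NewWriter(&compressed, flate.DefaultCompression)
	w.Write(bytes.Repeat([]byte("hello, deflate! "), 1000))
	w.Close()

	// To read bytes at some offset inside that stream, a standard
	// inflater has to decompress everything before the offset and
	// throw it away - there is no seeking within a DEFLATE stream.
	const offset = 8000
	r := flate.NewReader(bytes.NewReader(compressed.Bytes()))
	if _, err := io.CopyN(io.Discard, r, offset); err != nil {
		panic(err)
	}
	chunk := make([]byte, 16)
	if _, err := io.ReadFull(r, chunk); err != nil {
		panic(err)
	}
	fmt.Printf("bytes %d..%d: %q\n", offset, offset+len(chunk), chunk)
	// zip_index's trick is to record decoder state at regular
	// "checkpoints" so decompression can resume mid-stream instead.
}
```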

When that code ships, installing a game from itch.io will just be: fetch the signature, then verify + heal the install folder directly against the remote zip.

That means no temporary zip storage on disk. For repair, we just download the signature again; if everything checks out - good! If there are wounds, we just use the (remote) archive again - the zip is still never stored on disk.

Does that make sense to you?

LennardF1989 commented 7 years ago

Yep, it does, we were indeed talking about two things :) I was proposing a fix to the "poor solution" using butler's smarts; the way you would like to go is obviously much more desirable, which would make my solution only a slightly less poor solution compared to the current one :P

I do wonder though: if the ZIP format itself is not flexible enough for this purpose, why not just go with a custom package format where each file is individually compressed? I'm aware you lose the power of shared compression dictionaries, making the final size bigger, but from my own tests with very large files (going up to 4GB), the difference between compressing the package as a whole and compressing the files inside it individually is only a handful of MBs. That way, you can simply have a "Name/Offset/CompressedSize/DecompressedSize" kind of index outside the package for the purpose of validating and accessing files without needing the whole package.
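For illustration, the kind of index I have in mind could look something like this sketch (all names are hypothetical, not a format butler uses):

```go
// Package pkgindex is a purely illustrative sketch, not a real butler format.
package pkgindex

// IndexEntry records where one individually-compressed file lives inside
// the package, so a client can fetch and validate it in isolation.
type IndexEntry struct {
	Name             string // path of the file inside the package
	Offset           int64  // byte offset of the compressed blob in the package
	CompressedSize   int64  // bytes to fetch (e.g. via an HTTP Range request)
	DecompressedSize int64  // expected size after decompression, for validation
}

// Index lives outside the package itself, as described above.
type Index struct {
	Entries []IndexEntry
}

// Locate returns the byte range to request for a single file, so corruption
// in one entry never forces re-downloading the whole package.
func (idx *Index) Locate(name string) (start, end int64, ok bool) {
	for _, e := range idx.Entries {
		if e.Name == name {
			return e.Offset, e.Offset + e.CompressedSize, true
		}
	}
	return 0, 0, false
}
```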

PS. I was under the impression you were already downloading files in chunks of 4MB for the sake of faster healing, but after re-reading, you only said it would be useful, not that it's actually there already (https://news.ycombinator.com/item?id=13141268). My current understanding, though, is that you always need the full zip in order to heal. Meaning that even though we have a 16kB corruption on a 700MB download, it needs the whole thing to fix it. Is my understanding correct?

fasterthanlime commented 7 years ago

> which would make my solution only a slightly less poor solution compared to the current one :P

Actually (and I didn't realize that at the start of the thread), your suggested solution is the current behavior. It's just implemented in butler (in checks.go) rather than the itch app. And it only handles HTTP headers sent by Google Cloud Storage (although others could easily be added).
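In a nutshell, that check amounts to something like the sketch below - not the actual checks.go code, just an illustration assuming a Google-Cloud-Storage-style x-goog-hash header carrying a base64 md5 value, with a hypothetical URL and path:

```go
package main

import (
	"crypto/md5"
	"encoding/base64"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// checkDownload re-hashes a downloaded file and compares it with the MD5
// the server advertised in its x-goog-hash header (Google Cloud Storage
// style). Purely a sketch - not butler's actual checks code.
func checkDownload(url, localPath string) error {
	resp, err := http.Head(url)
	if err != nil {
		return err
	}
	resp.Body.Close()

	var expected string
	for _, part := range strings.Split(resp.Header.Get("x-goog-hash"), ",") {
		part = strings.TrimSpace(part)
		if strings.HasPrefix(part, "md5=") {
			expected = strings.TrimPrefix(part, "md5=")
		}
	}
	if expected == "" {
		return fmt.Errorf("server sent no md5, cannot verify")
	}

	f, err := os.Open(localPath)
	if err != nil {
		return err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	if actual := base64.StdEncoding.EncodeToString(h.Sum(nil)); actual != expected {
		return fmt.Errorf("md5 mismatch: got %s, want %s", actual, expected)
	}
	return nil
}

func main() {
	// Hypothetical URL and path, for illustration only.
	if err := checkDownload("https://example.com/game.zip", "game.zip"); err != nil {
		fmt.Println("integrity check failed:", err)
	}
}
```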

[...]

So, the 4MB blocks are something else entirely, and I've kind of abandoned the idea for now.

Using another format than zip is a technical possibility - but since our canonical archive format is zip, if I can find a way to make it work with zip, I will.

> My current understanding, though, is that you always need the full zip in order to heal. Meaning that even though we have a 16kB corruption on a 700MB download, it needs the whole thing to fix it. Is my understanding correct?

Almost, but nope.

The zip format is basically a series of compressed entries plus a dictionary (the central directory) that records where each entry lives.

Actually, the dictionary we care about is at the end of the file, but that's ok.

So, with httpfile (what butler uses to treat remote HTTP resources as if they were local files), we parse the dictionary, and then we can seek to any of the entries we want.

However, each entry is a single DEFLATE stream - and most DEFLATE decompressors don't support decompressing only part of a stream: you have to start at the beginning of the stream.

What I'm doing with zip_index is allowing DEFLATE stream decompression to be started from any of the 'checkpoints' in the stream.
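Conceptually, a checkpoint has to capture enough decoder state to pick up mid-stream - roughly something like this illustrative struct (not zip_index's actual representation):

```go
// Purely illustrative - zip_index's actual representation differs.
package checkpointsketch

// Checkpoint captures the decoder state needed to resume inflating a
// DEFLATE stream partway through, instead of starting from byte zero.
type Checkpoint struct {
	CompressedOffset   int64  // position reached in the compressed stream
	CompressedBits     uint8  // leftover bit position within that byte
	UncompressedOffset int64  // how many plain bytes were produced so far
	Window             []byte // the 32KiB back-reference window at that point
}
```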

Is that any clearer?

LennardF1989 commented 7 years ago

Yes, I got it now. Thank you for your time!

I've been maintaining my current launcher for almost 2.5 years, and I really appreciate seeing and talking about different implementations of similar functionality. I have no doubt you have a lot of good in store for itch.io as a platform :)

fasterthanlime commented 7 years ago

Errata: butler cp doesn't do any hash checks. butler dl does, my bad!

fasterthanlime commented 7 years ago

@LennardF1989 thanks a ton for bringing this up, and sorry for dismissing it early on. I forgot that we previously used dl to download files from the itch client, and now use cp. Since cp uses the HTTPFile abstraction, it wasn't aware of hashes and didn't do any integrity checking.

It now does (in master) - I'll release a stable butler after some QA.

fasterthanlime commented 7 years ago

Re-opening since we need to fix some CDN settings for this issue; will close when it actually takes effect.

fasterthanlime commented 7 years ago

CDN settings have been adjusted, everything's in place for us!

LennardF1989 commented 7 years ago

Awesome, thanks for reporting back :) Our discussion kind of went from one side to the other, so I completely lost track of what I initially reported!