itchio / butler

🎩 Command-line itch.io helper
MIT License

Fix wharf's resumable upload client behavior + collect error traces in case of service degradation #71

Closed quyse closed 8 years ago

quyse commented 8 years ago

I'm getting the following error while trying to push a new version of a file.

$ butler push linux quyse/flaw:model-editor-linux
> For channel `model-editor-linux`: last build is 3183, downloading its signature

> Pushing 8.0 MB (1 files, 0 dirs, 0 symlinks)
*errors.errorString itch.io API error: uploaded file does not exist in storage (build/4513/13440-signature-default: 404)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x490431)

What could this error mean? I've recently upgraded the Docker container that pushes builds during CI, but I don't think I've changed anything about butler.

quyse commented 8 years ago

Hmm, it looks like it happens randomly. I tried again and was able to push the build successfully, first from the normal system and then from the same Docker container (typing the command manually). Then I tried one more time and got the same error. Here it is again with the -v argument for longer logs; the file has already been pushed before, so this should be a trivial, near-zero-sized patch:

[gitlab-runner@5565939828f6 flaw]$ butler -v push linux quyse/flaw:model-editor-linux
> For channel `model-editor-linux`: last build is 4515, downloading its signature
Created signature build file: {ID:13452 UploadURL:https://storage.googleapis.com/itchio/build/4518/13452-signature-default?upload_id=AEnB2Up52INajDXc6Su6gI1kzyfh0uEjsU_KvWdsteFu2xnTdgkG_DISVfEcF2H6ZupLtOd5rLFJrW2_7PVAr05a3ROIkUKoIw UploadParams:map[]}
Created patch build file: {ID:13453 UploadURL:https://storage.googleapis.com/itchio/build/4518/13453-patch-default?upload_id=AEnB2Up8mty9AnguponC-0fiULr8tPoCVZGUbSBwpBoVSOsjkuhcrdBE8OlAhIXxJztZHaQEsxxG_FMDmyJb_5PvJ10ZTDFnpQ UploadParams:map[]}
Launching patch & signature channels
Waiting for source container
Got sourceContainer!

> Pushing 8.0 MB (1 files, 0 dirs, 0 symlinks)
Building diff context
|                   |   0.00%                                                  [upload 1] kicking off sender
[upload 1] sender blocking receive
[upload 0] kicking off sender
[upload 0] sender blocking receive
[upload 1] flushing buffered writer, 3044 written
[upload 0] flushing buffered writer, 95 written
[upload 1] closing pipe writer
[upload 1] closed pipe writer
[upload 0] closing pipe writer
[upload 1] everything closed! uploadedbytes = 0, totalbytes = 3044
>>>>>>>>>>> woo, got a done
[upload 0] scanner done
[upload 1] scanner done
[upload 0] closed pipe writer
[upload 0] everything closed! uploadedbytes = 0, totalbytes = 95
[upload 0] uploading chunk of 95 bytes
[upload 0] uploading 0-94, last? true
[upload 1] uploading chunk of 3044 bytes
[upload 1] uploading 0-3043, last? true
>>>>>>>>>>> woo, got a done
|####################| 100.00%       0 - almost there                          [upload 1] 3.0 kB uploaded, at 200 OK
[upload 1] sender done
[upload 1] done sent!
>>>>>>>>>>> woo, got a done
|####################| 100.00%       0 - almost there                          [upload 0] 95 B uploaded, at 308 
[upload 0] sender done
[upload 0] done sent!
>>>>>>>>>>> woo, got a done
*errors.errorString itch.io API error: uploaded file does not exist in storage (build/4518/13453-patch-default: 404)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x490431)
leafo commented 8 years ago

This means that the actual upload (coming from your computer) was never completed, but butler thinks it was and tried to continue to the next step. Looking at our storage bucket, I only see a successful upload for the signature file. The patch file is stuck in the "uploading" state.

Is there anything about your internet connection, by any chance, that might be interrupting the upload?

fasterthanlime commented 8 years ago

I have no theory about where the error comes from atm — if it were a logic error in butler, it seems like it would've popped up before (over 4,500 builds have been pushed for many different games so far).

Maybe some kind of Google Cloud Storage problem? Those happen, and transparency about them isn't always ideal. I'd be happy if it were a butler bug, because that would mean we could solve the issue once and for all, but for now I don't know what would actually cause this.

quyse commented 8 years ago

Is there anything about your internet connection, by any chance, that might be interrupting the upload?

Well, I don't know; my internet connection is usually very stable, since I use a VPN with a DigitalOcean machine as the exit node (so my crappy Russian ISP doesn't interfere with traffic), and I don't see anything suspicious in other networked programs, e.g. the browser or the package manager. It's still happening consistently, both in the Docker container and on the host system.

GitLab pipeline link: https://gitlab.com/quyse/flaw/pipelines/3971019 - see the deploys named itch; not a single retry succeeded. Before yesterday, I don't think I had seen a single butler failure in months.

I just got a different error on the host machine (and found the same error in yesterday's Docker logs too):

$ ~/.config/itch/bin/butler -v push linux quyse/flaw:model-editor-linux
• For channel `model-editor-linux`: last build is 4537, downloading its signature
Created signature build file: {ID:13514 UploadURL:https://storage.googleapis.com/itchio/build/4539/13514-signature-default?upload_id=AEnB2UorCYSC-Etm7BGT9QBx1w4y8-aE-wYql6YNR6n2Y8QUNLeFkxMXOKKGbBtcqxWuOBcQ0NLoxzlI8SemEP3n9jdcBa6OoA UploadParams:map[]}
Created patch build file: {ID:13515 UploadURL:https://storage.googleapis.com/itchio/build/4539/13515-patch-default?upload_id=AEnB2Uq9HzQDROTkWjfGMaf7wpCIXmoHv5OjO02QtK8CJkRj1GmcGaegwueHvnyX6DEqmDbY1dzrOUoNsefzsiJcmRH7Y-MZMg UploadParams:map[]}
Launching patch & signature channels
Waiting for source container
Got sourceContainer!

• Pushing 8.0 MB (1 files, 0 dirs, 0 symlinks)
Building diff context
▐                   ▌   0.00%         - network idle, 8.0 MB left              [upload 0] kicking off sender
[upload 1] kicking off sender
[upload 0] sender blocking receive
[upload 1] sender blocking receive
[upload 1] flushing buffered writer, 3044 written
[upload 1] closing pipe writer
[upload 0] flushing buffered writer, 95 written
[upload 0] closing pipe writer
[upload 0] closed pipe writer
[upload 0] everything closed! uploadedbytes = 0, totalbytes = 95
>>>>>>>>>>> woo, got a done
[upload 1] closed pipe writer
[upload 1] everything closed! uploadedbytes = 0, totalbytes = 3044
>>>>>>>>>>> woo, got a done
[upload 1] scanner done
[upload 1] uploading chunk of 3044 bytes
[upload 0] scanner done
[upload 1] uploading 0-3043, last? true
[upload 0] uploading chunk of 95 bytes
[upload 0] uploading 0-94, last? true
▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▌ 100.00%       0 - almost there                          [upload 1] 3.0 kB uploaded, at 308 
[upload 1] sender done
[upload 1] done sent!
>>>>>>>>>>> woo, got a done
▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▌ 100.00%       0 - almost there                          [upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said 
[upload 0] send error, bailing out
[upload 0] got sub error: HTTP 503 while uploading, bailing
*errors.errorString HTTP 503 while uploading                                    
/home/amos/builds/630f4286/1/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:283 (0x6571cd)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x490431)

A 503 error can't be explained as a connectivity problem, can it?

fasterthanlime commented 8 years ago

Ah, 503 is definitely a response from the server, and it's something the docs tell us how to deal with: https://cloud.google.com/storage/docs/xml-api/resumable-upload#step_4wzxhzdk17query_title_for_the_upload_status

So I can indeed fix that (and it does seem to be GCS being unstable, afaict - probably some reverse proxy realizing that the backend isn't available).
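
For reference, here's a minimal sketch of the status-check step those docs describe, assuming a plain net/http client (illustrative only, not wharf's actual code): you send an empty PUT to the upload session URL with `Content-Range: bytes */<total>`, and a 308 response tells you via the `Range` header how many bytes the server has actually committed.

```go
package gcsupload

import (
	"fmt"
	"net/http"
)

// queryUploadStatus asks a GCS resumable-upload session how many bytes it
// has stored so far. 200/201 means the object is complete; 308 returns a
// "Range: bytes=0-N" header indicating the committed prefix.
func queryUploadStatus(client *http.Client, sessionURL string, totalSize int64) (int64, error) {
	req, err := http.NewRequest("PUT", sessionURL, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Range", fmt.Sprintf("bytes */%d", totalSize))

	res, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer res.Body.Close()

	switch res.StatusCode {
	case 200, 201:
		return totalSize, nil // the upload already finished
	case 308:
		var last int64
		if _, err := fmt.Sscanf(res.Header.Get("Range"), "bytes=0-%d", &last); err != nil {
			return 0, nil // no Range header: nothing committed yet
		}
		return last + 1, nil
	default:
		return 0, fmt.Errorf("unexpected HTTP %d while querying upload status", res.StatusCode)
	}
}
```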

fasterthanlime commented 8 years ago

Hmm interesting, this one is failing because it can't chmod: https://gitlab.com/quyse/flaw/builds/3176071

quyse commented 8 years ago

Hmm interesting, this one is failing because it can't chmod: https://gitlab.com/quyse/flaw/builds/3176071

Yep, that's my fault. As I said, it's a freshly rebuilt Docker container; I put butler_creds in there manually as root, while butler runs as a normal user. I've fixed that :)

fasterthanlime commented 8 years ago

Right, but I still think butler should only warn about not being able to chmod, not straight up fail :) (if it has read access, that is)
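
A hypothetical sketch of that behaviour (assumed helper name, not butler's actual code): only treat a failed chmod as fatal if the file can't be read at all, otherwise log a warning and carry on.

```go
package butlersketch

import (
	"log"
	"os"
)

// chmodOrWarn tries to chmod a file; if that fails but the file is still
// readable, it logs a warning instead of aborting the whole push.
func chmodOrWarn(path string, mode os.FileMode) error {
	if err := os.Chmod(path, mode); err != nil {
		f, openErr := os.Open(path)
		if openErr != nil {
			return err // can't chmod and can't read: genuinely fatal
		}
		f.Close()
		log.Printf("warning: could not chmod %s: %v (continuing, file is readable)", path, err)
	}
	return nil
}
```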

mattiascibien commented 8 years ago

I've had the same 503 error over and over now... It seems pretty systematic.

HoratioZEDU commented 8 years ago

This is so far completely blocking me from pushing almost any builds. I've previously been getting 503 errors that interrupt the upload after a period of "network idle" when pushing larger files, and I get the error described above for smaller files that actually reach 100%. I hope this is fixed soon! The -v log I got for the former case is here: https://gist.github.com/HoratioZEDU/7a53c3637ddc0c9d9b4624c7ecbbdfea

mattiascibien commented 8 years ago

@HoratioZEDU that's the same thing that happened to me...

fasterthanlime commented 8 years ago

Looking at the code right now: we do have built-in retry logic for things like socket hang-ups, spurious DNS failures, etc., but not for this particular case. I'll add the status codes listed in the GCS code to the retry logic.
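
Something along these lines, as an illustration (assumed helper name, not wharf's actual API): treat the usual transient status codes (408, 429, and the 5xx family) as retryable, so a 503 from the upload endpoint goes through the existing retry logic instead of failing the push outright.

```go
package gcsupload

import "net/http"

// isRetriableStatus reports whether an HTTP status code should be treated
// as a transient failure and retried with backoff.
func isRetriableStatus(code int) bool {
	switch code {
	case http.StatusRequestTimeout, // 408
		http.StatusTooManyRequests,     // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway,          // 502
		http.StatusServiceUnavailable,  // 503
		http.StatusGatewayTimeout:      // 504
		return true
	}
	return false
}
```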

VinditCo commented 8 years ago

Hello, also getting this 503 issue - can't push anything. Any suggestions?

fasterthanlime commented 8 years ago

I've written a fix; I'm currently upgrading our build machines to Go 1.7 for macOS Sierra compatibility, smaller binaries, and new optimizations. A new version will be up in a bit.

fasterthanlime commented 8 years ago

A new bleeding-edge version of butler is now available - can you check that the problem is resolved?

See diff here: https://github.com/itchio/wharf/commit/dfdeafe7dc99552ba817a142a32337289c0dd802

quyse commented 8 years ago

That doesn't seem to help. I just upgraded to the latest butler:

$ butler --version
head, built on Aug 19 2016 @ 14:40:20

Then I retried the same CI build twice, and both times got the same `uploaded file does not exist in storage` error as in my first post.

VinditCo commented 8 years ago

version : head, built on Aug 19 2016 @ 15:40:26

From verbose mode:

• Pushing 4.2 GB (803 files, 304 dirs, 0 symlinks)
▐▓ ▌ 1.63% 2h22m11s - network idle, 4.0 GB left
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x41fc0b9]

goroutine 70 [running]:
panic(0x446ea20, 0xc420012090)
	/usr/local/Cellar/go/1.7/libexec/src/runtime/panic.go:500 +0x1a1
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*netError).Error(0xc42083a9f0, 0x0, 0x0)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:146 +0x29
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks.func2(0xc4230a0000, 0x1000000, 0x1000000, 0x0, 0x0, 0x0)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:224 +0x157
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks.func3(0xc420466000, 0xc420266a80, 0xc42042e0c0, 0xc42042e120, 0xc420314060, 0xc4203d21e0, 0xc4203d2120)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:294 +0x5d7
created by github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:305 +0x4c4

fasterthanlime commented 8 years ago

@VinditCo fixing that now.

@quyse I'm still not sure why that particular one happens. The -v log you posted earlier doesn't show any sign of an upload failing :(

fasterthanlime commented 8 years ago

(Sorry about the slow fixes, I'm fighting against seemingly random CI builder failures..)

fasterthanlime commented 8 years ago

@VinditCo can you try again with the latest bleeding-edge? (butler update --head should do the trick)

edit: actually the bleeding-edge deploy job got skipped in the CI, let me try that again...

VinditCo commented 8 years ago

@fasterthanlime sure thing, pulling and pushing now. Also having a dev hell day, so feel your pain ;)

fasterthanlime commented 8 years ago

@VinditCo see edit above - the latest 'head' doesn't have the fix for the last error you reported, but 0.15.5 does; pushing a new head now 📦

fasterthanlime commented 8 years ago

Alright, latest head deployed https://github.com/itchio/butler/commit/be3391a9e5f4eedc424bcdd14bc3baca6f89619b

VinditCo commented 8 years ago

Got the latest version; it seems to stick at the same % and not increase even when it resumes:

[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓ ▌ 0.81% 4h5m40s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 1 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓ ▌ 0.81% 6h32m19s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 4 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓ ▌ 0.81% 9h3m51s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 9 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓ ▌ 0.81% 12h36m29s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 16 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓ ▌ 0.81% 16h4m39s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 25 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓ ▌ 0.81% 19h11m20s - network idle, 4.1 GB left

HoratioZEDU commented 8 years ago

@VinditCo Can confirm that this appears to be the case.

fasterthanlime commented 8 years ago

I'm running some tests on my side, and I'm definitely noticing things that weren't there before.

Looking at HTTP headers, still investigating...

edit: I'm timing the requests - I get 200 or 308 responses in about 7-12s, whereas 503s only come after a whopping 1m34s.

fasterthanlime commented 8 years ago

I have a working theory now — sometimes the servers get only part of a block.

If it's not the last block, you get a 503 (for some reason) — and re-uploading the entire block seems to be adequate, more or less.

If it's the last block and you get a 308, that means the server didn't receive the last block completely, and you might need to resend it (or part of it?).

The resumable protocol wasn't the simplest thing even back when the servers weren't throwing 503s, but now it's a whole other ballgame...
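
To make the theory concrete, here's a rough sketch of that last-block handling (assumed names and simplified control flow, not wharf's actual resumable.go): if the final block comes back as 308 rather than 200/201, read the committed offset from the Range header and resend only the missing tail.

```go
package gcsupload

import (
	"bytes"
	"fmt"
	"net/http"
)

// sendLastBlock uploads the final block of a resumable upload and, if the
// server reports (via 308 + Range) that it only stored part of it, resends
// the missing tail instead of treating the upload as finished.
func sendLastBlock(client *http.Client, sessionURL string, block []byte, offset, total int64) error {
	for attempt := 0; attempt < 10; attempt++ {
		req, err := http.NewRequest("PUT", sessionURL, bytes.NewReader(block))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Range",
			fmt.Sprintf("bytes %d-%d/%d", offset, offset+int64(len(block))-1, total))

		res, err := client.Do(req)
		if err != nil {
			return err // transport error: leave it to the caller's retry logic
		}
		status := res.StatusCode
		rangeHeader := res.Header.Get("Range")
		res.Body.Close()

		switch status {
		case 200, 201:
			return nil // the server has the whole object
		case 308:
			// Only part of the block landed: figure out how much, resend the rest.
			var last int64 = -1
			fmt.Sscanf(rangeHeader, "bytes=0-%d", &last)
			committed := last + 1
			if committed >= total {
				return nil // everything is actually there
			}
			if committed > offset {
				block = block[committed-offset:]
				offset = committed
			}
			// If nothing new landed, the whole block gets resent on the next pass.
		default:
			return fmt.Errorf("HTTP %d while sending last block", status)
		}
	}
	return fmt.Errorf("giving up after repeated partial uploads of the last block")
}
```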

fasterthanlime commented 8 years ago

@VinditCo @HoratioZEDU @quyse which countries are you respectively uploading from? Come to think of it, wharf's backend uses the same resumable upload code, and it hasn't failed a build because of that particular issue yet.

HoratioZEDU commented 8 years ago

I've been uploading from the UK.

quyse commented 8 years ago

@fasterthanlime My uploads come from Netherlands (DigitalOcean's AMS 3).

fasterthanlime commented 8 years ago

And @VinditCo is in London, I'm in France... I see a pattern here. @quyse if it's not too much trouble, can you see if you can reproduce the issue from a US zone?

quyse commented 8 years ago

@fasterthanlime No problem, give me a minute

mattiascibien commented 8 years ago

@fasterthanlime Italy here, from a pretty fast connection.

quyse commented 8 years ago

@fasterthanlime From a newly spawned machine (NYC3 zone) - pushed successfully.

fasterthanlime commented 8 years ago

@quyse if you push 5 times in a row, do they all succeed? (taking build processing time into account)

quyse commented 8 years ago

@fasterthanlime Pushed 3 different builds into 3 channels, then re-pushed every build one more time. Everything succeeded, and -v logs show no errors/retries.

So, does that mean Europe is in trouble?

fasterthanlime commented 8 years ago

So, does that mean Europe is in trouble?

Yeah. That would explain why I can't find anybody else complaining about it — we're all pushing to US cloud storage from the EU, which apparently not a lot of people do.

In an ideal world, butler would have you push to whichever zone you're closest to (and then we'd have to move it in the backend), but that should only be an optimization, not something that's required to get GCS to work at all :(

fasterthanlime commented 8 years ago

Retrying on 308 seems to make it work. It's still slow and has to retry way more than would be reasonable, but it should fix the issue you were having, @quyse (i.e. "file does not exist in storage").

seems to stick at the same % and not increase even when it resumes

That case from @VinditCo seems pathological - many request failures in a row, which probably happened during a particularly bad time for the Google servers :( In all my experiments, retrying eventually works (butler will retry up to 15 times, with exponential backoff).
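
As an aside, an illustrative retry loop (not wharf's actual implementation) would look something like the following; the sleeps visible in the logs above (1, 4, 9, 16, 25 seconds) suggest a delay of attempt² seconds, capped at the 15 tries mentioned here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// withRetries runs op until it succeeds or the retry budget is exhausted,
// sleeping attempt*attempt seconds between attempts (matching the 1, 4, 9,
// 16, 25 second sleeps seen in the logs above).
func withRetries(op func() error) error {
	const maxTries = 15
	var lastErr error
	for attempt := 1; attempt <= maxTries; attempt++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		fmt.Printf("network error: %v\nSleeping %d seconds then retrying\n", lastErr, attempt*attempt)
		time.Sleep(time.Duration(attempt*attempt) * time.Second)
	}
	return fmt.Errorf("giving up after %d tries: %v", maxTries, lastErr)
}

func main() {
	// Toy usage: an operation that fails twice, then succeeds.
	calls := 0
	err := withRetries(func() error {
		calls++
		if calls < 3 {
			return errors.New("HTTP 503 while uploading")
		}
		return nil
	})
	fmt.Println("result:", err)
}
```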


@VinditCo @quyse @mattiascibien @HoratioZEDU Retry on 308 is in the latest bleeding-edge build (head); can you folks report whether it helped or not? (butler update --head should give you the latest, otherwise the links are here)

quyse commented 8 years ago

@fasterthanlime It works now! It's quite slow and retries multiple times, as you said, but it works. 3 pushes in a row succeeded. Here is the log, just in case.

fasterthanlime commented 8 years ago

@quyse Great! So at least it works until the Google storage service is fully restored. Thanks a bunch for testing!

fasterthanlime commented 8 years ago

I'm closing the issue since butler is now doing the right thing; we're just waiting for the service to be fully restored.

VinditCo commented 8 years ago

Hi everyone, thanks for looking into this! I will try the latest build. I'm also uploading from the UK - so it might be a Google thing.

IBwWG commented 8 years ago

Just to chip in: in my case it was consistently failing (on win7, linux32, linux64); an update to "bleeding edge butler" (good name for a rock band, btw... called it) does handle it more gracefully. I guess I have a particularly big app; it's still failing, now waiting 25 seconds:

 For channel `windows-beta`: last build is 4423, downloading its signature

 Pushing 407 MB (1472 files, 40 dirs, 0 symlinks)
██████████████████████ 100.00%      0s - almost there
network error: HTTP 308 while sending last block, that's no good.
Sleeping 1 seconds then retrying
██████████████████████ 100.00%      0s - almost there
network error: HTTP 503 while uploading
Sleeping 4 seconds then retrying
██████████████████████ 100.00%      0s - almost there
network error: HTTP 503 while uploading
Sleeping 9 seconds then retrying
██████████████████████ 100.00%      0s - almost there
network error: HTTP 308 while sending last block, that's no good.
Sleeping 16 seconds then retrying
██████████████████████ 100.00%      0s - almost there
...

Anyway thanks for the fix.

VinditCo commented 8 years ago

Another update on this. Pushing a large build from the UK, France, and Spain (all different team members) fails and doesn't resume correctly. It always hangs at 0.8%.

Using TunnelBear VPN via the USA fixes the problem. Might be a good idea to close down all EU servers and force everyone to USA servers until this starts working again?

Cheers

IBwWG commented 8 years ago

Working OK from Switzerland (eventually; all builds from win7, linux32, and linux64 uploaded correctly).

fasterthanlime commented 8 years ago

Might be a good idea to close down all EU servers and force everyone to USA servers until this starts working again?

Unfortunately that's not an option — butler always uploads to a US Google Cloud server; it just seems US Google Cloud servers have trouble receiving traffic from Europe at the moment. A workaround would be to run our own proxy and have butler use it if it sees abnormal error rates when uploading directly to Google Cloud. I was hoping the outage would be short enough that it didn't come to that.

fasterthanlime commented 8 years ago

Update: I've reached out to a Google employee; we're trying to find people inside Google, close to the Google Cloud Storage team, who could look into this particular issue. Hope! 🌠

fasterthanlime commented 8 years ago

Re-opening since the issue is making its way through Google internally - I'll post status reports every time I get a response from them.

fasterthanlime commented 8 years ago

Current response from Google team:

Hi Amos,

Apologies for the inconvenience here.

We're currently investigating your issue. I'll let you know once I have more information to share. Please hang in there.

quyse commented 8 years ago

22 minutes to push 16 MB! Google should be ashamed. https://gitlab.com/quyse/flaw/builds/3378034