Hmm, looks like it's appearing randomly. I tried again and was able to push a build successfully, from the normal system and then from the same docker container (typing the command manually). Then I tried it one more time and got the same error. Here it is again with the -v argument for longer logs; the file has already been pushed before, so it should be a trivial, near-zero-sized patch:
[gitlab-runner@5565939828f6 flaw]$ butler -v push linux quyse/flaw:model-editor-linux
> For channel `model-editor-linux`: last build is 4515, downloading its signature
Created signature build file: {ID:13452 UploadURL:https://storage.googleapis.com/itchio/build/4518/13452-signature-default?upload_id=AEnB2Up52INajDXc6Su6gI1kzyfh0uEjsU_KvWdsteFu2xnTdgkG_DISVfEcF2H6ZupLtOd5rLFJrW2_7PVAr05a3ROIkUKoIw UploadParams:map[]}
Created patch build file: {ID:13453 UploadURL:https://storage.googleapis.com/itchio/build/4518/13453-patch-default?upload_id=AEnB2Up8mty9AnguponC-0fiULr8tPoCVZGUbSBwpBoVSOsjkuhcrdBE8OlAhIXxJztZHaQEsxxG_FMDmyJb_5PvJ10ZTDFnpQ UploadParams:map[]}
Launching patch & signature channels
Waiting for source container
Got sourceContainer!
> Pushing 8.0 MB (1 files, 0 dirs, 0 symlinks)
Building diff context
| | 0.00% [upload 1] kicking off sender
[upload 1] sender blocking receive
[upload 0] kicking off sender
[upload 0] sender blocking receive
[upload 1] flushing buffered writer, 3044 written
[upload 0] flushing buffered writer, 95 written
[upload 1] closing pipe writer
[upload 1] closed pipe writer
[upload 0] closing pipe writer
[upload 1] everything closed! uploadedbytes = 0, totalbytes = 3044
>>>>>>>>>>> woo, got a done
[upload 0] scanner done
[upload 1] scanner done
[upload 0] closed pipe writer
[upload 0] everything closed! uploadedbytes = 0, totalbytes = 95
[upload 0] uploading chunk of 95 bytes
[upload 0] uploading 0-94, last? true
[upload 1] uploading chunk of 3044 bytes
[upload 1] uploading 0-3043, last? true
>>>>>>>>>>> woo, got a done
|####################| 100.00% 0 - almost there [upload 1] 3.0 kB uploaded, at 200 OK
[upload 1] sender done
[upload 1] done sent!
>>>>>>>>>>> woo, got a done
|####################| 100.00% 0 - almost there [upload 0] 95 B uploaded, at 308
[upload 0] sender done
[upload 0] done sent!
>>>>>>>>>>> woo, got a done
*errors.errorString itch.io API error: uploaded file does not exist in storage (build/4518/13453-patch-default: 404)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x490431)
This means that the actual upload (coming from your computer) was never completed, but butler thinks it was and tried to continue to the next step. Looking at our storage bucket, I only see a successful upload for the signature file. The patch file is stuck in the "uploading" state.
Anything related to your internet by any chance that might cause it to interrupt the upload?
I have no theory about where the error comes from at the moment; if it were a logic error in butler, it seems like it would've popped up before (over 4500 builds pushed for many different games so far).
Maybe some kind of Google Cloud Storage problem? They do happen, and transparency around them isn't always ideal. I'd be happy if it were a butler bug, because that means we could solve the issue once and for all, but for now I don't know what would actually cause this.
> Anything related to your internet by any chance that might cause it to interrupt the upload?
Well, I don't know; my internet connection is usually very stable since I use a VPN with a DigitalOcean machine as the exit node (so my crappy Russian ISP doesn't interfere with the traffic), and I don't see anything suspicious in other networked programs, e.g. the browser or the package manager. It's still happening consistently, both in the docker container and on the host system.
Gitlab pipeline link: https://gitlab.com/quyse/flaw/pipelines/3971019 - see the deploys named itch; not a single retry succeeded. I don't think I had seen a single butler failure in months before yesterday.
I just got a different error on the host machine (and found the same error yesterday in docker too):
$ ~/.config/itch/bin/butler -v push linux quyse/flaw:model-editor-linux
• For channel `model-editor-linux`: last build is 4537, downloading its signature
Created signature build file: {ID:13514 UploadURL:https://storage.googleapis.com/itchio/build/4539/13514-signature-default?upload_id=AEnB2UorCYSC-Etm7BGT9QBx1w4y8-aE-wYql6YNR6n2Y8QUNLeFkxMXOKKGbBtcqxWuOBcQ0NLoxzlI8SemEP3n9jdcBa6OoA UploadParams:map[]}
Created patch build file: {ID:13515 UploadURL:https://storage.googleapis.com/itchio/build/4539/13515-patch-default?upload_id=AEnB2Uq9HzQDROTkWjfGMaf7wpCIXmoHv5OjO02QtK8CJkRj1GmcGaegwueHvnyX6DEqmDbY1dzrOUoNsefzsiJcmRH7Y-MZMg UploadParams:map[]}
Launching patch & signature channels
Waiting for source container
Got sourceContainer!
• Pushing 8.0 MB (1 files, 0 dirs, 0 symlinks)
Building diff context
▐ ▌ 0.00% - network idle, 8.0 MB left [upload 0] kicking off sender
[upload 1] kicking off sender
[upload 0] sender blocking receive
[upload 1] sender blocking receive
[upload 1] flushing buffered writer, 3044 written
[upload 1] closing pipe writer
[upload 0] flushing buffered writer, 95 written
[upload 0] closing pipe writer
[upload 0] closed pipe writer
[upload 0] everything closed! uploadedbytes = 0, totalbytes = 95
>>>>>>>>>>> woo, got a done
[upload 1] closed pipe writer
[upload 1] everything closed! uploadedbytes = 0, totalbytes = 3044
>>>>>>>>>>> woo, got a done
[upload 1] scanner done
[upload 1] uploading chunk of 3044 bytes
[upload 0] scanner done
[upload 1] uploading 0-3043, last? true
[upload 0] uploading chunk of 95 bytes
[upload 0] uploading 0-94, last? true
▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▌ 100.00% 0 - almost there [upload 1] 3.0 kB uploaded, at 308
[upload 1] sender done
[upload 1] done sent!
>>>>>>>>>>> woo, got a done
▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▌ 100.00% 0 - almost there [upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
[upload 0] send error, bailing out
[upload 0] got sub error: HTTP 503 while uploading, bailing
*errors.errorString HTTP 503 while uploading
/home/amos/builds/630f4286/1/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:283 (0x6571cd)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x490431)
A 503 error can't really be explained as a connectivity problem, can it?
Ah, 503 is definitely a response from the server, and it's something the docs tell us how to deal with: https://cloud.google.com/storage/docs/xml-api/resumable-upload#step_4wzxhzdk17query_title_for_the_upload_status
So I can indeed fix that (and as far as I can tell it is indeed GCS being unstable; probably some reverse proxy realizing that the backend isn't available).
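For reference, here's a minimal Go sketch of the status check that the linked GCS doc describes: send an empty `PUT` with `Content-Range: bytes */<total>` to the resumable session URL and read how many bytes the server committed from the `Range` header of the 308 reply. The package, function name, parameters, and error handling are illustrative assumptions, not butler's actual code.

```go
package uploader

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// queryUploadStatus asks a GCS resumable-upload session how many bytes it has
// already committed. Per the XML API docs, an empty PUT with
// "Content-Range: bytes */<total>" returns 308 plus a "Range: bytes=0-N"
// header when the upload is incomplete.
func queryUploadStatus(client *http.Client, sessionURL string, totalSize int64) (int64, error) {
	req, err := http.NewRequest("PUT", sessionURL, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Range", fmt.Sprintf("bytes */%d", totalSize))

	res, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer res.Body.Close()

	switch res.StatusCode {
	case http.StatusOK, http.StatusCreated:
		// the server already has everything
		return totalSize, nil
	case 308:
		// "Range: bytes=0-N" means bytes 0..N are committed; resume at N+1
		r := res.Header.Get("Range")
		if r == "" {
			return 0, nil // nothing committed yet, resume from the start
		}
		last, err := strconv.ParseInt(r[strings.LastIndex(r, "-")+1:], 10, 64)
		if err != nil {
			return 0, err
		}
		return last + 1, nil
	default:
		return 0, fmt.Errorf("unexpected HTTP %d while querying upload status", res.StatusCode)
	}
}
```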
Hmm interesting, this one is failing because it can't chmod: https://gitlab.com/quyse/flaw/builds/3176071
> Hmm interesting, this one is failing because it can't chmod: https://gitlab.com/quyse/flaw/builds/3176071
Yep, that's my fault; as I said, it's a freshly rebuilt docker container, I put butler_creds in there manually as root, and butler runs as a normal user. I fixed that :)
Right, but I still think butler should only warn about not being able to chmod rather than fail outright :) (if it has read access, that is)
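A tiny sketch of that suggested behaviour, with a hypothetical `ensureMode` helper (this is not butler's actual code): if chmod fails but the file is still readable, log a warning and carry on instead of aborting the push.

```go
package pushfix

import (
	"log"
	"os"
)

// ensureMode tries to chmod a file, but only treats the failure as fatal
// if the file cannot be read at all; otherwise it warns and continues.
func ensureMode(path string, mode os.FileMode) error {
	if err := os.Chmod(path, mode); err != nil {
		f, openErr := os.Open(path)
		if openErr != nil {
			return err // not even readable: give up
		}
		f.Close()
		log.Printf("warning: could not chmod %s (%v), continuing anyway", path, err)
	}
	return nil
}
```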
Had the same 503 error over and over now... Seems pretty systematic.
This is so far completely blocking me from pushing almost any builds. I've been getting 503 errors that interrupt the upload after a period of "network idle" when pushing larger files, and I get the error described above for smaller files that actually reach 100%. I hope this is fixed soon! The -v log I got for the former case is here: https://gist.github.com/HoratioZEDU/7a53c3637ddc0c9d9b4624c7ecbbdfea
@HoratioZEDU it's the same thing that happened to me...
Looking at the code right now: we do have built-in retry logic for things like socket hangups, spurious DNS failures, etc., but not for that particular case. I'll add the list of status codes listed in the GCS code to the retry logic.
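As a rough illustration of that change (the helper name and exact status-code list are my assumptions, not the real wharf code): the GCS documentation treats 408, 429, and the 5xx family as transient, so those would join the existing network-error retries.

```go
package uploader

import "net/http"

// shouldRetry reports whether an upload attempt is worth retrying:
// either a plain network error (socket hangup, spurious DNS failure, ...)
// or one of the HTTP status codes GCS documents as transient.
func shouldRetry(statusCode int, err error) bool {
	if err != nil {
		return true // transport-level failure
	}
	switch statusCode {
	case http.StatusRequestTimeout, http.StatusTooManyRequests: // 408, 429
		return true
	}
	return statusCode >= 500 && statusCode < 600 // 500, 502, 503, 504, ...
}
```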
Hello, also getting this 503 issue - can't push anything. Any suggestions?
I've written a fix; I'm currently upgrading our build machines to Go 1.7 for macOS Sierra compatibility, smaller binaries, and new optimizations. A new version will be up in a bit.
A new bleeding-edge version of butler is now available - can you check that the problem is resolved?
See diff here: https://github.com/itchio/wharf/commit/dfdeafe7dc99552ba817a142a32337289c0dd802
It doesn't seem to help. I just upgraded to the latest butler:
$ butler --version
head, built on Aug 19 2016 @ 14:40:20
And retried the same CI build twice; both times I got the same `uploaded file does not exist in storage` error as in my first post.
version: head, built on Aug 19 2016 @ 15:40:26
From verbose mode:
```
• Pushing 4.2 GB (803 files, 304 dirs, 0 symlinks)
▐▓                    ▌   1.63%  2h22m11s - network idle, 4.0 GB left
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x41fc0b9]

goroutine 70 [running]:
panic(0x446ea20, 0xc420012090)
	/usr/local/Cellar/go/1.7/libexec/src/runtime/panic.go:500 +0x1a1
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*netError).Error(0xc42083a9f0, 0x0, 0x0)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:146 +0x29
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks.func2(0xc4230a0000, 0x1000000, 0x1000000, 0x0, 0x0, 0x0)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:224 +0x157
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks.func3(0xc420466000, 0xc420266a80, 0xc42042e0c0, 0xc42042e120, 0xc420314060, 0xc4203d21e0, 0xc4203d2120)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:294 +0x5d7
created by github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:305 +0x4c4
```
@VinditCo fixing that now.
@quyse I'm still not sure why that particular one happens. The -v log you posted earlier doesn't show any sign of an upload failing :(
(Sorry about the slow fixes, I'm fighting against seemingly random CI builder failures..)
@VinditCo can you try again with the latest bleeding-edge? (`butler update --head` should do the trick)
edit: actually the bleeding-edge deploy job got skipped in the CI, let me try that again...
@fasterthanlime sure thing, pulling and pushing now. Also having a dev hell day, so feel your pain ;)
@VinditCo see edit above - latest 'head' doesn't have the fix for the last error you reported, 0.15.5 does, pushing a new head now 📦
Alright, latest head deployed https://github.com/itchio/butler/commit/be3391a9e5f4eedc424bcdd14bc3baca6f89619b
Got latest version, seems to stick at % and not increase even when it resumes
```
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                    ▌   0.81%   4h5m40s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 1 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                    ▌   0.81%   6h32m19s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 4 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                    ▌   0.81%   9h3m51s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 9 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                    ▌   0.81%   12h36m29s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 16 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                    ▌   0.81%   16h4m39s - network idle, 4.1 GB left
[upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said
network error: HTTP 503 while uploading
Sleeping 25 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                    ▌   0.81%   19h11m20s - network idle, 4.1 GB left
```
@VinditCo Can confirm that this appears to be the case.
I'm running some tests on my side and am definitely noticing things that weren't there before. Looking at HTTP headers, still investigating...
edit: I'm timing the requests. I get 200 or 308 responses in about 7-12 s, whereas the 503s only come after a whopping 1m34s.
I have a working theory now: sometimes the servers only receive part of a block.
If it's not the last block, you get a 503 (for some reason), and re-uploading the entire block seems to work well enough.
If it is the last block and you get a 308, that means the server didn't receive the last block completely and you may need to resend it (or part of it?).
The resumable protocol wasn't the simplest thing back when the servers weren't throwing 503s, but now it's a whole other game...
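To make that theory concrete, here's a hypothetical sketch of the "resend the tail of the last block" idea, reusing the `queryUploadStatus` helper sketched earlier in this thread. `putChunk`, `finishLastBlock`, and all parameters are illustrative names, not wharf's actual API.

```go
package uploader

import (
	"bytes"
	"fmt"
	"net/http"
)

// putChunk uploads one chunk with the proper Content-Range header and
// returns the HTTP status code the server answered with.
func putChunk(client *http.Client, sessionURL string, chunk []byte, offset, totalSize int64) (int, error) {
	req, err := http.NewRequest("PUT", sessionURL, bytes.NewReader(chunk))
	if err != nil {
		return 0, err
	}
	end := offset + int64(len(chunk)) - 1
	req.Header.Set("Content-Range", fmt.Sprintf("bytes %d-%d/%d", offset, end, totalSize))
	res, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	res.Body.Close()
	return res.StatusCode, nil
}

// finishLastBlock keeps resending whatever part of the final block the server
// has not committed yet, using the status query to find out where to resume.
func finishLastBlock(client *http.Client, sessionURL string, block []byte, offset, totalSize int64) error {
	for {
		status, err := putChunk(client, sessionURL, block, offset, totalSize)
		if err != nil {
			return err
		}
		if status == http.StatusOK || status == http.StatusCreated {
			return nil // server confirmed the whole upload
		}
		if status != 308 {
			return fmt.Errorf("HTTP %d while sending last block", status)
		}
		// 308 on the final block: the server only kept part of it.
		committed, err := queryUploadStatus(client, sessionURL, totalSize)
		if err != nil {
			return err
		}
		if committed >= totalSize {
			return nil
		}
		if committed < offset {
			return fmt.Errorf("server only has %d bytes, cannot resume within this block", committed)
		}
		// drop the bytes the server already has and resend the rest
		block = block[committed-offset:]
		offset = committed
	}
}
```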
@VinditCo @HoratioZEDU @quyse which countries are you respectively uploading from? Come to think of it, wharf's backend uses the same resumable upload code, and it hasn't failed a build because of that particular issue yet.
I've been uploading from the UK.
@fasterthanlime My uploads come from the Netherlands (DigitalOcean's AMS3).
And @VinditCo is in London, I'm in France... I see a pattern here. @quyse if it's not too much trouble, can you see if you can reproduce the issue in a US zone?
@fasterthanlime No problem, give me a minute
@fasterthanlime Italy here, from a pretty fast connection.
@fasterthanlime From newly spawned machine (NYC3 zone) - pushed successfully.
@quyse if you push 5 times in a row, does it all succeed? (taking into account build processing time)
@fasterthanlime Pushed 3 different builds into 3 channels, then re-pushed every build one more time. Everything succeeded, and the `-v` logs show no errors/retries.
So, it means Europe in trouble?
> So, it means Europe in trouble?
Yeah. That would explain why I can't find anybody else complaining about it: we're all pushing to US cloud storage from the EU, which apparently not a lot of people do.
In an ideal world, butler would have you push to whichever zone you're closest to (and then we'd have to move it in the backend), but that should only be an optimization, not something that's required to get GCS to work at all :(
Retrying on 308 seems to make it work. It's still slow and has to retry way more than would be reasonable, but it should fix the issue you were having, @quyse (i.e. "file does not exist in storage").
> seems to stick at % and not increase even when it resumes
That case by @VinditCo seems pathological: many request failures in a row, which probably happened during a particularly bad time for the Google servers :( In all my experiments, retrying eventually works (butler will retry up to 15 times, with exponential backoff).
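For reference, a small sketch of that retry loop, with a hypothetical `uploadOnce` callback standing in for a single chunk upload. The 15-attempt cap is the one mentioned above; the growing delays mirror the ones visible in the logs (1, 4, 9, 16, 25... seconds).

```go
package uploader

import (
	"fmt"
	"log"
	"time"
)

// uploadWithRetries calls uploadOnce until it succeeds or the attempt
// budget is exhausted, sleeping longer after each failure.
func uploadWithRetries(uploadOnce func() error) error {
	const maxAttempts = 15
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = uploadOnce(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		sleep := time.Duration(attempt*attempt) * time.Second // 1s, 4s, 9s, ...
		log.Printf("network error: %v. Sleeping %v then retrying", err, sleep)
		time.Sleep(sleep)
	}
	return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, err)
}
```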
@VinditCo @quyse @mattiascibien @HoratioZEDU Retry on 308 is in the latest bleeding-edge build (head), can you folks report on whether it helped or not? (`butler update --head` should give you the latest, otherwise links are here)
@fasterthanlime It works now! It's quite slow and retries multiple times as you said, but it works. 3 pushes in a row succeeded. Here is the log, just in case.
@quyse Great! So at least it works until the Google storage service is fully restored. Thanks a bunch for testing!
I'm closing the issue since butler is now doing the right thing, we're just waiting for the service to be restored fully.
Hi everyone, thanks for looking into this! Will try the latest build. Also uploading from the UK - so might be a google thing
Just to chip in: in my case it was consistently failing (on win7, linux32, linux64); an update to "bleeding edge butler" (good name for a rock band, btw... called it) does handle it more gracefully. I guess I have a particularly big app; it's still failing, now waiting 25 seconds.
For channel `windows-beta`: last build is 4423, downloading its signature
Pushing 407 MB (1472 files, 40 dirs, 0 symlinks)
██████████████████████ 100.00% 0s - almost there
network error: HTTP 308 while sending last block, that's no good.
Sleeping 1 seconds then retrying
██████████████████████ 100.00% 0s - almost there
network error: HTTP 503 while uploading
Sleeping 4 seconds then retrying
██████████████████████ 100.00% 0s - almost there
network error: HTTP 503 while uploading
Sleeping 9 seconds then retrying
██████████████████████ 100.00% 0s - almost there
network error: HTTP 308 while sending last block, that's no good.
Sleeping 16 seconds then retrying
██████████████████████ 100.00% 0s - almost there
...
Anyway thanks for the fix.
Another update on this. Pushing a large build from the UK, France, and Spain (all different team members) fails and doesn't resume correctly. It always hangs at 0.8%.
Using TunnelBear VPN via the USA fixes the problem. Might be a good idea to close down all EU servers and force everyone to USA servers until this starts working again?
Cheers
Working OK from Switzerland (eventually, all builds from win7, linux32, and linux64 uploaded correctly.)
> Might be a good idea to close down all EU servers and force everyone to USA servers until this starts working again?
Unfortunately that's not an option — butler always uploads to a US Google Cloud server, it just seems US Google Cloud servers have trouble receiving traffic from Europe at the moment. A workaround we could do is have our own proxy and let butler use that if it sees abnormal error rates when uploading directly to Google Cloud. I was hoping the outage would be short enough that it didn't have to come to that.
Update: I've reached out to a Google employee, we're trying to find people inside Google close to the Google Cloud Storage team that could look into this particular issue. Hope! 🌠
Re-opening since the issue is making its way through Google internally; I'll be updating with status reports every time I get a response from them.
Current response from the Google team:
> Hi Amos,
>
> Apologies for the inconvenience here.
>
> We're currently investigating your issue. I'll let you know once I have more information to share. Please hang in there.
22 minutes to push 16 MB! Google should be ashamed. https://gitlab.com/quyse/flaw/builds/3378034
I'm getting the following error while trying to push a new version of a file. What could it possibly mean? I've recently upgraded the docker container that pushes builds during CI, but I don't think I've changed anything about butler.