I have observed this issue as well a few times while running the bake command. I did find the following Concourse issues which seem to be related: zstd compression opt-in; default to gzip. These all have to do with the underlying use of compression on uploads during fly execute. I took the example from the comment in Concourse issue 4621 and was able to get a fairly reproducible error running it against the bakery CLI Concourse server, though I had to use a slightly larger file size of 512k vs 128k:
uploading source ⠼ (1.1KiB/s)
error: 'uploading source' failed: Unexpected Response
Status: 500 Internal Server Error
Body:
NOTE: To run that test, I temporarily commented out the code in execute.js that tears down the docker-compose environment so that I could experiment with it outside of the task invocations.
Corresponding log entries from Concourse:
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
{"timestamp":"2020-04-21T15:28:01.721841815Z","level":"info","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.bad-stream-payload","data":{"error":"exit status 2","session":"5.1.2629","volume":"2ec66010-867e-415f-5d96-7901295d5778"}}
{"timestamp":"2020-04-21T15:28:01.723077003Z","level":"error","source":"atc","message":"atc.create-artifact.failed-to-stream-volume-contents","data":{"error":"failed to stream in to volume","session":"10630"}}
The switch to zstd seems to have happened in version 5.4: https://github.com/concourse/concourse/releases/tag/v5.4.0
Wow, good find on that GitHub issue. I searched for a long time without anything to show for it.
This is now blocking further CLI and CLI-driven testing work on output-producer-service. We have determined that a fix or workaround for this bug is necessary before the CLI, or the tests that use it, can be considered meaningful.
Looks like that PR you linked has recent activity. That's a good sign.
I'm documenting some of the testing / investigation I did as part of this issue, but the TL;DR is that the fix / workaround seems to be: in addition to waiting for the Concourse port to become responsive, we should wait a second for it to "settle". In other words, the following seems to fix the observed behavior:
diff --git a/bakery/src/cli/execute.js b/bakery/src/cli/execute.js
index 3cf0680..7bcedad 100644
--- a/bakery/src/cli/execute.js
+++ b/bakery/src/cli/execute.js
@@ -84,6 +84,9 @@ const flyExecute = async cmdArgs => {
console.log('waiting for concourse to wake up')
await waitPort({ protocol: 'http', host: 'localhost', port: 8080, path: '/api/v1/info', timeout: 120_000 })
+ console.log('waiting for concourse to -- "settle"')
+ sleep.sleep(1)
+
console.log('logging in')
const login = spawn('fly', [
'login',
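As an aside, a slightly more defensive variant of the same idea would be to poll the /api/v1/info endpoint until it answers successfully a few times in a row, instead of sleeping a fixed second. This is just a sketch, not what execute.js does today; it assumes Node 18+ (so global fetch is available), and waitForConcourseToSettle is a made-up name:

// Sketch: consider Concourse "settled" once /api/v1/info has answered OK a few
// times in a row, rather than sleeping for a fixed second after the port opens.
// Assumes Node 18+ (global fetch); waitForConcourseToSettle is a hypothetical name.
const waitForConcourseToSettle = async ({ url = 'http://localhost:8080/api/v1/info', successesNeeded = 3, delayMs = 500 } = {}) => {
  let successes = 0
  while (successes < successesNeeded) {
    try {
      const res = await fetch(url)
      successes = res.ok ? successes + 1 : 0
    } catch {
      successes = 0
    }
    await new Promise(resolve => setTimeout(resolve, delayMs))
  }
}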
If that leaves you somewhat unsatisfied and / or asking yourself "what kind of solution is that???", all I can say is it's the one that solved the problem I was left with after I solved the root problem (I thought) I discovered. You may find some solace in reading the remainder of this post...or not...
The general approach I took initially was to construct a synthetic example that captured the issue and assume it correlated to what we were experiencing. The idea being fixing the former would address the latter. And away we go...
Based upon this comment, I created a local environment outside of the CLI execution consisting of a basic docker-compose.yml (copied from execute.js) and a simple dummy task.
Simple docker-compose.yml file for spinning up Concourse and Postgres:
---
version: "3"
services:
  concourse-db:
    image: postgres
    environment:
      - POSTGRES_DB=concourse
      - POSTGRES_PASSWORD=concourse_pass
      - POSTGRES_USER=concourse_user
      - PGDATA=/database
  concourse:
    image: concourse/concourse:6.0
    command: quickstart
    privileged: true
    depends_on: [concourse-db]
    ports: ["8080:8080"]
    environment:
      - CONCOURSE_POSTGRES_HOST=concourse-db
      - CONCOURSE_POSTGRES_USER=concourse_user
      - CONCOURSE_POSTGRES_PASSWORD=concourse_pass
      - CONCOURSE_POSTGRES_DATABASE=concourse
      - CONCOURSE_EXTERNAL_URL
      - CONCOURSE_ADD_LOCAL_USER=admin:admin
      - CONCOURSE_MAIN_TEAM_LOCAL_USER=admin
That can be spun up per the following:
docker-compose -f /openstax/output-producer-service/docker-compose-2296.yml up -d
./fly login -k -t bakery-cli -c http://127.0.0.1:8080 -u admin -p admin
./fly -t bakery-cli sync
Simple test task.yml used to exercise the server:
---
platform: linux
image_resource:
  type: docker-image
  source:
    repository: bash
inputs:
  - name: book
run:
  path: bash
  args:
    - -c
    - |-
      echo "Hello World"
Run a fly execute test command using the task and a dummy book directory:
mkdir book
./fly -t bakery-cli execute -c task.yml -i book=./book
The command above would pass consistently. However, the following failed most of the time (I didn't keep track of exact stats, but I'd say ~80% of the time):
dd if=/dev/zero of=book/dummy_file bs=1024 count=512
./fly -t bakery-cli execute -c task.yml -i book=./book
The error looked representative of what we see in CLI invocations that fail:
uploading book ⠼ (714b/s)
error: 'uploading book' failed: Unexpected Response
Status: 500 Internal Server Error
Body:
Using the same setup but a smaller file would succeed (always, I believe):
dd if=/dev/zero of=book/dummy_file bs=1024 count=10
./fly -t bakery-cli execute -c task.yml -i book=./book
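To put rough numbers on that ~80% failure rate, a small Node sketch like the one below could drive the failing repro in a loop and tally failures. This is hypothetical and not part of the repo; it assumes fly is on PATH and that the bakery-cli target, task.yml, and the 512k book/ input above are already set up:

// Sketch: run the fly execute repro N times and count failures. Assumes `fly`
// is on PATH, the bakery-cli target is logged in, and task.yml / book/ exist.
const { spawnSync } = require('child_process')

const runs = 20
let failures = 0
for (let i = 0; i < runs; i++) {
  const result = spawnSync('fly', ['-t', 'bakery-cli', 'execute', '-c', 'task.yml', '-i', 'book=./book'], { stdio: 'inherit' })
  if (result.status !== 0) failures += 1
}
console.log(`${failures}/${runs} runs failed`)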
So, given I had a fairly reproducible error that seemed to correlate to our observations, I set out to "fix" this constrained scenario.
fly with a simple fix

Since the failure wasn't deterministic, I assumed there was potentially some type of race condition. While looking at the code in the fly CLI referenced in the issue linked above, I thought perhaps there was a missing Close call on the writer returned by zstd.NewWriter, which would mean the final compressed frame might never be flushed and could explain the server-side "tar: Unexpected EOF in archive" errors. That also aligned somewhat with the changes in PR #5024 (which was merged but reverted later on). I quickly tried the following change to fly:
diff --git a/fly/commands/internal/executehelpers/uploads.go b/fly/commands/internal/executehelpers/uploads.go
index 2fde78b21..377d602be 100644
--- a/fly/commands/internal/executehelpers/uploads.go
+++ b/fly/commands/internal/executehelpers/uploads.go
@@ -19,7 +19,14 @@ func Upload(bar *mpb.Bar, team concourse.Team, path string, includeIgnored bool,
archiveStream, archiveWriter := io.Pipe()
go func() {
- archiveWriter.CloseWithError(tarfs.Compress(zstd.NewWriter(archiveWriter), path, files...))
+ zstdWriter := zstd.NewWriter(archiveWriter)
+ err := tarfs.Compress(zstdWriter, path, files...)
+ if err != nil {
+ _ = zstdWriter.Close()
+ archiveWriter.CloseWithError(err)
+ return
+ }
+ archiveWriter.CloseWithError(zstdWriter.Close())
}()
return team.CreateArtifact(bar.ProxyReader(archiveStream), platform)
The above seemed to fix my synthetic reproducible bug (woo hoo!), so I proceeded to test the modified fly with our automated tests.
fly with CLI tests

Long story short, while the modified fly seemed to improve things (e.g. I felt like there were fewer failures, but some were still occurring), it wasn't a complete fix. Given it did solve the synthetic issue, I thought perhaps the client side was addressed and maybe there was still some issue on the server side.
While testing, I realized that if I commented out the part of the automated tests where they stop the docker containers, the tests didn't seem to fail. That's when I tried adding the sleep() in the test flow (while reinserting the docker-compose stop during cleanup). When that seemed to work, I reverted to the vanilla version of fly (I had commented out the sync in the test code to avoid overwriting my modified version), and things still seemed to work.
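As an aside, rather than hand-commenting the sync, a guard in the test code could skip it when an environment variable is set. This is only a sketch; BAKERY_SKIP_FLY_SYNC is a made-up name, not something the code currently reads:

// Sketch: skip `fly sync` when a locally patched fly binary should be preserved.
// BAKERY_SKIP_FLY_SYNC is a hypothetical environment variable, not an existing one.
const { spawnSync } = require('child_process')

if (process.env.BAKERY_SKIP_FLY_SYNC) {
  console.log('skipping fly sync to preserve the local fly binary')
} else {
  spawnSync('fly', ['-t', 'bakery-cli', 'sync'], { stdio: 'inherit' })
}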
While the client side change to fly does seem to solve an issue (and this was actually fixed until the change had to be reverted, though I suspect it will get fixed again as part of the ongoing PR #5398), at this point it seems the sleep() hack may be sufficient for our needs. I did try different values for the sleep, but 1 second seemed to do the trick (perhaps we'll find this should be increased; if we're using duct tape, an additional layer or two doesn't hurt, right?).
More testing / thinking needed :sweat_smile: (including on the CI server, as so far I've mostly been testing locally). After I convince myself I wasn't just chasing my tail / making a human error somewhere + get some feedback, I can also point the client side aspect out upstream just in case it's useful.
That's hilarious. I was literally planning earlier today to add a sleep myself to let it settle after startup, since I reasoned there's no way it fails that much under normal running circumstances. I'm glad there's some actual data behind it, and I have no problem sleeping for 1 second if it solves an error of this severity.
I will wait to see whether the work around is found to be sufficient for our needs to determine next steps on this issue.
@rnathuji @tomjw64 This PR is in the Merged column, but it does not look like it is ready and testable?
It is ready. The associated PR was attached to another issue.
Tested on col11844 (latest) and col11992 (1.9):
node ./bakery/src/cli/execute.js fetch staging.cnx.org col11844 latest -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
node ./bakery/src/cli/execute.js assemble col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
node ./bakery/src/cli/execute.js bake col11844 /Users/om9/Documents/Projects/cnx-recipes/recipes/output/physics.css /Users/om9/Documents/Projects/cnx-recipes/styles/output/physics-pdf.css -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
node ./bakery/src/cli/execute.js mathify col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
node ./bakery/src/cli/execute.js build-pdf col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
node ./bakery/src/cli/execute.js assemble-meta col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
node ./bakery/src/cli/execute.js bake-meta col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
node ./bakery/src/cli/execute.js disassemble col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
node ./bakery/src/cli/execute.js jsonify col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data
PASSED
An Internal Server Error occurs in the assemble, bake, mathify, disassemble, and jsonify tasks when uploading artifacts (book or cnx-recipes). The error is below and happens with or without a VPN connection. To work around it, I have to repeat the action a few times (1-10x) until it succeeds:
PR: https://app.zenhub.com/workspaces/content-engineering-tech-team-5af1f4cc12da5e6d74331b60/issues/openstax/cnx/957