openstax / webview

Connexions web site
http://cnx.org
GNU Affero General Public License v3.0

Internal server error when using assemble, bake, etc tasks with CLI #2296

Closed: omehes closed this issue 4 years ago

omehes commented 4 years ago

An internal server error occurs in the assemble, bake, mathify, disassemble, and jsonify tasks when uploading artifacts (book or cnx-recipes). The error is below and happens with or without a VPN connection. To get past it, I have to repeat the action a few times (1-10x) until it succeeds:

(venv) Ottos-MBP-700-13:output-producer-service om9$ node ./bakery/src/cli/execute.js bake col24361 /Users/om9/Documents/Projects/e.e/cnx-recipes/recipes/output/biology.css /Users/om9/Documents/Projects/e.e/cnx-recipes/styles/output/biology-pdf.css -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data 
Starting t_concourse-db_1 ... done
Starting t_concourse_1    ... done
logging in
logging in to team 'main'

could not reach the Concourse server called bakery-cli:

    Get http://127.0.0.1:8080/api/v1/info: EOF

is the targeted Concourse running? better go catch it lol
logging in
logging in to team 'main'

target saved
syncing
version 6.0.0 already matches; skipping
executing
uploading book           done
uploading assembled-book done
uploading cnx-recipes    ⠼ (1.3KiB/s)
error: 'uploading cnx-recipes' failed: Unexpected Response
Status: 500 Internal Server Error
Body:

Error: Subprocess failed with code 1
    at completion (/Users/om9/Documents/Projects/output-producer-service/bakery/src/cli/execute.js:10:17)
    at flyExecute (/Users/om9/Documents/Projects/output-producer-service/bakery/src/cli/execute.js:128:11)
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  message: 'Subprocess failed with code 1'
}
cleaning up
Stopping t_concourse_1    ... done
Stopping t_concourse-db_1 ... done

PR: https://app.zenhub.com/workspaces/content-engineering-tech-team-5af1f4cc12da5e6d74331b60/issues/openstax/cnx/957

rnathuji commented 4 years ago

I have observed this issue as well a few times while running the bake command. I did find the following Concourse issues which seem to be related:

These all have to do with the underlying use of compression on the uploads during fly execute. I took the example from the comment in Concourse issue 4621 and was able to get a fairly reproducible error running it against the bakery CLI Concourse server, though I had to use a slightly larger file size of 512k vs. 128k:

uploading source ⠼ (1.1KiB/s)
error: 'uploading source' failed: Unexpected Response
Status: 500 Internal Server Error
Body:

NOTE: To run that test, I temporarily commented out the code in execute.js that tears down the docker-compose environment so that I could experiment with it outside of the task invocations.

Corresponding log entries from Concourse:

tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
{"timestamp":"2020-04-21T15:28:01.721841815Z","level":"info","source":"baggageclaim","message":"baggageclaim.api.volume-server.stream-in.bad-stream-payload","data":{"error":"exit status 2","session":"5.1.2629","volume":"2ec66010-867e-415f-5d96-7901295d5778"}}
{"timestamp":"2020-04-21T15:28:01.723077003Z","level":"error","source":"atc","message":"atc.create-artifact.failed-to-stream-volume-contents","data":{"error":"failed to stream in to volume","session":"10630"}}

The switch to zstd seems to have happened in version 5.4: https://github.com/concourse/concourse/releases/tag/v5.4.0

tomjw64 commented 4 years ago

Wow, good find on that GitHub issue. I searched for a long time without anything to show for it.

tomjw64 commented 4 years ago

This is now blocking further CLI and CLI-driven testing work on output-producer-service. We have determined that a fix or workaround for this bug is necessary before the CLI, or the tests that use it, can be considered meaningful.

tomjw64 commented 4 years ago

Looks like that PR you linked has recent activity. That's a good sign.

rnathuji commented 4 years ago

Update on initial investigation and findings

I'm documenting some of the testing / investigation that I did as part of this issue, but the TL;DR is that the fix / workaround seems to be: in addition to waiting for the Concourse port to become responsive, we should wait a second for it to "settle". In other words, the following seems to fix the observed behavior:

diff --git a/bakery/src/cli/execute.js b/bakery/src/cli/execute.js
index 3cf0680..7bcedad 100644
--- a/bakery/src/cli/execute.js
+++ b/bakery/src/cli/execute.js
@@ -84,6 +84,9 @@ const flyExecute = async cmdArgs => {
     console.log('waiting for concourse to wake up')
     await waitPort({ protocol: 'http', host: 'localhost', port: 8080, path: '/api/v1/info', timeout: 120_000 })

+    console.log('waiting for concourse to -- "settle"')
+    sleep.sleep(1)
+
     console.log('logging in')
     const login = spawn('fly', [
       'login',

If that leaves you somewhat unsatisfied and / or asking yourself "what kind of solution is that???", all I can say is it's the one that solved the problem I was left with after I fixed what I thought was the root problem. You may find some solace in reading the remainder of this post...or not...

The general approach I took initially was to construct a synthetic example that captured the issue and assume it correlated to what we were experiencing, the idea being that fixing the former would address the latter. And away we go...

Step 1: Create a reproducible failure for investigating root cause

Based upon this comment, I created a local environment outside of the CLI execution consisting of a basic docker-compose.yml (copied from execute.js) and a simple dummy task.

Simple docker-compose.yml file for spinning up Concourse and Postgres:

---
version: "3"

services:
  concourse-db:
    image: postgres
    environment:
      - POSTGRES_DB=concourse
      - POSTGRES_PASSWORD=concourse_pass
      - POSTGRES_USER=concourse_user
      - PGDATA=/database

  concourse:
    image: concourse/concourse:6.0
    command: quickstart
    privileged: true
    depends_on: [concourse-db]
    ports: ["8080:8080"]
    environment:
      - CONCOURSE_POSTGRES_HOST=concourse-db
      - CONCOURSE_POSTGRES_USER=concourse_user
      - CONCOURSE_POSTGRES_PASSWORD=concourse_pass
      - CONCOURSE_POSTGRES_DATABASE=concourse
      - CONCOURSE_EXTERNAL_URL
      - CONCOURSE_ADD_LOCAL_USER=admin:admin
      - CONCOURSE_MAIN_TEAM_LOCAL_USER=admin

That can be spun up per the following:

docker-compose -f /openstax/output-producer-service/docker-compose-2296.yml up -d
./fly login -k -t bakery-cli -c http://127.0.0.1:8080 -u admin -p admin
./fly -t bakery-cli sync

Simple test task.yml used to exercise the server:

---
platform: linux

image_resource:
  type: docker-image
  source:
    repository: bash

inputs:
- name: book

run:
  path: bash
  args:
    - -c
    - |-
      echo "Hello World"

Run fly execute test command using the task and a dummy book directory:

mkdir book
./fly -t bakery-cli execute -c task.yml -i book=./book

The command above would pass consistently. However, the following failed most of the time (I didn't keep track of exact stats, but I'd say roughly 80% of the time):

dd if=/dev/zero of=book/dummy_file bs=1024 count=512
./fly -t bakery-cli execute -c task.yml -i book=./book

The error looked representative of what we see in CLI invocations that fail:

uploading book ⠼ (714b/s)
error: 'uploading book' failed: Unexpected Response
Status: 500 Internal Server Error
Body:

Using the same setup with a smaller file would succeed (always, I believe):

dd if=/dev/zero of=book/dummy_file bs=1024 count=10
./fly -t bakery-cli execute -c task.yml -i book=./book

So, given I had a fairly reproducible error that seemed to correlate to our observations, I set out to "fix" this constrained scenario.

Step 2: Patching fly with a simple fix

Since the failure wasn't deterministic, I assumed there was potentially some type of race condition. While looking at the code in the fly CLI referenced in the issue linked above, I thought perhaps there was a missing Close call on the writer returned by zstd.NewWriter. That also aligned somewhat with the changes in PR #5024 (which was merged but later reverted). I quickly tried the following change to fly:

diff --git a/fly/commands/internal/executehelpers/uploads.go b/fly/commands/internal/executehelpers/uploads.go
index 2fde78b21..377d602be 100644
--- a/fly/commands/internal/executehelpers/uploads.go
+++ b/fly/commands/internal/executehelpers/uploads.go
@@ -19,7 +19,14 @@ func Upload(bar *mpb.Bar, team concourse.Team, path string, includeIgnored bool,
        archiveStream, archiveWriter := io.Pipe()

        go func() {
-               archiveWriter.CloseWithError(tarfs.Compress(zstd.NewWriter(archiveWriter), path, files...))
+               zstdWriter := zstd.NewWriter(archiveWriter)
+               err := tarfs.Compress(zstdWriter, path, files...)
+               if err != nil {
+                       _ = zstdWriter.Close()
+                       archiveWriter.CloseWithError(err)
+                       return
+               }
+               archiveWriter.CloseWithError(zstdWriter.Close())
        }()

        return team.CreateArtifact(bar.ProxyReader(archiveStream), platform)

The above seemed to fix my synthetic reproducible bug (woo hoo!), so I proceeded to test the modified fly with our automated tests.
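
As a side note on why a missing Close would matter: a framed / streaming compressor only flushes its final block and footer when the writer is closed, so if Close is never called the receiving side sees a truncated stream, which lines up with the "tar: Unexpected EOF in archive" entries in the baggageclaim logs above. Here is a minimal sketch of that failure mode, using compress/gzip from the Go standard library as a stand-in for the zstd writer fly uses (this illustrates the hypothesis; it is not fly's actual code):

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress gzips data, optionally skipping Close on the writer to mimic
// the suspected bug in fly's upload path.
func compress(data []byte, closeWriter bool) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	if closeWriter {
		zw.Close() // flushes the final block and writes the gzip footer
	}
	return buf.Bytes()
}

func main() {
	payload := bytes.Repeat([]byte("some book content "), 32*1024) // ~576 KiB

	for _, closeWriter := range []bool{true, false} {
		zr, err := gzip.NewReader(bytes.NewReader(compress(payload, closeWriter)))
		if err != nil {
			fmt.Printf("closeWriter=%v: reader error: %v\n", closeWriter, err)
			continue
		}
		_, err = io.Copy(io.Discard, zr)
		// With closeWriter=false the copy fails with "unexpected EOF",
		// analogous to tar choking on the truncated archive server-side.
		fmt.Printf("closeWriter=%v: decompress error: %v\n", closeWriter, err)
	}
}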

Step 3: Try testing modified fly with CLI tests

Long story short, while the modified fly seemed to improve things (e.g. I felt like there were fewer failures, but there were still some occurring), it wasn't a complete fix. Given that it did solve the synthetic issue, I thought perhaps the client side was addressed and maybe there was still some issue on the server side.

While testing, I realized that if I commented out the part of the automated tests that stops the docker containers, the tests didn't seem to fail. That's when I tried adding the sleep() in the test flow (while reinstating the docker-compose stop during cleanup). When that seemed to work, I reverted to the vanilla version of fly (I had commented out the sync in the test code to avoid overwriting my modified version), and things still seemed to work.

Summary

While the client-side change to fly does seem to solve an issue (and this was actually fixed until the change had to be reverted, though I suspect it will get fixed again as part of the ongoing PR #5398), at this point it seems the sleep() hack may be sufficient for our needs. I did try different values for the sleep, but 1 second seemed to do the trick (perhaps we'll find this should be increased -- if we're using duct tape, an additional layer or two doesn't hurt, right?).

More testing / thinking is needed :sweat_smile: (including on the CI server, since so far I've mostly been testing locally). After I convince myself I wasn't just chasing my tail / making a human error somewhere + get some feedback, I can also point the client-side aspect out upstream in case it's useful.

tomjw64 commented 4 years ago

That's hilarious. I was literally planning earlier today to add in a sleep myself to let it settle after startup, since I reasoned there's no way it fails that much under normal running circumstances. I'm glad there's some actual data behind it, and I have no problem sleeping for 1 second if it solves an error of this severity.

rnathuji commented 4 years ago

I will wait to see whether the work around is found to be sufficient for our needs to determine next steps on this issue.

omehes commented 4 years ago

@rnathuji @tomjw64 This PR is in the Merged column, but it does not look like it is ready and testable?

tomjw64 commented 4 years ago

It is ready. The associated PR was attached to another issue.

omehes commented 4 years ago

Tested on col11844 (latest) and col11992 (1.9):

node ./bakery/src/cli/execute.js fetch staging.cnx.org col11844 latest -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

node ./bakery/src/cli/execute.js assemble col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

node ./bakery/src/cli/execute.js bake col11844 /Users/om9/Documents/Projects/cnx-recipes/recipes/output/physics.css /Users/om9/Documents/Projects/cnx-recipes/styles/output/physics-pdf.css -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

node ./bakery/src/cli/execute.js mathify col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

node ./bakery/src/cli/execute.js build-pdf col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

node ./bakery/src/cli/execute.js assemble-meta col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

node ./bakery/src/cli/execute.js bake-meta col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

node ./bakery/src/cli/execute.js disassemble col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

node ./bakery/src/cli/execute.js jsonify col11844 -c /Users/om9/Documents/Projects/output-producer-service -d /Users/om9/Documents/Projects/output-producer-service/bakery/data

PASSED