tektoncd / results

Long term storage of execution results.

Storing logs fails with `operation error S3: UploadPart, https response error StatusCode: 404 ... api error NoSuchUpload: UnknownError` #719

Open iainsproat opened 7 months ago

iainsproat commented 7 months ago

There is a step in a pipeline run that fails with an error (the application exits with a non-zero exit code). Unfortunately, its log content is not stored.

All other pipelines and steps have their logs stored as expected, so this does not seem to be a general configuration problem.

Expected Behavior

All steps in the pipeline would have their logs stored, even (and especially) if the user application exited with a non-zero exit code. A copy of the pipeline can be found below.

This fails consistently across numerous repeated runs.

Actual Behavior

The Tekton results API logs the following two errors:

{
  "level": "error",
  "ts": 1709314179.3972826,
  "caller": "v1alpha2/logs.go:103",
  "msg": "operation error S3: UploadPart, https response error StatusCode: 404, RequestID: tx00000563224ac1343fcba-0065e21083-4fb58942-ams3c, HostID: 4fb58942-ams3c-ams3-zg03, api error NoSuchUpload: UnknownError",
  "stacktrace": "github.com/tektoncd/results/pkg/api/server/v1alpha2.(*Server).UpdateLog.func1\n\tgithub.com/tektoncd/results/pkg/api/server/v1alpha2/logs.go:103\ngithub.com/tektoncd/results/pkg/api/server/v1alpha2.(*Server).UpdateLog\n\tgithub.com/tektoncd/results/pkg/api/server/v1alpha2/logs.go:156\ngithub.com/tektoncd/results/proto/v1alpha2/results_go_proto._Logs_UpdateLog_Handler\n\tgithub.com/tektoncd/results/proto/v1alpha2/results_go_proto/api_grpc.pb.go:686\ngithub.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/recovery.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.5/interceptors/recovery/interceptors.go:48\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:121\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/auth.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/auth/auth.go:66\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:53\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1673\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1787\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1016"
}
{
  "level": "error",
  "ts": 1709314179.3973947,
  "caller": "zap/options.go:212",
  "msg": "finished streaming call with code Unknown",
  "grpc.auth_disabled": true,
  "grpc.start_time": "2024-03-01T17:29:38Z",
  "system": "grpc",
  "span.kind": "server",
  "grpc.service": "tekton.results.v1alpha2.Logs",
  "grpc.method": "UpdateLog",
  "peer.address": "10.244.1.76:58650",
  "grpc.user": "system:serviceaccount:tekton-pipelines:tekton-results-watcher",
  "grpc.issuer": "https://kubernetes.default.svc.cluster.local",
  "error": "operation error S3: UploadPart, failed to rewind transport stream for retry, request stream is not seekable",
  "grpc.code": "Unknown",
  "grpc.time_duration_in_ms": 766,
  "stacktrace": "github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.DefaultMessageProducer\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/options.go:212\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:61\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1673\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1787\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1016"
}
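
The second error ("failed to rewind transport stream for retry, request stream is not seekable") appears to come from the AWS SDK's retry path: a failed UploadPart can only be retried if the request body can be rewound, i.e. if it implements io.Seeker. A minimal sketch of that distinction (illustrative only, not taken from the Results code):

package main

import (
	"bytes"
	"fmt"
	"io"
)

// isRewindable reports whether an upload body could be rewound for a retry;
// the SDK needs the body to be seekable for that.
func isRewindable(body io.Reader) bool {
	_, ok := body.(io.Seeker)
	return ok
}

func main() {
	// A part buffered in memory is seekable, so a failed UploadPart can be replayed.
	fmt.Println(isRewindable(bytes.NewReader([]byte("log chunk")))) // true

	// A one-way stream (for example a pipe fed from the gRPC UpdateLog stream)
	// has no Seek method, so a retry fails with "request stream is not seekable".
	pipeReader, _ := io.Pipe()
	fmt.Println(isRewindable(pipeReader)) // false
}

If that reading is correct, it would also explain why nothing ends up at the recorded path: the multipart upload is never completed.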

I can see that a corresponding log record has been stored:

{
  "name": "e271e90551-68622d75ea-b5dd1651b51a2a99cc0d/results/8b706226-37fa-4e79-add7-5e6f1d4e7534/records/2c4acb8d-3234-3176-93b9-6df94deb472c",
  "id": "7ca4d473-7f0f-41e1-92b3-a85083b9e7ad",
  "uid": "7ca4d473-7f0f-41e1-92b3-a85083b9e7ad",
  "data": {
    "type": "results.tekton.dev/v1alpha2.Log",
    "value": "eyJraW5kIjogIkxvZyIsICJzcGVjIjogeyJ0eXBlIjogIlMzIiwgInJlc291cmNlIjogeyJ1aWQiOiAiMDAxNDNmOTktODU5MC00MjJmLWFkMTQtM2FkOThjOGFhYWE3IiwgImtpbmQiOiAiVGFza1J1biIsICJuYW1lIjogImI1ZGQxNjUxYjUxYTJhOTljYzBkLTcxZmExNDI0MzctM2EwMGQ0YmJjMi0wIiwgIm5hbWVzcGFjZSI6ICJlMjcxZTkwNTUxLTY4NjIyZDc1ZWEtYjVkZDE2NTFiNTFhMmE5OWNjMGQifX0sICJzdGF0dXMiOiB7InBhdGgiOiAiZTI3MWU5MDU1MS02ODYyMmQ3NWVhLWI1ZGQxNjUxYjUxYTJhOTljYzBkLzJjNGFjYjhkLTMyMzQtMzE3Ni05M2I5LTZkZjk0ZGViNDcyYy9iNWRkMTY1MWI1MWEyYTk5Y2MwZC03MWZhMTQyNDM3LTNhMDBkNGJiYzItMC1sb2ciLCAic2l6ZSI6IDIwNzA5Mzc2MH0sICJtZXRhZGF0YSI6IHsidWlkIjogIjJjNGFjYjhkLTMyMzQtMzE3Ni05M2I5LTZkZjk0ZGViNDcyYyIsICJuYW1lIjogImI1ZGQxNjUxYjUxYTJhOTljYzBkLTcxZmExNDI0MzctM2EwMGQ0YmJjMi0wLWxvZyIsICJuYW1lc3BhY2UiOiAiZTI3MWU5MDU1MS02ODYyMmQ3NWVhLWI1ZGQxNjUxYjUxYTJhOTljYzBkIiwgImNyZWF0aW9uVGltZXN0YW1wIjogbnVsbH0sICJhcGlWZXJzaW9uIjogInJlc3VsdHMudGVrdG9uLmRldi92MWFscGhhMiJ9"
  },
  "etag": "7ca4d473-7f0f-41e1-92b3-a85083b9e7ad-1709314181023022210",
  "createdTime": "2024-03-01T17:29:38.541771Z",
  "createTime": "2024-03-01T17:29:38.541771Z",
  "updatedTime": "2024-03-01T17:29:41.023022Z",
  "updateTime": "2024-03-01T17:29:41.023022Z"
}

For convenience, the base64-decoded content of the above value is:

{
  "kind": "Log",
  "spec": {
    "type": "S3",
    "resource": {
      "uid": "00143f99-8590-422f-ad14-3ad98c8aaaa7",
      "kind": "TaskRun",
      "name": "b5dd1651b51a2a99cc0d-71fa142437-3a00d4bbc2-0",
      "namespace": "e271e90551-68622d75ea-b5dd1651b51a2a99cc0d"
    }
  },
  "status": {
    "path": "e271e90551-68622d75ea-b5dd1651b51a2a99cc0d/2c4acb8d-3234-3176-93b9-6df94deb472c/b5dd1651b51a2a99cc0d-71fa142437-3a00d4bbc2-0-log",
    "size": 207093760
  },
  "metadata": {
    "uid": "2c4acb8d-3234-3176-93b9-6df94deb472c",
    "name": "b5dd1651b51a2a99cc0d-71fa142437-3a00d4bbc2-0-log",
    "namespace": "e271e90551-68622d75ea-b5dd1651b51a2a99cc0d",
    "creationTimestamp": null
  },
  "apiVersion": "results.tekton.dev/v1alpha2"
}
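
(The decoding was done with a throwaway snippet along these lines; any base64 decoder gives the same result. Reading the value from the command line is an arbitrary choice for illustration.)

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// os.Args[1] is the data.value field from the record above.
	raw, err := base64.StdEncoding.DecodeString(os.Args[1])
	if err != nil {
		panic(err)
	}

	// Re-indent the embedded JSON for readability.
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}
	pretty, _ := json.MarshalIndent(doc, "", "  ")
	fmt.Println(string(pretty))
}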

When inspecting the content of the S3 blob storage directly, I can see that no object has been stored at the given path.

Likewise the logs cannot be retrieved via the API, as would be expected given the above symptoms.
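
For completeness, this is roughly how the retrieval was attempted, assuming the REST mapping of the Logs service; the path construction, host, and token handling below are my reconstruction for illustration, not copied from the Results docs:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Placeholder host and token; in a real cluster these come from however
	// tekton-results-api-service is exposed and from a service account token.
	host := os.Getenv("RESULTS_API_URL")
	token := os.Getenv("RESULTS_API_TOKEN")

	// Log name derived from the record shown above, with "records" swapped
	// for "logs" (my assumption about the Logs service naming).
	name := "e271e90551-68622d75ea-b5dd1651b51a2a99cc0d/results/8b706226-37fa-4e79-add7-5e6f1d4e7534/logs/2c4acb8d-3234-3176-93b9-6df94deb472c"

	req, err := http.NewRequest(http.MethodGet, host+"/apis/results.tekton.dev/v1alpha2/parents/"+name, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}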

Steps to Reproduce the Problem

Unfortunately I have not yet been able to create a minimal failing example of the user workload that generates the logs. The pipeline is as follows:

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  labels:
    app.kubernetes.io/instance: pipelines
    app.kubernetes.io/name: pipelines
    app.kubernetes.io/part-of: e271e90551
    app.kubernetes.io/version: 68622d75ea
    kubernetes.io/metadata.name: e271e90551-68622d75ea-b5dd1651b51a2a99cc0d
  name: e271e90551-68622d75ea
  namespace: e271e90551-68622d75ea-b5dd1651b51a2a99cc0d
spec:
  tasks:
  - name: 71fa142437-3a00d4bbc2-0
    taskSpec:
      metadata: {}
      spec: null
      steps:
      - args:
        - /userinput/values.json
        command:
        - dotnet
        - function.dll
        computeResources:
          limits:
            cpu: "1"
            memory: 1000Mi
          requests:
            cpu: "1"
            memory: 1000Mi
        image: private_registry/71fa142437:d66859c
        name: 71fa142437-3a00d4bbc2-0
        volumeMounts:
        - mountPath: /userinput
          name: 71fa142437-3a00d4bbc2-0
          readOnly: true
      volumes:
      - name: 71fa142437-3a00d4bbc2-0
        secret:
          secretName: 71fa142437-3a00d4bbc2-0

Additional Info

Kubernetes version: v1.28.2

Tekton Pipelines version: v0.56.1

Tekton Results version: v0.9.1

iainsproat commented 7 months ago

Excerpt from the above:

"status": {
    "path": "e271e90551-68622d75ea-b5dd1651b51a2a99cc0d/2c4acb8d-3234-3176-93b9-6df94deb472c/b5dd1651b51a2a99cc0d-71fa142437-3a00d4bbc2-0-log",
    "size": 207093760

Does this status.size value indicate a log size of 207 MB? Or am I misinterpreting this?

iainsproat commented 6 months ago

To answer my own question: the value seems to be in bytes, so this would be roughly 207 MB: https://github.com/tektoncd/results/blob/32314594d4e5cf6de35e13cb7429ae216a969781/pkg/api/server/v1alpha2/logs.go#L173C21-L173C28
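
For reference, converting that value:

package main

import "fmt"

func main() {
	const sizeBytes = 207093760 // status.size from the record above

	fmt.Printf("%.1f MB\n", float64(sizeBytes)/1e6)      // 207.1 MB (decimal megabytes)
	fmt.Printf("%.1f MiB\n", float64(sizeBytes)/(1<<20)) // 197.5 MiB (binary mebibytes)
}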

sayan-biswas commented 6 months ago

@iainsproat I believe the size of the log has something to do with this. Are you still having difficulties storing large logs?

iainsproat commented 6 months ago

Hi @sayan-biswas - yes, this continues to be an issue for me.

Unfortunately I haven't yet had time to create a minimal reproducing example, so I'm not yet sure whether it's an issue with our cloud provider, with my configuration of S3 or of logs in Tekton Results, or something amiss with Tekton Results itself.