openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

Error: image cache/mysql:pullthrough not found #8399

Closed: mfojtik closed this issue 8 years ago

mfojtik commented 8 years ago

Potential flake. Can be seen here: https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5534/console

Error: image cache/mysql:pullthrough not found
!!! Error in hack/../hack/../test/end-to-end/core.sh:129
    'docker pull ${DOCKER_REGISTRY}/cache/mysql:pullthrough' exited with status 1
Call stack:
    1: hack/../hack/../test/end-to-end/core.sh:129 main(...)
Exiting with status 1
soltysh commented 8 years ago

@miminar any idea why the docker registry is falling back to the v1 registry? This is what I see in the logs:

time="2016-04-07T13:33:03.823468206Z" level=info msg="response completed" go.version=go1.6 http.request.host="172.30.16.220:5000" http.request.id=942210a3-4315-4106-9471-1bf66e4cc6af http.request.method=GET http.request.remoteaddr="172.18.6.120:60242" http.request.uri="/v2/cache/mysql/blobs/sha256:efd26ecc95486998b41b3fe167236e3fb3e109c66dd1a51ab5161e40b06cf486" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" http.response.contenttype="application/octet-stream" http.response.duration=1.921677219s http.response.status=200 http.response.written=51342828 instance.id=f79ccee9-263a-4bae-aeb8-77e72c26188f 
172.18.6.120 - - [07/Apr/2016:13:33:01 +0000] "GET /v2/cache/mysql/blobs/sha256:efd26ecc95486998b41b3fe167236e3fb3e109c66dd1a51ab5161e40b06cf486 HTTP/1.1" 200 51342828 "" "docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64"
time="2016-04-07T13:33:03.925080703Z" level=info msg="response completed" go.version=go1.6 http.request.host="172.30.16.220:5000" http.request.id=2fecfaf9-1878-4588-9710-f3ec32ee52ba http.request.method=GET http.request.remoteaddr="172.18.6.120:60243" http.request.uri="/v2/cache/mysql/blobs/sha256:87c7daf4abfabeef18ae19cd661840282d149989c039cdd2f426535e60e0e703" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" http.response.contenttype="application/octet-stream" http.response.duration=1.976474849s http.response.status=200 http.response.written=63951640 instance.id=f79ccee9-263a-4bae-aeb8-77e72c26188f 
172.18.6.120 - - [07/Apr/2016:13:33:01 +0000] "GET /v2/cache/mysql/blobs/sha256:87c7daf4abfabeef18ae19cd661840282d149989c039cdd2f426535e60e0e703 HTTP/1.1" 200 63951640 "" "docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64"
172.17.0.1 - - [07/Apr/2016:13:33:05 +0000] "GET /healthz HTTP/1.1" 200 0 "" "Go-http-client/1.1"
172.17.0.1 - - [07/Apr/2016:13:33:05 +0000] "GET /healthz HTTP/1.1" 200 0 "" "Go-http-client/1.1"
172.18.6.120 - - [07/Apr/2016:13:33:13 +0000] "GET /v1/repositories/cache/mysql/images HTTP/1.1" 404 19 "" "docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64"

It's downloading the image just fine, but then it suddenly falls back to the v1 registry. Any ideas?
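
For illustration, a minimal sketch (in Go, not the actual docker daemon code) of the fallback pattern visible in the log above: a docker 1.9-era client tries the v2 endpoint first, and any failure triggers the legacy `GET /v1/repositories/<name>/images` probe. The registry address and image name below are placeholders mirroring the failing test.

```go
package main

import (
	"fmt"
	"net/http"
)

// pullProbe is a simplified illustration of the v2-then-v1 probing in
// the log above: try the v2 manifest endpoint first; any failure
// triggers a probe of the legacy v1 API, which our registry answers
// with a 404 ("Error: image ... not found" on the client side).
func pullProbe(registry, name, tag string) error {
	v2URL := fmt.Sprintf("http://%s/v2/%s/manifests/%s", registry, name, tag)
	if resp, err := http.Get(v2URL); err == nil {
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return nil // the v2 pull proceeds from here
		}
	}
	// Fallback: the GET /v1/repositories/<name>/images seen in the log.
	v1URL := fmt.Sprintf("http://%s/v1/repositories/%s/images", registry, name)
	resp, err := http.Get(v1URL)
	if err != nil {
		return err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("image %s:%s not found", name, tag)
	}
	return nil
}

func main() {
	// Placeholder values taken from the failing test run.
	fmt.Println(pullProbe("172.30.16.220:5000", "cache/mysql", "pullthrough"))
}
```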

soltysh commented 8 years ago

I have the full docker registry log, if needed.

smarterclayton commented 8 years ago

If this is a flake that will affect users in production, it's P0 (until we prove otherwise).

soltysh commented 8 years ago

After several rounds of investigation, the problem I've managed to identify is this error:

Error getting ImageStreamTag "pullthrough": imagestreamtags "mysql:pullthrough" not found

#8461 wrapped the image-import and docker pull commands in os::cmd::expect_success, which should either fix the problem or give us some more info. I'm especially hoping to get the stdout and stderr of the image-import command, which might be failing for some unknown reason and causing the subsequent docker pull to fail. For now I'm advising lowering the priority of this bug, since it's not a permanent issue but rather an intermittent one.

liggitt commented 8 years ago

seen in https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_integration/41/console

soltysh commented 8 years ago

This looks like a different error:

time="2016-04-13T04:42:29.034858911Z" level=error msg="response completed with error" err.code="BLOB_UNKNOWN" err.detail=sha256:d347a60f2547ec23a95f6a82a7d86bfc3175111f4f730daf0b09e4d3adc875f3 err.message="blob unknown to registry" go.version=go1.6 http.request.host="172.30.25.55:5000" http.request.id=f5b9eb63-a4c3-4c22-af71-1f74d953f831 http.request.method=HEAD http.request.remoteaddr="172.18.15.86:42491" http.request.uri="/v2/cache/mysql/blobs/sha256:d347a60f2547ec23a95f6a82a7d86bfc3175111f4f730daf0b09e4d3adc875f3" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=595.411786ms http.response.status=404 http.response.written=157 instance.id=be1a0e30-a5ce-41bd-8e98-682ed6469885 vars.digest="sha256:d347a60f2547ec23a95f6a82a7d86bfc3175111f4f730daf0b09e4d3adc875f3" vars.name="cache/mysql" 

Looking into it.

miminar commented 8 years ago

Here's another: https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5597/console (PR #8438)

soltysh commented 8 years ago

@miminar good catch. Now that we have the import output, this is what we got: import failed: you may not have access to the Docker image "mysql:latest". We're currently being hit by this problem in two places:

  1. at import time
  2. at pull time

To nail down the problem we'd have to log the entire traffic between us and the Docker registry. It looks like the Docker registry is sometimes blocking us, requiring auth (?), because I can't find any other reasonable explanation, especially since the image we're trying to get is public, and even Docker-provided. @smarterclayton @pweil- @miminar thoughts?

soltysh commented 8 years ago

For no. 1 we could modify our tests so that if they detect an unauthorized error they retry; I have no idea how to address no. 2 without code modifications yet.

smarterclayton commented 8 years ago

Has the integration test flaked recently?

smarterclayton commented 8 years ago

Can we get more tracing in those cases? Possibly the entire response body and headers so we can compare?
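
One way to get that trace (a hedged sketch of the approach, not what was actually wired into the registry): wrap the client transport in an http.RoundTripper that dumps the full request and response, headers and bodies included. The target URL below is a placeholder.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
)

// dumpTransport logs the complete request and response (headers and
// bodies) for every round trip, which is the kind of trace that would
// let us compare the failing and succeeding registry calls.
type dumpTransport struct {
	next http.RoundTripper
}

func (t *dumpTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if out, err := httputil.DumpRequestOut(req, true); err == nil {
		log.Printf("registry request:\n%s", out)
	}
	resp, err := t.next.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	// DumpResponse with body=true re-buffers the body, so the caller
	// can still read it afterwards.
	if out, dumpErr := httputil.DumpResponse(resp, true); dumpErr == nil {
		log.Printf("registry response:\n%s", out)
	}
	return resp, nil
}

func main() {
	client := &http.Client{Transport: &dumpTransport{next: http.DefaultTransport}}
	// Hypothetical call against a local registry; the address is illustrative.
	resp, err := client.Get("http://127.0.0.1:5000/v2/")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```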


soltysh commented 8 years ago

That's what I was thinking about, but I'm not sure if it's possible post-mortem. I'll dig into it.

smarterclayton commented 8 years ago

This may be related - saw this in a run against our registry:

STEP: dockercfg with svrAddr 172.30.165.70:5000 user serviceaccount pass eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJleHRlbmRlZC10ZXN0LWZvcmNlcHVsbC1oYmtjeC00djh2aSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJidWlsZGVyLXRva2VuLXRzbzlqIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6ImJ1aWxkZXIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiIwNDA4NWFlOS0wNWE4LTExZTYtODMxNS0wZWE5NmMxYTA0ZjMiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6ZXh0ZW5kZWQtdGVzdC1mb3JjZXB1bGwtaGJrY3gtNHY4dmk6YnVpbGRlciJ9.tY0PwtEFs9d2CyNioUeq9op6twd-9_mxZK2xvaRFTTOrY0FhKTPrm6NwqXM1gvgTIM24jzowR6SWAEWgstDJYjRNEX-rirBG7EBqkIzskYQ5RWaN_pe71JHCZluJm_KGgf2QKxlrDZQN-AxcJGLge35-Yiw3wkZKmvJMB56jJzzPSDtg6o8jciPvjKSoD6J0zhpzR5FcYMwqyqzF2RevMJs4QnRvijvmHickXxR3OjXiP3nLUWF-dVsWb64ZcRgH2bnMotSCQlj2EcsFPlhmlxurfF9N91OK5aiLxhGXjxsyXLe8cQz-2rlAY007JWbrl0UEGtXGY6ep6KegN5KwPQ email serviceaccount@example.org 
STEP: Collecting events from namespace "extended-test-forcepull-hbkcx-4v8vi".
Apr 18 16:59:49.453: INFO: At 2016-04-18 16:56:57 -0400 EDT - event for forcepull-bldr-1-build: {default-scheduler } Scheduled: Successfully assigned forcepull-bldr-1-build to 172.18.2.57
Apr 18 16:59:49.453: INFO: At 2016-04-18 16:57:00 -0400 EDT - event for forcepull-bldr-1-build: {kubelet 172.18.2.57} Pulled: Container image "openshift/origin-docker-builder:11af88f" already present on machine
Apr 18 16:59:49.453: INFO: At 2016-04-18 16:57:01 -0400 EDT - event for forcepull-bldr-1-build: {kubelet 172.18.2.57} Created: Created container with docker id 81a3ea79eb24
Apr 18 16:59:49.453: INFO: At 2016-04-18 16:57:02 -0400 EDT - event for forcepull-bldr-1-build: {kubelet 172.18.2.57} Started: Started container with docker id 81a3ea79eb24
Apr 18 16:59:49.472: INFO: POD                      NODE         PHASE      GRACE  CONDITIONS
Apr 18 16:59:49.472: INFO: docker-registry-1-0dmbo  172.18.2.57  Running           [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:39:51 -0400 EDT  }]
Apr 18 16:59:49.472: INFO: router-1-deploy          172.18.2.57  Failed            []
Apr 18 16:59:49.472: INFO: router-2-rdhzw           172.18.2.57  Running           [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:40:20 -0400 EDT  }]
Apr 18 16:59:49.472: INFO: scale-up-igonv           172.18.2.57  Running    30s    [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:59:16 -0400 EDT  }]
Apr 18 16:59:49.472: INFO: forcepull-bldr-1-build   172.18.2.57  Failed            [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:59:47 -0400 EDT ContainersNotReady containers with unready status: [docker-build]}]
Apr 18 16:59:49.472: INFO: src-1-build              172.18.2.57  Succeeded         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:58:23 -0400 EDT PodCompleted }]
Apr 18 16:59:49.472: INFO: src-2-build              172.18.2.57  Succeeded         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:58:51 -0400 EDT PodCompleted }]
Apr 18 16:59:49.472: INFO: src-3-build              172.18.2.57  Succeeded         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:59:16 -0400 EDT PodCompleted }]
Apr 18 16:59:49.472: INFO: src-4-build              172.18.2.57  Succeeded         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:59:32 -0400 EDT PodCompleted }]
Apr 18 16:59:49.472: INFO: src-5-build              172.18.2.57  Succeeded         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-04-18 16:59:47 -0400 EDT PodCompleted }]
Apr 18 16:59:49.472: INFO: 
Apr 18 16:59:49.480: INFO: 
Logging node info for node 172.18.2.57
Apr 18 16:59:49.488: INFO: Node Info: &{{ } {172.18.2.57   /api/v1/nodes/172.18.2.57 a22526c7-05a5-11e6-8315-0ea96c1a04f3 12989 0 2016-04-18 16:39:22 -0400 EDT <nil> <nil> map[kubernetes.io/hostname:172.18.2.57] map[]} { 172.18.2.57  false} {map[cpu:{2.000 DecimalSI} memory:{7934193664.000 BinarySI} pods:{110.000 DecimalSI}] map[pods:{110.000 DecimalSI} cpu:{2.000 DecimalSI} memory:{7934193664.000 BinarySI}]  [{OutOfDisk False 2016-04-18 16:59:44 -0400 EDT 2016-04-18 16:39:22 -0400 EDT KubeletHasSufficientDisk kubelet has sufficient disk space available} {Ready True 2016-04-18 16:59:44 -0400 EDT 2016-04-18 16:39:33 -0400 EDT KubeletReady kubelet is posting ready status}] [{LegacyHostIP 172.18.2.57} {InternalIP 172.18.2.57}] {{10250}} {f9370ed252a14f73b014c1301a9b6d1b EC22CE7C-CCE4-AF2E-3176-6F71E6F9D19E 0f561e5b-3aac-4c55-828c-2164576e221d 3.10.0-229.7.2.el7.x86_64 Red Hat Enterprise Linux Server 7.2 (Maipo) docker://1.9.1 v1.2.0-36-g4a3f9c5 v1.2.0-36-g4a3f9c5} [{[<none>:<none>] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-hom8f-shared/src:4] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-hom8f-shared/src:3] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-hom8f-shared/src:2] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-hom8f-shared/src:1] 100} {[172.30.165.70:5000/extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder:latest] 443310209} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-1dzdb/sized:foo] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-1dzdb/new:refused] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-1dzdb/other:refused] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-1dzdb/other:second] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-1dzdb/other:third] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-1dzdb/sized:second] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-1dzdb/sized:first] 100} {[172.30.165.70:5000/extended-test-imagequota-admission-r6349-1dzdb/sized:refused] 100} {[openshift/deployment-example:v2] 7585584} {[openshift/deployment-example:v1] 7585584} {[openshift/node:11af88f openshift/node:latest] 636119156} {[openshift/origin-f5-router:11af88f openshift/origin-f5-router:latest] 543735769} {[openshift/origin-sti-builder:11af88f openshift/origin-sti-builder:latest] 543735769} {[openshift/origin-gitserver:11af88f openshift/origin-gitserver:latest] 581438641} {[openshift/origin-docker-builder:11af88f openshift/origin-docker-builder:latest] 543735769} {[openshift/origin-recycler:11af88f openshift/origin-recycler:latest] 34975664} {[openshift/origin-deployer:11af88f openshift/origin-deployer:latest] 543735769} {[openshift/origin-docker-registry:11af88f openshift/origin-docker-registry:latest] 341697503} {[openshift/origin-keepalived-ipfailover:11af88f openshift/origin-keepalived-ipfailover:latest] 437550234} {[openshift/origin-haproxy-router:11af88f openshift/origin-haproxy-router:latest] 431806667} {[openshift/origin:11af88f openshift/origin:latest] 543735769} {[<none>:<none>] 7585584} {[<none>:<none>] 7585584} {[openshift/hello-openshift:latest openshift/hello-openshift:11af88f openshift/hello-openshift:c3ada66] 7584592} {[openshift/origin-custom-docker-builder:11af88f openshift/origin-custom-docker-builder:c3ada66 openshift/origin-custom-docker-builder:latest] 402543189} {[openshift/node:c3ada66] 636119245} {[openshift/origin-f5-router:c3ada66] 
543735769} {[openshift/origin-sti-builder:c3ada66] 543735769} {[openshift/origin-gitserver:c3ada66] 581438641} {[openshift/origin-docker-builder:c3ada66] 543735769} {[openshift/origin-recycler:c3ada66] 34975664} {[openshift/origin-deployer:c3ada66] 543735769} {[openshift/origin-egress-router:latest openshift/origin-egress-router:11af88f openshift/origin-egress-router:c3ada66] 298249834} {[openshift/origin-docker-registry:c3ada66] 341697503} {[openshift/origin-keepalived-ipfailover:c3ada66] 437550234} {[openshift/origin-haproxy-router:c3ada66] 431806667} {[openshift/origin:c3ada66] 543735769} {[openshift/openvswitch:11af88f openshift/openvswitch:c3ada66 openshift/openvswitch:latest] 228138943} {[openshift/origin-pod:11af88f openshift/origin-pod:c3ada66 openshift/origin-pod:latest] 1590312} {[rhel7.2:latest] 256294412} {[rhel7:latest rhel7.1:latest] 753998457} {[docker.io/openshift/origin-base:latest] 279579166} {[openshift/origin-release:latest] 711006224} {[openshift/origin-haproxy-router-base:latest] 298871742} {[openshift/origin-base:latest] 279586407} {[docker.io/openshift/hello-openshift:latest] 7584592} {[docker.io/openshift/origin-haproxy-router:latest] 431807192} {[docker.io/openshift/origin-pod:latest] 1590312} {[] 539322259} {[docker.io/centos/ruby-22-centos7:latest] 443849262} {[docker.io/openshift/ruby-20-centos7:latest] 443310209} {[docker.io/nginx:latest] 182610738} {[docker.io/perl:latest] 653322790} {[docker.io/centos:centos7] 196677849} {[registry.access.redhat.com/rhel7.2:latest] 203268847} {[docker.io/busybox:latest] 1112820} {[docker.io/redis:latest] 177537508} {[gcr.io/google_containers/nettest:1.7] 24051275} {[gcr.io/google_containers/netexec:1.4] 7293444} {[gcr.io/google_containers/busybox:1.24] 1113554} {[gcr.io/google_containers/mounttest:0.6] 2084693} {[gcr.io/google_containers/hostexec:1.2] 13185747} {[gcr.io/google_containers/volume-ceph:0.1] 675657560} {[gcr.io/google_containers/mounttest-user:0.3] 1718853} {[registry.access.redhat.com/rhel7.1:latest] 158229901} {[gcr.io/google_containers/mounttest:0.5] 1718853} {[gcr.io/google_containers/pause:2.0] 350164} {[gcr.io/google_containers/porter:cd5cb5791ebaa8641955f0e8c2a9bed669b1eaab] 5010921} {[gcr.io/google_containers/portforwardtester:1.0] 2296329} {[gcr.io/google_containers/volume-iscsi:0.1] 244433376} {[docker.io/openshift/deployment-example:v1] 5769840} {[gcr.io/google_containers/eptest:0.1] 2970692} {[gcr.io/google_containers/serve_hostname:1.1] 4522409} {[gcr.io/google_containers/update-demo:nautilus] 4555533} {[gcr.io/google_containers/nginx:1.7.9] 91641000} {[gcr.io/google_containers/test-webserver:e2e] 4534272} {[gcr.io/google_containers/busybox:latest] 2429728} {[gcr.io/google_containers/liveness:e2e] 4387474}]}}
Apr 18 16:59:49.488: INFO: 
Logging kubelet events for node 172.18.2.57
Apr 18 16:59:49.492: INFO: 
Logging pods the kubelet thinks is on node 172.18.2.57
Apr 18 16:59:49.510: INFO: docker-registry-1-0dmbo started at <nil> (0 container statuses recorded)
Apr 18 16:59:49.510: INFO: scale-up-igonv started at <nil> (0 container statuses recorded)
Apr 18 16:59:49.510: INFO: router-2-rdhzw started at <nil> (0 container statuses recorded)
Apr 18 16:59:49.967: INFO: ERROR kubelet_docker_errors{operation_type="inspect_container"} => 51 @[0]
Apr 18 16:59:49.967: INFO: ERROR kubelet_docker_errors{operation_type="inspect_image"} => 27 @[0]
Apr 18 16:59:49.967: INFO: ERROR kubelet_docker_errors{operation_type="logs"} => 1 @[0]
Apr 18 16:59:49.967: INFO: ERROR kubelet_docker_errors{operation_type="start_container"} => 7 @[0]
Apr 18 16:59:49.967: INFO: ERROR kubelet_docker_errors{operation_type="stop_container"} => 122 @[0]
Apr 18 16:59:49.967: INFO: 
Latency metrics for node 172.18.2.57
Apr 18 16:59:49.967: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:52.893417s}
Apr 18 16:59:49.967: INFO: {Operation:update Method:pod_worker_latency_microseconds Quantile:0.99 Latency:35.010402s}
Apr 18 16:59:49.967: INFO: {Operation:update Method:pod_worker_latency_microseconds Quantile:0.9 Latency:31.860651s}
Apr 18 16:59:49.967: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:30.620567s}
Apr 18 16:59:49.967: INFO: {Operation:create Method:pod_worker_latency_microseconds Quantile:0.99 Latency:29.907848s}
Apr 18 16:59:49.967: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:27.24821s}
Apr 18 16:59:49.967: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:24.684774s}
Apr 18 16:59:49.967: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:23.465448s}
Apr 18 16:59:49.967: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:18.697897s}
Apr 18 16:59:49.967: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:15.649023s}
Apr 18 16:59:49.967: INFO: Unknown output type: . Skipping.
Apr 18 16:59:49.967: INFO: Waiting up to 1m0s for all nodes to be ready
STEP: Destroying namespace "extended-test-forcepull-hbkcx-4v8vi" for this suite.

• Failure in Spec Setup (BeforeEach) [208.877 seconds]
[LocalNode][builds] forcePull should affect pulling builder images
/data/src/github.com/openshift/origin/test/extended/builds/forcepull.go:166
  ForcePull test context   [BeforeEach]
  /data/src/github.com/openshift/origin/test/extended/builds/forcepull.go:164
    ForcePull test case execution
    /data/src/github.com/openshift/origin/test/extended/builds/forcepull.go:162

    Expected error:
        <*errors.errorString | 0xc208d39d10>: {
            s: "Error: image extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder:latest not found",
        }
        Error: image extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder:latest not found
    not to have occurred

    /data/src/github.com/openshift/origin/test/extended/builds/forcepull.go:108
------------------------------

Which correlates to:

time="2016-04-18T20:59:49.343979912Z" level=debug msg="authorizing request" go.version=go1.6 http.request.host="172.30.165.70:5000" http.request.id=136f9565-33ec-484c-9625-7c5cd64ba3e5 http.request.method=GET http.request.remoteaddr="172.18.2.57:44975" http.request.uri="/v2/extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder/manifests/latest" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" instance.id=347e5f58-43de-4ef6-af03-9ff2612525c2 vars.name="extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder" vars.reference=latest 
time="2016-04-18T20:59:49.344662006Z" level=debug msg="Origin auth: checking for access to repository:extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder:pull" go.version=go1.6 http.request.host="172.30.165.70:5000" http.request.id=136f9565-33ec-484c-9625-7c5cd64ba3e5 http.request.method=GET http.request.remoteaddr="172.18.2.57:44975" http.request.uri="/v2/extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder/manifests/latest" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" instance.id=347e5f58-43de-4ef6-af03-9ff2612525c2 vars.name="extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder" vars.reference=latest 
time="2016-04-18T20:59:49.362894642Z" level=debug msg=GetImageManifest go.version=go1.6 http.request.host="172.30.165.70:5000" http.request.id=136f9565-33ec-484c-9625-7c5cd64ba3e5 http.request.method=GET http.request.remoteaddr="172.18.2.57:44975" http.request.uri="/v2/extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder/manifests/latest" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" instance.id=347e5f58-43de-4ef6-af03-9ff2612525c2 vars.name="extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder" vars.reference=latest 
172.18.2.57 - - [18/Apr/2016:20:59:49 +0000] "GET /v2/extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder/manifests/latest HTTP/1.1" 404 339 "" "docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64"
time="2016-04-18T20:59:49.393081944Z" level=error msg="Error getting ImageStreamTag \"latest\": imagestreams \"forcepull-extended-test-builder\" not found" go.version=go1.6 http.request.host="172.30.165.70:5000" http.request.id=136f9565-33ec-484c-9625-7c5cd64ba3e5 http.request.method=GET http.request.remoteaddr="172.18.2.57:44975" http.request.uri="/v2/extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder/manifests/latest" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" instance.id=347e5f58-43de-4ef6-af03-9ff2612525c2 vars.name="extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder" vars.reference=latest 
time="2016-04-18T20:59:49.393259611Z" level=error msg="response completed with error" err.code="MANIFEST_UNKNOWN" err.detail="imagestreams \"forcepull-extended-test-builder\" not found" err.message="manifest unknown" go.version=go1.6 http.request.host="172.30.165.70:5000" http.request.id=136f9565-33ec-484c-9625-7c5cd64ba3e5 http.request.method=GET http.request.remoteaddr="172.18.2.57:44975" http.request.uri="/v2/extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder/manifests/latest" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=51.276079ms http.response.status=404 http.response.written=339 instance.id=347e5f58-43de-4ef6-af03-9ff2612525c2 vars.name="extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder" vars.reference=latest 
172.18.2.57 - - [18/Apr/2016:20:59:49 +0000] "GET /v1/repositories/extended-test-forcepull-hbkcx-4v8vi/forcepull-extended-test-builder/images HTTP/1.1" 404 19 "" "docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64"
smarterclayton commented 8 years ago

This kind of looks like a race condition.

smarterclayton commented 8 years ago

I think this is fixed in https://github.com/openshift/origin/pull/8558; the Docker Hub occasionally returns 401 on import. We need to clean the fix up and make it even more targeted (and discuss whether to preserve this for DTR, which I bet has this problem as well).
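
The shape of such a fix, as a hedged sketch rather than the actual #8558 change: treat an unauthorized error from the hub as potentially transient and retry a bounded number of times with a small backoff. The substring matching on the error text is a simplification.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// importWithRetry retries an import operation when it fails with an
// unauthorized error, on the theory that the hub's 401 is transient.
// Illustrative only, not the real importer code.
func importWithRetry(imp func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = imp(); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "unauthorized") {
			return err // only the spurious-auth case is worth retrying
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return err
}

func main() {
	// Hypothetical import that always fails, to exercise the retry path.
	err := importWithRetry(func() error {
		return errors.New("unauthorized: authentication required")
	}, 3)
	fmt.Println(err)
}
```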

soltysh commented 8 years ago

I'm reopening while we debug the failure in https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_integration/301/consoleFull

soltysh commented 8 years ago

@smarterclayton after careful investigation, the only two operations we do not retry are store.Open (https://github.com/openshift/origin/blob/6f6d94c5d3878924be8c21de9064858a5a5889bf/pkg/dockerregistry/server/pullthroughblobstore.go#L118) and io.CopyN (https://github.com/openshift/origin/blob/6f6d94c5d3878924be8c21de9064858a5a5889bf/pkg/dockerregistry/server/pullthroughblobstore.go#L127) in pullthroughBlobStore. The logs clearly show that io.CopyN is the culprit in the output you've pointed me to, see:

time="2016-04-25T15:36:57.066474262Z" level=info msg="Copying 217 bytes of type \"application/octet-stream\" for \"sha256:7e224052b3b1fc91443295519f43edf302997496c13dd970c122f189485e0178\"" 
time="2016-04-25T15:36:57.090276777Z" level=error msg="Failed copying content from remote store \"sha256:7e224052b3b1fc91443295519f43edf302997496c13dd970c122f189485e0178\": unauthorized: authentication required"

The question is: should we retry? Apparently this is hitting the issue the great @liggitt identified, about hitting a clock-skewed server.
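
If we did retry, we'd probably have to reopen the remote stream rather than resume mid-copy, since a half-consumed reader can't be replayed. A hypothetical sketch (the open callback and the substring error matching are assumptions for illustration, not the pullthroughBlobStore API):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// copyWithRetry buffers each attempt so that nothing reaches the real
// writer until the full read succeeds; a failed attempt can then be
// retried from a freshly opened stream without emitting duplicate bytes.
func copyWithRetry(w io.Writer, open func() (io.ReadCloser, error), n int64, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		var rc io.ReadCloser
		if rc, err = open(); err != nil {
			continue // retry the open itself
		}
		var buf bytes.Buffer
		_, err = io.CopyN(&buf, rc, n)
		rc.Close()
		if err == nil {
			_, err = w.Write(buf.Bytes())
			return err
		}
		if !strings.Contains(err.Error(), "unauthorized") {
			return err
		}
	}
	return err
}

func main() {
	// Toy stand-in for the remote store, always serving a 4-byte blob.
	var out bytes.Buffer
	err := copyWithRetry(&out, func() (io.ReadCloser, error) {
		return io.NopCloser(strings.NewReader("blob")), nil
	}, 4, 3)
	fmt.Println(out.String(), err)
}
```

Buffering is only palatable for small blobs like the 217-byte one above; for large layers it would hold the whole layer in memory, which is presumably part of the discomfort with retrying byte reads mentioned later in the thread.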

smarterclayton commented 8 years ago

What backs the input stream that is returned by open?


soltysh commented 8 years ago

Per this comment, I'm proposing to wait a bit and see what happens and how fast their fix actually goes live. Additionally, talking to @smarterclayton on IRC, he's not too comfortable with us retrying while reading bytes.

soltysh commented 8 years ago

The same problem

time="2016-04-28T08:01:16.159905939Z" level=error msg="Failed copying content from remote store \"sha256:b216856605521bf39a8784690d432d80d9a6ee20323d6c6d9972715f825676f7\": unauthorized: authentication required"

appeared in https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_integration/450/ (the linked PR).

smarterclayton commented 8 years ago

I think this is resolved.

soltysh commented 8 years ago

I wouldn't be that sure, especially given that the issue has only been silent for the past month... I'd love it to be resolved, though...

smarterclayton commented 8 years ago

I added retries to a few other places - I think this is fixed upstream, because I haven't seen retries in the logs when I've been looking at other issues.