operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.

OCS Ceph buckets timing out #475

Closed: first-operator[bot] closed this issue 2 years ago

first-operator[bot] commented 2 years ago

When accessing a bucket on smaug provisioned using OCS Ceph, we keep getting gateway timeouts.

Checking ceph health in the tools pod shows:

[screenshot: `ceph health` output from the tools pod]

Slack discussion:

> **Humair Khan wrote:** Harshad, CI seems to be timing out
>
> **Pep Turró Mauri wrote:** Harshad is traveling today (conference)... I did notice that and was trying to take a look
>
> **Pep Turró Mauri wrote:** but I'm not Harshad... so it's still not fixed
>
> **Pep Turró Mauri wrote:** what I've seen so far: pods get stuck at one of their init containers, `initupload`. That container is supposed to upload pod init logs to s3
>
> **Pep Turró Mauri wrote:** is there any issue with ceph or something that could be causing it to get stuck?
>
> **Pep Turró Mauri wrote:** e.g. found this in the logs of one of the recent pods:
> ```
> {"component":"initupload","dest":"pr-logs/pull/thoth-station_user-api/1507/pre-commit/latest-build.txt","file":"prow/pod-utils/gcs/upload.go:112","func":"","level":"info","msg":"Failed upload","severity":"info","time":"2021-11-11T18:11:01Z"}
> ```
>
> **Pep Turró Mauri wrote:** config for this is `INITUPLOAD_OPTIONS={"bucket":"",...` ... can you quickly verify the status of that bucket?
>
> **Pep Turró Mauri wrote:** I get a 504 if I try to browse via web, but I'm not sure if that's expected
>
> **Pep Turró Mauri wrote:** I guess not... but I'm not familiar with that ceph env, and some proper verification would help
>
> **Humair Khan wrote:** sorry, was afk, reading up
>
> **Harshad wrote:** the ceph env is hooked up to two components: hooks and tide
>
> **Harshad wrote:** let's check up on those two
>
> **Harshad wrote:** I will check it in a few mins
>
> **Humair Khan wrote:** haven't seen anyone report any ceph issues yet
>
> **Humair Khan wrote:** maybe try connecting to it manually?
>
> **Harshad wrote:** btw Humair, when you say it's timing out, you are referring to the job timing out, right?
>
> **Humair Khan wrote:** all I saw was the jobs fail, and when I clicked `details` it gave me a 504
>
> **Humair Khan wrote:** didn't investigate further
>
> **Humair Khan wrote:** e.g. clicking details results in a 504 page
>
> **Pep Turró Mauri wrote:** the 504 from the web UI (deck) also points to ceph, it seems. This is in the logs:
> ```
> {"component":"deck","error":"timed out: error accessing GCS artifact: blob (code=Unknown): RequestCanceled: request context canceled\ncaused by: context canceled","file":"prow/spyglass/artifacts.go:61","func":"","level":"warning","msg":"error retrieving artifact names from gcs storage","severity":"warning","time":"2021-11-11T18:58:32Z"}
> ```
>
> **Pep Turró Mauri wrote:** I also get 504s when trying to list using an s3 client
>
> **Harshad wrote:** good point Pep, I can't access the s3 bucket from the CLI either
>
> **Harshad wrote:**
> ```
> aws s3 --endpoint --profile smaug-thoth ls
>
> An error occurred (504) when calling the ListObjectsV2 operation (reached max retries: 4): Gateway Timeout
> ```
> Humair Khan there is a gateway timeout
>
> **Harshad wrote:** can you check on that
>
> **Humair Khan wrote:** okay, will check in a bit
>
> **Harshad wrote:** so to give more details
>
> **Harshad wrote:** the bucket can be seen, but the data in it can't be reached
>
> **Humair Khan wrote:** this is the `ci-prow` bucket on smaug?
>
> **Harshad wrote:**
> ```
> hnalla@workstation ~ $ aws s3 --endpoint --profile smaug-prow ls s3://
> 2021-11-11 13:39:41 ci-prow
>
> hnalla@workstation ~ $ aws s3 --endpoint --profile smaug-prow ls
>
> An error occurred (504) when calling the ListObjectsV2 operation (reached max retries: 4): Gateway Timeout
> ```
>
> **Harshad wrote:** yes, that is the bucket
>
> **Harshad wrote:** the details: the UI is working for prow, however it will start to show 504 errors, because one of the components that needs to connect to the s3 data is unable to do so; that is the reason for the disruption
>
> **Humair Khan wrote:** yep, also confirmed, can't list objects in this bucket
>
> **Humair Khan wrote:** going to try other buckets
>
> **Harshad wrote:** sounds good
>
> **Humair Khan wrote:** same deal
>
> **Humair Khan wrote:** time to call batman
>
> **Humair Khan wrote:** larsks -- having some issues connecting to ceph s3 buckets on smaug
>
> **Humair Khan wrote:** atm I've tried 2 different buckets and it seems to time out
>
> **Humair Khan wrote:** hrm...
>
> **Humair Khan wrote:** doesn't seem good
>
> **Humair Khan wrote:** lol
>
> **larsks wrote:** Humair Khan can you open an issue with details? I will take a look shortly.
>
> _Transcript of Slack thread: https://operatefirst.slack.com/archives/C01RT0S2WKX/p1636652679228400?thread_ts=1636652679.228400&cid=C01RT0S2WKX_
HumairAK commented 2 years ago

@larsks any ideas?

larsks commented 2 years ago

@HumairAK you can't run `ceph health detail` by itself because it won't use the correct credentials. You need to pass the `--id` flag. Assuming that `/etc/ceph/ceph.client.provisioner-moc-rbd-1.keyring` exists, you can run:

sh-4.4$ ceph --id provisioner-moc-rbd-1 health
HEALTH_WARN 5 clients failing to respond to cache pressure; 11 nearfull osd(s); 1 pool(s) full; 23 pool(s) nearfull; 1 pgs not deep-scrubbed in time; 2 pgs not scrubbed in time

Or:

sh-4.4$ ceph --id provisioner-moc-rbd-1 health detail
HEALTH_WARN 5 clients failing to respond to cache pressure; 11 nearfull osd(s); 1 pool(s) full; 23 pool(s) nearfull; 1 pgs not deep-scrubbed in time; 2 pgs not scrubbed in time
MDS_CLIENT_RECALL 5 clients failing to respond to cache pressure
    mdsmds02(mds.0): Client holy-es-dev08.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22297261
    mdsmds02(mds.0): Client holy-es-dev06.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22336885
    mdsmds02(mds.0): Client holy-es-dev01.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22336890
    mdsmds02(mds.0): Client holy-es-dev05.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22336900
    mdsmds02(mds.0): Client holy-es-dev07.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22377899
OSD_NEARFULL 11 nearfull osd(s)
    osd.24 is near full
    osd.262 is near full
    osd.430 is near full
    osd.531 is near full
    osd.559 is near full
    osd.623 is near full
    osd.717 is near full
    osd.731 is near full
    osd.942 is near full
    osd.1437 is near full
    osd.1710 is near full
POOL_FULL 1 pool(s) full
    ...
POOL_NEARFULL 23 pool(s) nearfull
    ...
    pool 'moc_rbd_1' is nearfull
PG_NOT_DEEP_SCRUBBED 1 pgs not deep-scrubbed in time
    pg 16.b2f not deep-scrubbed since 2021-07-24 19:33:45.513409
PG_NOT_SCRUBBED 2 pgs not scrubbed in time
    pg 16.b2f not scrubbed since 2021-08-06 21:58:40.829489
    pg 8.1914 not scrubbed since 2021-10-13 04:09:26.367731
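
For context on the nearfull warnings above, per-pool utilization can be checked with `ceph df`; a sketch of that follow-up, assuming the same `--id` has the caps for it (output elided):

sh-4.4$ ceph --id provisioner-moc-rbd-1 df
...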
larsks commented 2 years ago

I can successfully list, create, and delete rbds:

sh-4.4$ rbd --id provisioner-moc-rbd-1 --pool moc_rbd_1 ls
...
sh-4.4$ rbd --id provisioner-moc-rbd-1 --pool moc_rbd_1 create lars-test-image --size 10G
sh-4.4$ rbd --id provisioner-moc-rbd-1 --pool moc_rbd_1 rm lars-test-image
Removing image: 100% complete...done.
larsks commented 2 years ago

It looks like the problem is only with OBCs; I was able to create a PVC and it was bound to a PV immediately.
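
A minimal sketch of the kind of PVC used for such a check; the storage class name is an assumption (the usual OCS RBD class) and is not taken from this issue:

# storageClassName below is assumed; adjust to the cluster's RBD class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd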

larsks commented 2 years ago

I am able to create OBCs, but it takes them about 60 seconds to bind. I'm going to open a support case to see if we can find someone who knows about Noobaa, but generally it seems to be working.
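
A first place one might look when debugging slow binds on the Noobaa side (a sketch, assuming access to the openshift-storage namespace; these commands are not taken from this thread):

$ oc -n openshift-storage get noobaa
$ oc -n openshift-storage get pods -l app=noobaa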

larsks commented 2 years ago

Starting with this manifest:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: test-obc
spec:
  generateBucketName: test-obc-
  storageClassName: openshift-storage.noobaa.io

I can create it:

$ oc -n default apply -f test-obc.yaml
objectbucketclaim.objectbucket.io/test-obc created

After about 60 seconds it becomes Bound:

$ oc -n default get obc
NAME       STORAGE-CLASS                 PHASE   AGE
test-obc   openshift-storage.noobaa.io   Bound   55s

Attempting to list the contents of the (empty) obc is slow but does complete:

$  mcli alias set obc https://s3-openshift-storage.apps.smaug.na.operate-first.cloud ...
$ time mcli ls obc/test-obc--8499de20-fcb5-481f-966b-72c7e6146a54/
[2021-11-11 16:17:29 EST]   174B test-obc.yaml

real    0m11.610s
user    0m0.043s
sys     0m0.013s

But attempting to copy a file to the bucket gets stuck:

$ mcli cp test-obc.yaml obc/test-obc--8499de20-fcb5-481f-966b-72c7e6146a54/
mcli: <ERROR> Failed to copy `tmp/k8s/test-obc.yaml`. 504 Gateway Time-out
larsks commented 2 years ago

@HumairAK while I am filling out a support case, can you please verify that this problem is not affecting PVs at all? Just a simple test that binds a PV, writes to it, reads it back, etc.
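
A sketch of what such a check could look like, using an illustrative pod and PVC (names like pvc-test and test-pvc are hypothetical, not from this issue):

apiVersion: v1
kind: Pod
metadata:
  name: pvc-test
spec:
  containers:
    - name: pvc-test
      image: registry.access.redhat.com/ubi8/ubi
      command: ["sleep", "inf"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      # claimName is illustrative; any RWO PVC backed by OCS RBD would do
      persistentVolumeClaim:
        claimName: test-pvc

$ oc -n default apply -f pvc-test.yaml
$ oc -n default rsh pvc-test
sh-4.4# echo hello > /data/test-file && cat /data/test-file
hello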

HumairAK commented 2 years ago

@larsks tested, pv/pvc binding read/write works fine.

larsks commented 2 years ago

I was attempting to reproduce the problem in order to open a case, but it looks as if the problem has resolved itself. Just now I was able to create an OBC, it bound in a matter of seconds, and I have no problem writing to it/reading from it.

Starting with this manifest:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: test-obc
spec:
  generateBucketName: test-obc-
  storageClassName: openshift-storage.noobaa.io

I can create a new OBC:

$ oc apply -f test-obc.yaml
$ time sh -c 'until oc get obc test-obc | grep -q Bound; do sleep 1; done'
real    0m3.317s
user    0m0.427s
sys     0m0.096s

From a pod running in the same namespace...

$ oc -n default run awscli --image docker.io/amazon/aws-cli \
  --command -- sleep inf
$ oc rsh awscli

...I can see the resulting bucket:

sh-4.2# export AWS_SECRET_ACCESS_KEY=...
sh-4.2# export AWS_ACCESS_KEY_ID=...
sh-4.2# aws --ca-bundle /run/secrets/kubernetes.io/serviceaccount/service-ca.crt \
  --endpoint-url https://s3.openshift-storage.svc \
  s3api list-buckets
{
    "Buckets": [
        {
            "Name": "test-obc--37e93c0e-4044-4510-80ff-46b563dd1313",
            "CreationDate": "2021-11-12T03:24:14+00:00"
        }
    ],
    "Owner": {
        "DisplayName": "NooBaa",
        "ID": "123"
    }
}

I can upload an object:

sh-4.2# aws --ca-bundle /run/secrets/kubernetes.io/serviceaccount/service-ca.crt \
  --endpoint-url https://s3.openshift-storage.svc s3api \
  put-object --bucket test-obc--37e93c0e-4044-4510-80ff-46b563dd1313 \
  --key test-object --body /etc/bashrc
{
    "ETag": "\"3f48a33cc1fce59ff2df86429151c0e0\""
}

I can download the object:

sh-4.2# aws --ca-bundle /run/secrets/kubernetes.io/serviceaccount/service-ca.crt \
  --endpoint-url https://s3.openshift-storage.svc s3api \
  get-object --bucket test-obc--37e93c0e-4044-4510-80ff-46b563dd1313 \
  --key test-object test-object
{
    "AcceptRanges": "bytes",
    "LastModified": "2021-11-12T03:26:58+00:00",
    "ContentLength": 2853,
    "ETag": "\"3f48a33cc1fce59ff2df86429151c0e0\"",
    "ContentType": "application/octet-stream",
    "Metadata": {}
}
sh-4.2# cat test-object
# /etc/bashrc
...

I'm going to close the support case for now, because without a reproducible problem this is going to be hard to resolve.

larsks commented 2 years ago

See https://github.com/CCI-MOC/pvc-obc-example for a deployable example if you want to try the steps from the previous comment.
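
For reference, the AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY values exported in the earlier comment come from the Secret that the OBC controller creates alongside the claim (a ConfigMap with the same name carries the bucket name and endpoint details); a sketch of reading them, assuming the test-obc claim in the default namespace:

$ oc -n default get configmap test-obc -o jsonpath='{.data.BUCKET_NAME}'
$ oc -n default get secret test-obc -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d
$ oc -n default get secret test-obc -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d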

HumairAK commented 2 years ago

@harshad16 @codificat can you confirm that there are no more issues with the s3 buckets on your end?

harshad16 commented 2 years ago

I confirm there is no issue as of now with the s3 bucket. Thanks, closing the issue.