@larsks any ideas?
@HumairAK you can't run `ceph health detail` by itself because it won't use the correct credentials. You need to pass the `--id` flag. Assuming that `/etc/ceph/ceph.client.provisioner-moc-rbd-1.keyring` exists, you can run:
sh-4.4$ ceph --id provisioner-moc-rbd-1 health
HEALTH_WARN 5 clients failing to respond to cache pressure; 11 nearfull osd(s); 1 pool(s) full; 23 pool(s) nearfull; 1 pgs not deep-scrubbed in time; 2 pgs not scrubbed in time
Or:
sh-4.4$ ceph --id provisioner-moc-rbd-1 health detail
HEALTH_WARN 5 clients failing to respond to cache pressure; 11 nearfull osd(s); 1 pool(s) full; 23 pool(s) nearfull; 1 pgs not deep-scrubbed in time; 2 pgs not scrubbed in time
MDS_CLIENT_RECALL 5 clients failing to respond to cache pressure
mdsmds02(mds.0): Client holy-es-dev08.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22297261
mdsmds02(mds.0): Client holy-es-dev06.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22336885
mdsmds02(mds.0): Client holy-es-dev01.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22336890
mdsmds02(mds.0): Client holy-es-dev05.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22336900
mdsmds02(mds.0): Client holy-es-dev07.rc.fas.harvard.edu:tata failing to respond to cache pressure client_id: 22377899
OSD_NEARFULL 11 nearfull osd(s)
osd.24 is near full
osd.262 is near full
osd.430 is near full
osd.531 is near full
osd.559 is near full
osd.623 is near full
osd.717 is near full
osd.731 is near full
osd.942 is near full
osd.1437 is near full
osd.1710 is near full
POOL_FULL 1 pool(s) full
...
POOL_NEARFULL 23 pool(s) nearfull
...
pool 'moc_rbd_1' is nearfull
PG_NOT_DEEP_SCRUBBED 1 pgs not deep-scrubbed in time
pg 16.b2f not deep-scrubbed since 2021-07-24 19:33:45.513409
PG_NOT_SCRUBBED 2 pgs not scrubbed in time
pg 16.b2f not scrubbed since 2021-08-06 21:58:40.829489
pg 8.1914 not scrubbed since 2021-10-13 04:09:26.367731
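As an aside, if the keyring lived somewhere other than the default location, the same command should work by pointing at it explicitly with `--keyring`; a sketch, using the path assumed above:

```sh
# --id picks the client name, --keyring points at an explicit keyring file
# (the path here is the one assumed above; adjust as needed).
ceph --id provisioner-moc-rbd-1 \
     --keyring /etc/ceph/ceph.client.provisioner-moc-rbd-1.keyring \
     health
```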
I can successfully list, create, and delete rbds:
sh-4.4$ rbd --id provisioner-moc-rbd-1 --pool moc_rbd_1 ls
...
sh-4.4$ rbd --id provisioner-moc-rbd-1 --pool moc_rbd_1 create lars-test-image --size 10G
sh-4.4$ rbd --id provisioner-moc-rbd-1 --pool moc_rbd_1 rm lars-test-image
Removing image: 100% complete...done.
It looks like the problem is only with OBCs; I was able to create a PVC and it was bound to a PV immediately.
I am able to create OBCs, but it takes them about 60 seconds to bind. I'm going to open a support case to see if we can find someone who knows about Noobaa, but generally it seems to be working.
Starting with this manifest:
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: test-obc
spec:
  generateBucketName: test-obc-
  storageClassName: openshift-storage.noobaa.io
I can create it:
$ oc -n default apply -f test-obc.yaml
objectbucketclaim.objectbucket.io/test-obc created
After about 60 seconds it becomes `Bound`:
$ oc -n default get obc
NAME STORAGE-CLASS PHASE AGE
test-obc openshift-storage.noobaa.io Bound 55s
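If you want to watch that transition instead of polling, this should work:

```sh
# Watch the OBC until its phase changes (Ctrl-C to stop).
oc -n default get obc test-obc -w
```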
Attempting to list the contents of the (empty) OBC is slow, but it does complete:
$ mcli alias set obc https://s3-openshift-storage.apps.smaug.na.operate-first.cloud ...
$ time mcli ls obc/test-obc--8499de20-fcb5-481f-966b-72c7e6146a54/
[2021-11-11 16:17:29 EST] 174B test-obc.yaml
real 0m11.610s
user 0m0.043s
sys 0m0.013s
But attempting to copy a file to the bucket gets stuck:
$ mcli cp test-obc.yaml obc/test-obc--8499de20-fcb5-481f-966b-72c7e6146a54/
mcli: <ERROR> Failed to copy `tmp/k8s/test-obc.yaml`. 504 Gateway Time-out
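When that happens, a first thing worth checking is whether the NooBaa components behind that endpoint look healthy. The namespace below is the usual OCS one, and the route name `s3` is a guess based on the `s3-openshift-storage` hostname used above; both are assumptions:

```sh
# Rough health check of the NooBaa pods and the S3 route serving the endpoint.
oc -n openshift-storage get pods | grep -i noobaa
oc -n openshift-storage get route s3
```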
@HumairAK while I am filling out a support case, can you please verify that this problem is not affecting PVs at all? Just a simple test that binds a PV, writes to it, reads it back, etc.
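Something along these lines would do; the storage class name below is an assumption and should be swapped for whatever RBD class is actually in use on this cluster:

```sh
# A minimal PVC read/write check. The storage class name is an assumption;
# substitute the RBD storage class actually used here.
cat <<'EOF' | oc -n default apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-external-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pvc-pod
spec:
  containers:
  - name: test
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "inf"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

# Once the pod is Running, write a file and read it back.
oc -n default exec test-pvc-pod -- sh -c 'echo hello > /data/test && cat /data/test'
```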
@larsks tested, pv/pvc binding read/write works fine.
I was attempting to reproduce the problem in order to open a case, but it looks as if the problem has resolved itself. Just now I was able to create an OBC, it bound in a matter of seconds, and I have no problem writing to it/reading from it.
Starting with this manifest:
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: test-obc
spec:
  generateBucketName: test-obc-
  storageClassName: openshift-storage.noobaa.io
I can create a new OBC:
$ oc apply -f test-obc.yaml
$ time sh -c 'until oc get obc test-obc | grep -q Bound; do sleep 1; done'
real 0m3.317s
user 0m0.427s
sys 0m0.096s
From a pod running in the same namespace...
$ oc -n default run awscli --image docker.io/amazon/aws-cli \
--command -- sleep inf
$ oc rsh awscli
...I can see the resulting bucket:
sh-4.2# export AWS_SECRET_ACCESS_KEY=...
sh-4.2# export AWS_ACCESS_KEY_ID=...
sh-4.2# aws --ca-bundle /run/secrets/kubernetes.io/serviceaccount/service-ca.crt \
--endpoint-url https://s3.openshift-storage.svc \
s3api list-buckets
{
    "Buckets": [
        {
            "Name": "test-obc--37e93c0e-4044-4510-80ff-46b563dd1313",
            "CreationDate": "2021-11-12T03:24:14+00:00"
        }
    ],
    "Owner": {
        "DisplayName": "NooBaa",
        "ID": "123"
    }
}
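For reference, the credentials exported above come from the Secret that the OBC provisioner creates with the same name as the claim (the generated bucket name similarly lands in a ConfigMap of the same name); a sketch, using the names from the manifest above:

```sh
# Pull the S3 credentials and bucket name generated for the OBC.
oc -n default get secret test-obc \
  -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d; echo
oc -n default get secret test-obc \
  -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d; echo
oc -n default get configmap test-obc -o jsonpath='{.data.BUCKET_NAME}'; echo
```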
I can upload an object:
sh-4.2# aws --ca-bundle /run/secrets/kubernetes.io/serviceaccount/service-ca.crt \
--endpoint-url https://s3.openshift-storage.svc s3api \
put-object --bucket test-obc--37e93c0e-4044-4510-80ff-46b563dd1313 \
--key test-object --body /etc/bashrc
{
    "ETag": "\"3f48a33cc1fce59ff2df86429151c0e0\""
}
I can download the object:
sh-4.2# aws --ca-bundle /run/secrets/kubernetes.io/serviceaccount/service-ca.crt \
--endpoint-url https://s3.openshift-storage.svc s3api \
get-object --bucket test-obc--37e93c0e-4044-4510-80ff-46b563dd1313 \
--key test-object test-object
{
    "AcceptRanges": "bytes",
    "LastModified": "2021-11-12T03:26:58+00:00",
    "ContentLength": 2853,
    "ETag": "\"3f48a33cc1fce59ff2df86429151c0e0\"",
    "ContentType": "application/octet-stream",
    "Metadata": {}
}
sh-4.2# cat test-object
# /etc/bashrc
...
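Cleaning up afterwards is just a matter of deleting the test object and the claim; deleting the OBC should also remove the generated bucket and its Secret/ConfigMap (reclaim policy permitting):

```sh
# Run the delete-object from the awscli pod, as with the put/get above.
aws --ca-bundle /run/secrets/kubernetes.io/serviceaccount/service-ca.crt \
    --endpoint-url https://s3.openshift-storage.svc s3api \
    delete-object --bucket test-obc--37e93c0e-4044-4510-80ff-46b563dd1313 \
    --key test-object

# Then delete the claim from wherever you have cluster access.
oc -n default delete obc test-obc
```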
I'm going to close the support case for now, because without a reproducible problem this is going to be hard to resolve.
See https://github.com/CCI-MOC/pvc-obc-example for a deployable example if you want to try the steps from the previous comment.
@harshad16 @codificat can you confirm that there are no more issues with the S3 buckets on your end?
I confirm there is no issue as of now with the S3 bucket. Thanks, closing the issue.
When accessing a bucket on smaug provisioned using OCS Ceph, we keep getting gateway timeouts.
Checking `ceph health` in the tools pod shows:
Slack discussion
> **Humair Khan wrote:**
> Harshad ci seems to be timing out
>
> **Pep Turró Mauri wrote:**
> Harshad is traveling today (conference)... I did notice that and was trying to take a look
>
> **Pep Turró Mauri wrote:**
> but I'm not Harshad... so it's still not fixed
>
> **Pep Turró Mauri wrote:**
> what I've seen so far: pods get stuck at one of their init containers, `initupload`. That container is supposed to upload pod init logs to s3
>
> **Pep Turró Mauri wrote:**
> is there any issue with ceph or something that could be causing it to get stuck?
>
> **Pep Turró Mauri wrote:**
> e.g. found this in the logs of one of the recent pods:
> ```
> {"component":"initupload","dest":"pr-logs/pull/thoth-station_user-api/1507/pre-commit/latest-build.txt","file":"prow/pod-utils/gcs/upload.go:112","func":"
> ```