rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

HEALTH_WARN 1 large omap objects #2221

Closed dimm0 closed 5 years ago

dimm0 commented 5 years ago

Is this a bug report or feature request? Bug Report

Deviation from expected behavior: When creating a large S3 bucket, I see this warning in the cluster health:

HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
    1 large objects found in pool 'rooks3.rgw.buckets.index'
    Search the cluster log for 'Large omap object found' for more details.
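
To pinpoint which index object tripped the warning, a couple of read-only commands from the toolbox should help (a sketch; the pool name comes from the output above, and the message may also appear in the OSD logs):

    # Show which pool/PG the warning refers to
    ceph health detail
    # Search the recent cluster log for the offending object
    ceph log last 1000 | grep -i 'large omap object'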

Expected behavior: I expect the dynamic resharding feature to take care of sharding the bucket index.

https://tracker.ceph.com/issues/24457

http://docs.ceph.com/docs/mimic/radosgw/dynamicresharding/ mentions the "rgw_dynamic_resharding" option, which is set to true by default. Where can I find it?
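
One way to check what the running RGW actually has (a sketch; the instance name and socket path are placeholders, and the first command assumes a Mimic or newer cluster):

    # From the toolbox: query the centralized config store
    ceph config get client.rgw rgw_dynamic_resharding
    # Or exec into the RGW pod and ask the daemon over its admin socket
    ceph --admin-daemon /var/run/ceph/ceph-client.rgw.<instance>.asok config get rgw_dynamic_resharding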

How to reproduce it (minimal and precise): Create a bucket with a couple of million objects / several TB of data
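
A rough reproduction sketch with the AWS CLI (endpoint, credentials, and bucket name are placeholders; any S3 client works): the point is simply a single bucket with enough objects to push one index shard past the omap warning threshold.

    # Point the CLI at the Rook object store service (placeholder values)
    export AWS_ACCESS_KEY_ID=<access-key> AWS_SECRET_ACCESS_KEY=<secret-key>
    ENDPOINT=http://rook-ceph-rgw-rooks3.rook-ceph.svc
    aws --endpoint-url "$ENDPOINT" s3 mb s3://big-bucket
    # Upload a few million small objects (slow; a parallel uploader is more realistic)
    for i in $(seq 1 2000000); do
        aws --endpoint-url "$ENDPOINT" s3 cp ./1kb.bin s3://big-bucket/obj-$i
    done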

Environment:

dimm0 commented 5 years ago

Already have 2 large omap objects

travisn commented 5 years ago

@dimm0 Rook doesn't currently do anything to configure resharding, but according to the docs resharding is enabled by default. Perhaps you need to add the bucket to the resharding queue with the command from the toolbox?

    radosgw-admin reshard add --bucket <bucket_name> --num-shards <new number of shards>
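
For reference, the commands around that step, all from the toolbox (bucket name and shard count are placeholders):

    # Check the bucket's current size (object count)
    radosgw-admin bucket stats --bucket=<bucket_name>
    # Queue the reshard and watch the queue
    radosgw-admin reshard add --bucket=<bucket_name> --num-shards=<new_shard_count>
    radosgw-admin reshard list
    # Process the queue now instead of waiting, then check the result
    radosgw-admin reshard process
    radosgw-admin reshard status --bucket=<bucket_name>
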
jjgraham commented 5 years ago

Now fighting:

LARGE_OMAP_OBJECTS 21 large omap objects
    21 large objects found in pool 'rooks3.rgw.buckets.index'
    Search the cluster log for 'Large omap object found' for more details.

dimm0 commented 5 years ago

radosgw-admin reshard was done on the large buckets; radosgw-admin bucket limit check is now happy for all buckets
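
For anyone else digging through this, the limit check referenced here reports each bucket's index fill level, so over-full indexes stand out (run from the toolbox; the uid is a placeholder):

    # Report index fill level for every bucket
    radosgw-admin bucket limit check
    # Or just the buckets owned by one user
    radosgw-admin bucket limit check --uid=<user-id>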

dimm0 commented 5 years ago

https://www.spinics.net/lists/ceph-users/msg49054.html

Seems like this is the fix. Is it something you could do in Rook? Everybody with large buckets will hit this problem..

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

dimm0 commented 5 years ago

Still suffering from it. Cluster is never 100% healthy

dimm0 commented 5 years ago

I see a command to deal with those in 12.2.11 https://ceph.com/releases/v12-2-11-luminous-released/

There have been fixes to RGW dynamic and manual resharding, which no longer
leaves behind stale bucket instances to be removed manually. For finding and
cleaning up older instances from a reshard a radosgw-admin command reshard
stale-instances list and reshard stale-instances rm should do the necessary
cleanup.

Any way to get that version? The Rook toolbox pod doesn't seem to have this command
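
For what it's worth, on Rook releases where the Ceph image is set on the CephCluster CR (v0.9 and later), picking up a newer point release is just an image bump; a sketch assuming the default cluster name and namespace:

    # Point the cluster at a Ceph image that ships the newer radosgw-admin;
    # the operator rolls the daemons to that version.
    kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
      -p '{"spec":{"cephVersion":{"image":"ceph/ceph:v13.2.5"}}}'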

dimm0 commented 5 years ago

Having 160 such objects now

dimm0 commented 5 years ago

Fixed in ceph 13.2.5

radosgw-admin reshard stale-instances rm
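
The full sequence from the toolbox, listing first so you can see what will be removed, plus a way to nudge the health check to re-evaluate afterwards (the PG id is a placeholder):

    # List stale bucket instances left behind by earlier reshards, then remove them
    radosgw-admin reshard stale-instances list
    radosgw-admin reshard stale-instances rm
    # The large-omap check runs during deep scrub, so re-scrub the index PGs to clear it
    ceph pg deep-scrub <pg-id>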

qrpike commented 2 years ago

We also hit this same problem, and it turned out to be failed multipart uploads. By default there is no lifecycle policy on S3 buckets, so if you upload 999GB of a 1TB file and it fails, the multipart chunk data sticks around forever until you clean it up manually or add a lifecycle policy.

Each multipart chunk is listed in the omap index. However, the sharding logic does not count these entries, on the assumption that the chunks will eventually be cleaned up or rolled into a completed object, so the index does not get more shards.

Adding lifecycle policies and doing a deep scrub solved the issue.
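
To make the lifecycle part concrete, here is a sketch of a rule that aborts incomplete multipart uploads after a few days, applied with the AWS CLI against the RGW endpoint (endpoint and bucket name are placeholders):

    # List uploads that are sitting around incomplete
    aws --endpoint-url http://<rgw-endpoint> s3api list-multipart-uploads --bucket <bucket_name>
    # Abort anything still incomplete 3 days after it was started
    aws --endpoint-url http://<rgw-endpoint> s3api put-bucket-lifecycle-configuration \
      --bucket <bucket_name> \
      --lifecycle-configuration '{"Rules":[{"ID":"abort-incomplete-multipart","Status":"Enabled","Filter":{"Prefix":""},"AbortIncompleteMultipartUpload":{"DaysAfterInitiation":3}}]}'

Lifecycle support varies by RGW release, so it is worth re-running the list command after the rule has had a chance to run.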