open-io / oio-sds

High Performance Software-Defined Object Storage for Big Data and AI, supporting Amazon S3 and OpenStack Swift
https://www.openio.io

503 when trying to upload with EC storage policy #2112

Open · M-Pixel opened this issue 3 years ago

M-Pixel commented 3 years ago
ISSUE TYPE
COMPONENT NAME

oioswift (maybe?)

SDS VERSION
openio 7.0.1
CONFIGURATION
# OpenIO managed
[OPENIO]
# endpoints
conscience=10.147.19.4:6000
zookeeper=10.147.19.2:6005,10.147.19.3:6005,10.147.19.4:6005
proxy=10.147.19.2:6006
event-agent=beanstalk://10.147.19.2:6014
ecd=10.147.19.2:6017

udp_allowed=yes

ns.meta1_digits=2
ns.storage_policy=ECLIBEC144D1
ns.chunk_size=104857600
ns.service_update_policy=meta2=KEEP|3|1|;rdir=KEEP|1|1|;

iam.connection=redis+sentinel://10.147.19.2:6012,10.147.19.3:6012,10.147.19.4:6012?sentinel_name=OPENIO-master-1
container_hierarchy.connection=redis+sentinel://10.147.19.2:6012,10.147.19.3:6012,10.147.19.4:6012?sentinel_name=OPENIO-master-1
bucket_db.connection=redis+sentinel://10.147.19.2:6012,10.147.19.3:6012,10.147.19.4:6012?sentinel_name=OPENIO-master-1

sqliterepo.repo.soft_max=1000
sqliterepo.repo.hard_max=1000
sqliterepo.cache.kbytes_per_db=4096
OS / ENVIRONMENT
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
SUMMARY

When trying to upload a file that is large enough to match an erasure-code storage policy, a 503 error is returned.

I confirmed that files small enough to match a simple (replication) storage policy do not hit the same issue, and that the problem is not specific to a particular EC implementation (ISA-L vs libEC).

I could not find any useful diagnostic information in the logs I thought might be relevant. Before going through the effort of providing an exact reproduction, I was hoping to get guidance on how to pinpoint the problem more precisely.

I do have one suspicion: my cluster has only 3 rawx services at the moment. Perhaps the inability to locate 6 unique rawx services for ECLIBEC63D1's 6 data chunks causes a timeout that results in the 503?
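
For reference, the number of registered rawx services can be checked with something like the following (a rough sketch, assuming the standard openio CLI is installed and the OPENIO namespace from the configuration above):

OIO_NS=OPENIO openio cluster list rawx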

fvennetier commented 3 years ago

Hello. For ECLIBEC63D1 you need 9 rawx services (6 data + 3 parity chunks). The internal load balancer will not send 2 chunks to the same service. Look for "no service polled from" in the logs of oioswift-proxy-server, oio-proxy or oio-meta2-server to confirm.
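
For example, something along these lines (a sketch assuming the default log layout under /var/log/oio/sds and the OPENIO namespace; exact paths may differ on your deployment):

grep "no service polled from" /var/log/oio/sds/OPENIO/*/*.log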

M-Pixel commented 3 years ago

This can indeed be found in my /var/log/oio/sds/OPENIO/oioproxy-0/oioproxy-0.log file:

warning  1341 42D4 log WRN oio.core no service polled from [rawx], 3/9 services polled

In retrospect, the warning makes sense, however it could definitely be more clear ("Request for 9 services from pool [rawx] could not be fulfilled because only 3 services currently exist in that pool"). In addition to the warning, I would hope that at least one of the services would log an error. It does make sense that this particular one is a warning (client asks for impossible thing != server malfunction). However, by comparison, when I had configured a non-existent storage policy, I did get an explicit error logged from one of the services (I don't remember which).