vdaas / vald

Vald. A Highly Scalable Distributed Vector Search Engine
https://vald.vdaas.org
Apache License 2.0

Cannot retry downloading S3 backup data when Agent-NGT data load times out #581

Open rinx opened 4 years ago

rinx commented 4 years ago

related to #503, #556

Describe the issue:

currently, vald-agent-ngt pods have these containers:

agent-sidecar in initContainer mode may fail to finish downloading the backup data and still return status code 0 (an RST stream from the remote host causes this). In that case, fragments of backup data may remain in the volume and block NGT startup (#503). Ideally, a pod in this state would retry downloading the backup data. However, a failing container status doesn't trigger a pod restart.

If the pods had a liveness probe server, it could trigger pod restarts. However, agent-NGT has a postStop phase (executed after the liveness probe kills the container) to save the index, and agent-sidecar has a postStop phase to upload the index. So internal/servers/server needs to be improved to handle these problems.

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the label type/bug to this issue, with a confidence of 0.88. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!
