[Open] ortz opened this issue 1 year ago
@ortz I understand why this happens for HadoopFS, but not why it would happen for the metadata client with GC. There it should already have been fixed by issue #4189 / PR #4190.
Did you observe GC getting stuck because of this?
I observed this before we abandoned the work on LakeFSOutputCommitter. Like LakeFSFS, the committer would never retry a failed call, and API calls to lakeFS can fail under load.
The solution in LakeFSOutputCommitter was to add a retry wrapper (the same one from #4190, in fact) around calls to the lakeFS API. This solves most of the issue: Spark has a fixed number of executors to use, and each executor writes in a single-threaded manner, so backing off should be enough to succeed.
Here we would ideally also respect the Retry-After header on the 429 response, as a minimum retry interval. In practice such headers are optimized more towards single success than sustained throughput, so they may under-estimate the time to retry; consider adjusting the value in any case.
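For illustration, a minimal sketch of such a retry wrapper (not the actual #4190 code), assuming the OpenAPI-generated ApiException exposes the status code and response headers as the standard generator templates do; the package name, attempt limit, and base delay are placeholders:

```java
import io.lakefs.clients.api.ApiException;  // package name assumed; may differ between SDK versions

import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

// Sketch only: retry a lakeFS API call on HTTP 429, backing off exponentially and
// treating Retry-After (when present) as a minimum wait. Constants are illustrative.
public final class RetryingCall {
    private static final int MAX_ATTEMPTS = 7;
    private static final long BASE_DELAY_MILLIS = 500;

    public static <T> T withRetries(Callable<T> call) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (ApiException e) {                       // thrown by the generated SDK
                if (e.getCode() != 429 || attempt + 1 >= MAX_ATTEMPTS) {
                    throw e;                                 // not a rate limit, or out of attempts
                }
                long backoff = BASE_DELAY_MILLIS << attempt; // exponential backoff
                Thread.sleep(Math.max(backoff, retryAfterMillis(e.getResponseHeaders())));
            }
        }
    }

    // Returns the server's Retry-After hint in milliseconds, or 0 if absent/unparseable.
    private static long retryAfterMillis(Map<String, List<String>> headers) {
        if (headers == null) return 0;
        List<String> values = headers.get("Retry-After");    // case-sensitive lookup, for brevity
        if (values == null || values.isEmpty()) return 0;
        try {
            return Long.parseLong(values.get(0).trim()) * 1000L;  // delta-seconds form only
        } catch (NumberFormatException e) {
            return 0;
        }
    }
}
```

A caller would then wrap each SDK call in withRetries(() -> ...), so that a burst of 429 responses becomes a delay rather than a failed Spark task.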
> @ortz I understand why this happens for HadoopFS, but not why it would happen for the metadata client with GC. There it should already have been fixed by issue #4189 / PR #4190.
> Did you observe GC getting stuck because of this?
I haven't run GC to test it TBH, haven't got to it yet. It's something we'll need to do.
The rate limiting work above is based on first extending the lakeFS OpenAPI spec to return a retry-after (backoff) response, and then adding client support for it. Note that the issue above can also originate from a client timeout: for example, the generated Java SDK uses a 10-30 second per-request timeout by default. This can produce a large number of requests, while lakeFS itself uses the underlying storage SDK, which performs backoff and already supports the header/mechanism we would like to provide to our clients.
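On the timeout point, a hedged sketch of what adjusting the generated Java SDK could look like: the class, setters, and package below are the standard OpenAPI okhttp-generator ones and may differ between SDK versions; the endpoint and credential values are placeholders.

```java
import io.lakefs.clients.api.ApiClient;  // package/class name assumed; may differ between SDK versions

public final class ConfigureClient {
    public static ApiClient build(String endpoint, String accessKey, String secretKey) {
        ApiClient client = new ApiClient();
        client.setBasePath(endpoint);       // e.g. "https://lakefs.example.com/api/v1"
        client.setUsername(accessKey);      // basic-auth credentials for lakeFS
        client.setPassword(secretKey);
        // The generated defaults are on the order of 10-30 seconds per request; raising
        // them gives a rate-limited request time to complete instead of timing out and
        // being re-sent immediately, which only adds to the load.
        client.setConnectTimeout(30_000);   // all values are milliseconds
        client.setReadTimeout(120_000);
        client.setWriteTimeout(120_000);
        return client;
    }
}
```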
This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.
This is still an issue: when the server rate-limits, clients fail outright rather than backing off, which is poor behaviour.
See also #4957 and #5664 (which is blocked on fixing this); I will add others as I find them.
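To make concrete what handling rate limits on the client side could look like, below is a sketch of an okhttp application interceptor that waits and retries on 429 instead of surfacing it, honouring Retry-After as a minimum wait. This is an illustration, not existing lakeFS client code; the attempt limit and backoff constants are arbitrary.

```java
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Response;

// Sketch only: an application interceptor that, instead of surfacing a 429 to the caller,
// waits (honouring Retry-After when present) and re-sends the request.
public final class RetryAfterInterceptor implements Interceptor {
    private static final int MAX_ATTEMPTS = 5;
    private static final long BASE_DELAY_MILLIS = 500;

    @Override
    public Response intercept(Chain chain) throws IOException {
        Response response = chain.proceed(chain.request());
        for (int attempt = 1; response.code() == 429 && attempt < MAX_ATTEMPTS; attempt++) {
            long wait = Math.max(BASE_DELAY_MILLIS << attempt,
                                 retryAfterMillis(response.header("Retry-After")));
            response.close();                 // release the connection before retrying
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while backing off", e);
            }
            response = chain.proceed(chain.request());
        }
        return response;
    }

    // Returns the Retry-After hint in milliseconds, or 0 if absent/unparseable.
    private static long retryAfterMillis(String header) {
        if (header == null) return 0;
        try {
            return Long.parseLong(header.trim()) * 1000L;   // delta-seconds form only
        } catch (NumberFormatException e) {
            return 0;                                       // HTTP-date form not handled here
        }
    }
}
```

If the generated ApiClient exposes a setter for its underlying OkHttpClient, installing such an interceptor there would give every SDK call this behaviour; doing that inside the official clients is essentially what this issue asks for.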
What happened?
Current Behavior: I've launched a lakeFS instance with rate limiting configured, in order to avoid throttling from the backend engine (e.g. DynamoDB) and to put a high-water mark on the infrastructure's throughput. The task fails with ApiException: Too Many Requests produced by the client. Even if the task retries, it retries the same request, so it fails again and eventually the job fails. Example Jupyter notebook I used to test it:
Expected Behavior: I would expect lakeFS clients to handle HTTP status code 429 (and possibly the additional response headers, such as Retry-After) and buffer/hold requests instead of failing.

lakeFS Version
0.90
Deployment
Kubernetes, Helm Chart, DynamoDB
Affected Clients
No response
Relevant logs output
Contact Details
No response