Closed tiborsimko closed 2 years ago
After investigation, I found that you can define custom limits for each endpoint in invenio-app using RATELIMIT_PER_ENDPOINT (details).
If I understood correctly, our main goal is to prevent cluster overload. In this case, the above method will not help. invenio-app stores rate information per key, and its default key is based on IP address + User-Agent (details). This means that RATELIMIT_PER_ENDPOINT will prevent the same user from clicking many times on the launch/ endpoint, but will not prevent many different users from doing the same.
One possible solution: we could create a new Flask-Limiter instance and configure it to use the endpoint name as the key, so that all users share a single limit per endpoint.
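To illustrate why the choice of key matters, here is a minimal stdlib-only sketch. It is not the invenio-app or Flask-Limiter implementation; the `SlidingWindowLimiter` class, the request dictionaries, and the limit of 5 per second are all made up for the example. It only shows that a per-client key lets a burst of distinct users through, while an endpoint-name key caps the burst as a whole.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per `window` seconds for each key (illustrative only)."""

    def __init__(self, limit, window, key_func):
        self.limit = limit
        self.window = window
        self.key_func = key_func
        self.hits = defaultdict(deque)  # key -> timestamps of accepted hits

    def allow(self, request, now=None):
        now = time.monotonic() if now is None else now
        timestamps = self.hits[self.key_func(request)]
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True


def client_key(request):
    # Per-client bucket: IP address plus User-Agent.
    return (request["ip"], request["user_agent"])


def endpoint_key(request):
    # One shared bucket per endpoint, regardless of who is calling.
    return request["endpoint"]


def count_accepted(limiter, requests, now=0.0):
    return sum(limiter.allow(r, now=now) for r in requests)


# 100 different users hit launch/ at the same instant.
burst = [
    {"ip": f"10.0.0.{i}", "user_agent": "browser", "endpoint": "launch"}
    for i in range(100)
]

per_client = SlidingWindowLimiter(limit=5, window=1.0, key_func=client_key)
per_endpoint = SlidingWindowLimiter(limit=5, window=1.0, key_func=endpoint_key)

accepted_per_client = count_accepted(per_client, burst)      # every user has their own bucket
accepted_per_endpoint = count_accepted(per_endpoint, burst)  # all users share one bucket
```

With the per-client key, all 100 requests are accepted because each user stays under their own limit. With the endpoint key, only the first 5 get through in that window.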
Will continue the investigation.
@tiborsimko should I add a configurable rate limit for the launch/ endpoint, such as REANA_LAUNCH_RATE_LIMIT in the Helm chart?
Currently, REANA cluster administrators can set API endpoint rate limiting via the RATELIMIT_GUEST_USER and REANA_RATELIMIT_GUEST_USER environment variables. The default value is "20 per second". This is generally OK for "fast" endpoints, but it may be too much for "slow" endpoints.
For the Run-on-REANA sprint, we shall have a run?from=... like endpoint which will gather the workflow specification from external sources. If the specification lives inside a big tarball, it may take several seconds to get the sources and start a workflow. This could bring our cluster to its knees.
It is therefore good to investigate solutions to prevent this:
For example, on the app-level, Flask-Limiter can have different rates for different endpoints. We can simply decorate different endpoints differently and expose that to cluster admins. A complex setup of many different environment variables, or a variable encoding a Python dictionary, could be envisaged. But a simple setup of distinguishing only two cases, a "fast" endpoint and a "slow" endpoint, with two different corresponding rate limiting values, may already be enough.
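The two-tier idea could be sketched roughly as follows. This is only an illustration, not existing REANA code: the variable names `REANA_RATELIMIT_FAST` and `REANA_RATELIMIT_SLOW`, the "1 per second" default for slow endpoints, and the `SLOW_ENDPOINTS` set are hypothetical. Only the "20 per second" default comes from the current configuration.

```python
import os
import re

UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def parse_rate(text):
    """Turn a string such as "20 per second" into (count, period_in_seconds)."""
    match = re.fullmatch(
        r"\s*(\d+)\s*(?:per|/)\s*(second|minute|hour|day)s?\s*", text.lower()
    )
    if match is None:
        raise ValueError(f"unsupported rate limit: {text!r}")
    return int(match.group(1)), UNIT_SECONDS[match.group(2)]


def load_limits(environ=os.environ):
    # Hypothetical variable names; cluster admins would set these two values.
    fast = environ.get("REANA_RATELIMIT_FAST", "20 per second")
    slow = environ.get("REANA_RATELIMIT_SLOW", "1 per second")  # assumed default
    return {"fast": parse_rate(fast), "slow": parse_rate(slow)}


# Hypothetical list of endpoints considered "slow".
SLOW_ENDPOINTS = {"launch"}


def limit_for(endpoint, limits):
    """Pick the rate limit tier for an endpoint name."""
    tier = "slow" if endpoint in SLOW_ENDPOINTS else "fast"
    return limits[tier]


# Example: the admin overrides only the slow tier.
limits = load_limits({"REANA_RATELIMIT_SLOW": "5 per minute"})
launch_limit = limit_for("launch", limits)
status_limit = limit_for("status", limits)
```

Keeping only two tiers means admins tune two values rather than one per endpoint, at the cost of not being able to single out an individual endpoint.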
For example, on the devops-level, we could investigate Traefik rate-limiting or HAProxy maxconn, which would allow only N requests at a time and queue the others until those have been served, preventing cluster overload.
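The "allow only N at a time, queue the rest" behaviour can be pictured with a small in-process analogue. This is not Traefik or HAProxy configuration, just a sketch using a semaphore. The cap of 2 and the 50 ms sleep are arbitrary stand-ins, and unlike a real proxy queue, a semaphore does not guarantee first-come-first-served ordering.

```python
import threading
import time

MAX_CONCURRENT = 2  # assumed cap, playing the role of N
slots = threading.BoundedSemaphore(MAX_CONCURRENT)

state_lock = threading.Lock()
in_flight = 0
peak_in_flight = 0
completed = 0


def handle_slow_request():
    """Simulate a slow endpoint that may only run N times concurrently."""
    global in_flight, peak_in_flight, completed
    with slots:  # blocks (waits in line) while N requests are already running
        with state_lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(0.05)  # stand-in for fetching and unpacking a tarball
        with state_lock:
            in_flight -= 1
            completed += 1


# Six requests arrive at once; at most two are processed at any moment.
threads = [threading.Thread(target=handle_slow_request) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All six requests eventually complete, but never more than `MAX_CONCURRENT` run at the same time, which is the property that protects the cluster. Excess requests wait instead of being rejected.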
We should investigate the best options and implement either an in-app solution or an external-service solution to prevent cluster overload when many hundreds of users click on the "Run-on-REANA" badge at the same time.