rate limit for run-on-reana launch endpoint

tiborsimko commented 2 years ago

Currently, REANA cluster administrators can set API endpoint rate limiting via RATELIMIT_GUEST_USER and REANA_RATELIMIT_GUEST_USER environment variables. The default value is "20 per second".

This is generally OK for "fast" endpoints, but it may be too much for "slow" endpoints.

For the Run-on-REANA sprint, we shall have a run?from=...-like endpoint that will gather the workflow specification from external sources. If the specification lives inside a big tarball, it may take several seconds to fetch the sources and start a workflow.

With many such requests arriving at once, this could bring our cluster to its knees.

It is therefore good to investigate solutions to prevent this:

For example, on the app level, Flask-Limiter can apply different rates to different endpoints. We can simply decorate different endpoints differently and expose that to cluster admins. A complex setup of many different environment variables, or a variable encoding a Python dictionary, could be envisaged. But a simple setup distinguishing only two cases, a "fast" endpoint and a "slow" endpoint, with two corresponding rate-limiting values, may already be enough.
For example, on the devops level, we could investigate Traefik rate limiting or HAProxy maxconn, which would allow only N concurrent requests and queue the others until the earlier ones are served, preventing cluster overload.
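
To make the simple "fast"/"slow" variant concrete, here is a minimal sketch of how the two values could be read and resolved per endpoint. It is only an illustration: the variable names REANA_RATELIMIT_FAST and REANA_RATELIMIT_SLOW, the "1 per second" default and the SLOW_ENDPOINTS set are hypothetical, and the parser is a stand-in written for this example rather than Flask-Limiter's own.

```python
import os

# Hypothetical variable names and defaults, for illustration only.
FAST_LIMIT = os.environ.get("REANA_RATELIMIT_FAST", "20 per second")
SLOW_LIMIT = os.environ.get("REANA_RATELIMIT_SLOW", "1 per second")

# Hypothetical set of endpoint names that are considered "slow".
SLOW_ENDPOINTS = {"launch"}

_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def parse_limit(value):
    """Parse a string like '20 per second' into (max_requests, window_seconds)."""
    count, per, unit = value.strip().split()
    if per != "per":
        raise ValueError(f"unsupported rate limit: {value!r}")
    # Accept both singular and plural units ("minute" and "minutes").
    return int(count), _UNIT_SECONDS[unit.rstrip("s")]


def limit_for(endpoint):
    """Return the parsed limit that applies to the given endpoint name."""
    raw = SLOW_LIMIT if endpoint in SLOW_ENDPOINTS else FAST_LIMIT
    return parse_limit(raw)
```

With only two knobs, cluster admins would not need to know the endpoint names; the split between "fast" and "slow" endpoints would stay in the code.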

We should investigate the best options and implement either an in-app solution or an external-service solution to prevent cluster overload when many hundreds of users click on the "Run-on-REANA" badge at the same time.

VMois commented 2 years ago

After investigation, I found that you can define custom limits for each endpoint in invenio-app using RATELIMIT_PER_ENDPOINT (details).

If I understand correctly, our main goal is to prevent cluster overload. In that case, the above method will not help. invenio-app stores rate information per key, and its default key is based on the client IP address and User-Agent (details). This means that RATELIMIT_PER_ENDPOINT will prevent the same user from hitting the launch/ endpoint many times, but will not prevent many different users from doing the same.

One possible solution: we can create a new Flask-Limiter instance and configure it to use the endpoint name as the key, so that the limit is shared by all users of that endpoint.
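
The difference between the two keying strategies can be shown with a small self-contained sketch. It deliberately does not use Flask-Limiter or invenio-app. FixedWindowLimiter, the two key functions and simulate are hypothetical helpers written for this illustration, and the fixed-window counting is an assumed strategy, not necessarily what those libraries use.

```python
import time


class FixedWindowLimiter:
    """Allow at most `max_requests` per `window` seconds for each key."""

    def __init__(self, max_requests, window, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self._windows = {}  # key -> (window_start, request_count)

    def allow(self, key):
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window expired
        if count >= self.max_requests:
            return False
        self._windows[key] = (start, count + 1)
        return True


def per_client_key(endpoint, ip, user_agent):
    # Mimics a key based on IP address and User-Agent: one counter per client.
    return f"{ip}|{user_agent}"


def per_endpoint_key(endpoint, ip, user_agent):
    # One counter shared by every client calling this endpoint.
    return endpoint


def simulate(key_func, clients, max_requests=2):
    """Count accepted requests when each client calls 'launch' once."""
    # A frozen clock keeps all requests inside the same window.
    limiter = FixedWindowLimiter(max_requests, window=60, clock=lambda: 0.0)
    return sum(limiter.allow(key_func("launch", ip, "browser")) for ip in clients)


clients = [f"10.0.0.{i}" for i in range(5)]
accepted_per_client = simulate(per_client_key, clients)
accepted_per_endpoint = simulate(per_endpoint_key, clients)
```

Here five different clients each call the endpoint once with a limit of two requests per window. With the per-client key all five requests pass, whereas with the endpoint name as the key only the first two are accepted, which is the behaviour needed to protect the cluster.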

Will continue the investigation.

VMois commented 2 years ago

@tiborsimko should I add a configurable rate limit for the launch/ endpoint, for example REANA_LAUNCH_RATE_LIMIT in the Helm chart?

reanahub / reana-server

rate limit for run-on-reana launch endpoint #443