ucphhpc / migrid-sync

MiGrid workspace where the master branch is kept strictly in sync with the upstream SourceForge svn repo. Any development or experiments should use a branch. If you wish to contribute, you probably want to fork your own clone or work e.g. on the edge branch.
GNU General Public License v2.0

There should only be a single download endpoint #119

Open Bjarke42 opened 2 months ago

Bjarke42 commented 2 months ago

There are several download endpoint URLs seen by a user on migrid, depending on how much data is being downloaded.

For files smaller than 64MB you will use: https://erda.dk/wsgi-bin/cat.py?path=file1&output_format=file

But if the file is larger than 64MB this URL will be used: https://erda.dk/cert_redirect/file1

This is all fine as long as the user uses the migrid web interface as intended, but as soon as anyone writes automation scripts that download files via the web, the migrid server will fail. This can happen, for example, when share-link access is the only access a user or a specific application has to the system, so the web endpoints are the only option even though automation is needed.
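For illustration, an automation script along these lines is what typically triggers the problem. Only the cat.py URL pattern is taken from this issue; the authentication details are assumptions and are omitted here:

```python
# Minimal sketch of an automated web download against the cat.py endpoint.
# Only the URL pattern comes from the issue; session authentication is
# site-specific and omitted here.
import requests

BASE_URL = "https://erda.dk"
session = requests.Session()
# ... authenticate the session here (site-specific, omitted) ...

def download(path, dest):
    """Fetch a single file through the web endpoint and stream it to disk."""
    resp = session.get(
        f"{BASE_URL}/wsgi-bin/cat.py",
        params={"path": path, "output_format": "file"},
        stream=True,
    )
    resp.raise_for_status()
    with open(dest, "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB at a time
            out.write(chunk)

download("file1", "file1")
```

Note that streaming on the client side does not help with the problem described below: the memory pressure is on the server, which buffers the file before it is sent.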

We are seeing the migrid server become unresponsive because of excessive RAM usage by the HTTP process, which loads the entire file into memory before the download via cat.py begins. If you are lucky the kernel will OOM-kill the HTTP process, but in most cases we just see the migrid server become unresponsive for anywhere from five minutes to hours.

I suggest this be changed so that there is only one way to download, and that it is always the correct way, one that cannot in any way, through misuse or otherwise, cause the migrid server to become unresponsive.

If the download is successful, which is not always the case, you can see the following in mig.log:

2024-09-12 13:42:56,227 INFO WSGI cat yielding 8 output parts (1920000000b)
2024-09-12 13:43:23,302 INFO WSGI cat finished yielding all 8 output parts
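For reference, 1,920,000,000 bytes in 8 output parts works out to roughly 240 MB per yielded part, so each part is itself large enough to hurt. A generator that yields small, bounded chunks keeps the server's peak memory flat regardless of file size. The following is only an illustrative sketch, not migrid's actual cat.py code:

```python
# Illustrative only: a WSGI-style generator that serves a file in small,
# bounded chunks instead of buffering large parts in memory.
# This is not migrid's actual cat.py implementation.
CHUNK_SIZE = 1 << 20  # 1 MiB per yielded part

def serve_file(path):
    """Yield the file contents chunk by chunk so peak memory stays near CHUNK_SIZE."""
    with open(path, "rb") as src:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk
```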

A workaround does not exist, but if it happens and you are already logged in on the server, you can run kill -9 on all tini processes, which will end the mayhem.
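If you want to script that clean-up, a rough sketch along these lines would do the same as the manual kill -9, assuming pgrep is available and you have the rights to signal the processes:

```python
# Rough sketch of the manual workaround: SIGKILL every process named "tini".
# Assumes pgrep is available and you have permission to signal the processes.
import os
import signal
import subprocess

def kill_tini():
    result = subprocess.run(["pgrep", "-x", "tini"], capture_output=True, text=True)
    for pid in result.stdout.split():
        os.kill(int(pid), signal.SIGKILL)

if __name__ == "__main__":
    kill_tini()
```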

jonasbardino commented 2 months ago

Thanks for reporting the issue, @Bjarke42. We're looking into it.

albu-diku commented 1 month ago

Hi there,

My name is Alex; I've done the majority of the work on this issue and want to provide an update.

We took some time to look over how best to solve this, both in terms of an immediate fix and what we might do going forward. We think there are opportunities to address this problem at a more fundamental level, and it is my belief that we could do so without the need for a file serving limit, but this would require changes that need a much greater degree of verification.

As such, we have implemented a file serving limit for the interim, with the intent to include it in the next release. The limit is a configuration parameter representing the maximum number of bytes to serve; once a request exceeds that limit, an error is returned to steer users away from the web UI towards other means of retrieving files.
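As a rough illustration of that interim approach, a pre-serve size check could look roughly like the sketch below. The parameter name and error text are placeholders for illustration, not migrid's actual names:

```python
# Illustrative sketch of a pre-serve size check; the parameter name
# "wsgi_serve_limit" and the error text are placeholders, not migrid's actual names.
import os

def check_serve_limit(path, configuration):
    """Refuse to serve files above the configured byte limit."""
    limit = getattr(configuration, "wsgi_serve_limit", 0)
    size = os.path.getsize(path)
    if limit and size > limit:
        return (False,
                "File is %d bytes, above the %d byte web download limit; "
                "please use one of the other data transfer methods instead."
                % (size, limit))
    return (True, "")
```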

We hope this will help address the immediate problem you've faced, and separately we will continue the longer-term work.

Thanks.

jonasbardino commented 1 month ago

Just for the record, the first iteration outlined by @albu-diku above has landed and will be included in the upcoming release. Additional work to hopefully eliminate the excessive in-memory caching issue entirely will follow.