observingClouds / slkspec

fsspec filesystem for stronglink tape archive

Limit processing of queued files to 500 per job #8

Open observingClouds opened 1 year ago

observingClouds commented 1 year ago

On the compute, shared and interactive partitions, `slk retrieve` is allowed to retrieve 500 files at once. Thus, if more than 500 files are requested here, the request should be split into several retrievals. @antarcticrainforest Or would you split up the file list into parts <501 anyway before calling this function? I'll try to add this feature (`group_files_by_tape`) to the slk_helpers as soon as possible. But currently, I am mainly bound to slk testing and user support. So, let's see ;-)

_Originally posted by @neumannd in https://github.com/observingClouds/slkspec/pull/3#discussion_r1035193456_
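As a rough illustration of the splitting idea, here is a minimal sketch; `chunk_files`, `requested_files` and the batching loop are hypothetical, not existing slkspec or pyslk API:

```python
import itertools
from typing import Iterable, Iterator, List


def chunk_files(files: Iterable[str], limit: int = 500) -> Iterator[List[str]]:
    """Yield successive batches of at most `limit` files."""
    it = iter(files)
    while batch := list(itertools.islice(it, limit)):
        yield batch


# Hypothetical usage: one `slk retrieve` call per batch.
# for batch in chunk_files(requested_files, limit=500):
#     ...  # issue `slk retrieve` for this batch
```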

observingClouds commented 1 year ago

On the login nodes we allow `slk retrieve` to retrieve only one file per call. There is a StrongLink config file, /etc/stronglink.conf, which is JSON and contains an attribute `"retrieve_file_limit": 1` (on login nodes) or `"retrieve_file_limit": 500` (on other nodes). This file could be read somewhere to find out how many files are allowed to be retrieved. This number may be changed in the future if needed.

_Originally posted by @neumannd in https://github.com/observingClouds/slkspec/pull/3#discussion_r1035203280_

florianziemen commented 1 year ago

I think in the case of a retrieval of more than 500 files (or maybe rather 10/... tapes) we should assume the user has made a mistake, cancel the whole thing, and throw an error. Otherwise we run into the problem that users might accidentally trigger loading half the HSM into the cache...
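A minimal sketch of that fail-fast behaviour; the threshold and names like `RetrievalTooLargeError` are illustrative assumptions, not existing slkspec API:

```python
MAX_FILES = 500  # illustrative threshold, see discussion above


class RetrievalTooLargeError(RuntimeError):
    """Raised when a retrieval request exceeds the configured limit."""


def check_request_size(files: list) -> None:
    """Abort the whole retrieval instead of silently splitting it."""
    if len(files) > MAX_FILES:
        raise RetrievalTooLargeError(
            f"Refusing to retrieve {len(files)} files (limit: {MAX_FILES}); "
            "this looks like a mistake, please verify the file list."
        )
```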

observingClouds commented 1 year ago

I see that 500 files can be a massive request and a mistake. I would argue, however, that this limitation should be enforced at the lowest level, i.e. in slk or at least pyslk. This would ensure that the behaviour is the same across all access methods and that slkspec remains more general, so it could also be used for a tape archive at a different institution that may have different resources. Instead of using a number of files as the limit, one could also think of restricting a retrieval by size.
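For the size-based variant, a hedged sketch; the byte threshold and the `file_sizes` input are assumptions for illustration only:

```python
MAX_BYTES = 10 * 1024**4  # e.g. 10 TiB, an arbitrary illustrative cap


def check_request_bytes(file_sizes: dict) -> None:
    """Reject retrievals whose total payload exceeds MAX_BYTES."""
    total = sum(file_sizes.values())  # mapping: file path -> size in bytes
    if total > MAX_BYTES:
        raise ValueError(
            f"Requested {total} bytes exceeds the {MAX_BYTES}-byte limit."
        )
```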

florianziemen commented 1 year ago

yeah, just saying that we should not try to bypass such limitations, b/c they are there for a reason.

observingClouds commented 1 year ago

I see where you are coming from. Retrievals are now for the most part combined into a single `slk retrieve` call. If slk has limitations in place, these will also affect slkspec retrievals.

neumannd commented 1 year ago

> yeah, just saying that we should not try to bypass such limitations, b/c they are there for a reason.

Yes. We feared `slk retrieve -R /arch .` ;-)

It would be safest to read the `retrieve_file_limit` from this /etc/stronglink.conf.

neumannd commented 1 year ago

@observingClouds It would be reasonable to read /etc/stronglink.conf (example content):

{"host":"archive.dkrz.de","domain":"ldap","logSize":"10MB","retrieve_file_limit":500}

Then extract the value of `retrieve_file_limit`. Currently, it is 1 on Levante login nodes and 500 on Levante compute/interactive/shared nodes. This limit might be changed in the future or on individual nodes (e.g. a "mass-data-retrieval-node" where it is set to 5000).

```python
import json
import os

slk_conf_global = "/etc/stronglink.conf"

# `-1` == no limit
retrieve_file_limit = -1
if os.path.exists(slk_conf_global):
    with open(slk_conf_global, "r") as f:
        data = json.load(f)
    retrieve_file_limit = data.get("retrieve_file_limit", -1)
```
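Building on that, a hedged usage sketch of feeding the parsed limit into batching; `chunk_files` is the hypothetical helper sketched earlier in this thread, and `requested_files` is an assumed list of paths:

```python
# Treat `-1` ("no limit") as a single batch containing all files.
batch_size = retrieve_file_limit if retrieve_file_limit > 0 else len(requested_files)
for batch in chunk_files(requested_files, limit=batch_size):
    ...  # issue one `slk retrieve` call per batch
```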