
REANA Job Controller
http://reana-job-controller.readthedocs.io/

Slurm: WORKDIR files overwritten on multistep-stage specs #311

Open Sinclert opened 3 years ago

Sinclert commented 3 years ago

This issue describes an undesirable behaviour found within the SlurmJobManagerCERN class, discovered by Carl Evans (NYU HPC) and myself (NYU CDS).

Context

We are currently trying to run a complex workflow (see madminer-workflow for reference) on REANA 0.7.3, using Slurm as the computational backend. The workflow specification is written in Yadage, and it is fully functional on REANA 0.7.1 when using Kubernetes as the computational backend.

Problem

The problem appears in any Yadage spec using the multistep-stage scheduler_type value (where multiple "step-jobs" run in parallel), when those "step-jobs" depend on scattered files to perform their computations.

In those scenarios, the SlurmJobManagerCERN._download_dir function, in addition to being somewhat inefficient (it crawls through every file and directory in the Slurm workdir, making each step scan everything all previous steps created), overwrites the whole workflow WORKDIR at the start of each "step-job".
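To make the pattern concrete, here is a minimal sketch of the kind of recursive SFTP download we believe is happening. The function name, signature, and paramiko-style calls are illustrative assumptions, not the actual REANA code:

```python
# Illustrative sketch only (NOT the real SlurmJobManagerCERN code):
# every step-job recursively downloads the *entire* workflow workdir,
# clobbering files that parallel step-jobs have already produced.
import os
import posixpath
import stat


def download_dir(sftp_client, remote_dir, local_dir):
    """Recursively copy remote_dir into local_dir, overwriting existing files."""
    os.makedirs(local_dir, exist_ok=True)
    for entry in sftp_client.listdir_attr(remote_dir):
        remote_path = posixpath.join(remote_dir, entry.filename)
        local_path = os.path.join(local_dir, entry.filename)
        if stat.S_ISDIR(entry.st_mode):
            download_dir(sftp_client, remote_path, local_path)
        else:
            # Called on the whole workflow WORKDIR by every step-job,
            # so concurrent jobs overwrite each other's outputs.
            sftp_client.get(remote_path, local_path)
```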

We recently raised concerns about this behaviour on the REANA Mattermost channel (precisely here), where we thought the problem was due to the publisher_type within the Yadage specification. It turns out that was not the case; it is instead due to the multistep-stage scheduler_type value.

Testing

We did some preliminary testing to properly identify the scope of the issue.

We are fairly sure the issue is located within the SlurmJobManagerCERN._download_dir function: we performed some testing with a custom reana-job-controller Docker image (where we tuned this function and hardcoded some paths to our needs), and we were able to run the complete workflow successfully ✅

Possible solution

We believe a good patch would involve reducing the scope of the SlurmJobManagerCERN._download_dir WORKDIR copying procedure from the "workflow" level to the "step-job" level. That way, there would not be any overwriting problems among parallel "step-jobs" within the same workflow stage.
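A rough sketch of what we have in mind. The step_name argument, directory layout, and the _download_dir signature are assumptions made for illustration, not the actual REANA implementation:

```python
# Minimal sketch of the proposed scoping change (hypothetical names):
# each step-job downloads only its own subdirectory instead of the whole
# workflow WORKDIR, so parallel step-jobs cannot clobber each other.
import os
import posixpath


def download_step_dir(job_manager, workflow_workdir, step_name, local_workdir):
    """Restrict the copy to <workflow_workdir>/<step_name> for this step-job."""
    remote_step_dir = posixpath.join(workflow_workdir, step_name)
    local_step_dir = os.path.join(local_workdir, step_name)
    os.makedirs(local_step_dir, exist_ok=True)
    # Reuse the existing recursive download, but rooted at the step directory
    # rather than at the workflow WORKDIR (signature assumed for illustration).
    job_manager._download_dir(remote_step_dir, local_step_dir)
```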

Additional clarifications

This issue has not been detected in any of the workflows you use for testing because none of them uses the multistep-stage scheduler_type value together with files. See:

@lukasheinrich offered to create a dummy workflow to test this behaviour, but no progress has been made so far (message).

cranmer commented 3 years ago

Hello, if I may add: this is quite time-sensitive for the SCAILFIN project, as it is tied to the scalability tests that we promised as a deliverable for the NSF grant. That grant ends at the end of this summer, so we were hoping to do the tests this spring / early summer. @tiborsimko

Sinclert commented 3 years ago

Hey there 👋🏻

With the aim of speeding things up a bit (and given that I got no response from Lukas), I created a minimal example workflow to debug the described problem. Check it out at Scailfin/reana-slurm-test.

Within the repo, you can find instructions on how to run the workflow on Kubernetes and Slurm. Once you do, you will discover that the Slurm runs always crash with Bravado HTTP errors, which are misleading, as they hide the real problem (described above).

irinaespejo commented 3 years ago

Hi!

I am commenting on this issue because of two things:

  1. @tiborsimko could you confirm / comment whether the REANA Developer Team would be able to solve the issue? The issue is time-sensitive for us and high priority. Thank you. Also, Sinclert mentioned that this commit is important for the issue, so maybe @roksys can shed some light. Thanks!

  2. Based on what @Sinclert said in the opening message of this issue:

The workflow specification is written in Yadage, and it is totally functional on REANA 0.7.1, when using Kubernetes as the computational backend.

I have tried to submit the madminer-workflow with Kubernetes as the backend and REANA version 0.7.1, and the workflow fails at the multistep-stage step. Does version 0.7.1 refer to this line? The same madminer-workflow has been successfully deployed at BNL and NYU with Kubernetes as the backend.

Here I post screenshots of the failing run:

[REANA screenshots]

roksys commented 3 years ago

Hey @irinaespejo, I no longer work for CERN/REANA, so I won't be able to provide much help, but I think that using rsync instead of sftp within the _download_dir method would solve the issue.

Sinclert commented 3 years ago

Hi @roksys ,

I am unsure whether that alone would solve the problem. Replacing sftp with rsync without reducing the scope of the command (from the workflow-level folder to the step-level folder) could still run into race-condition issues.

If I am not mistaken, it would be like running rsync -r <src_dir> <dst_dir> at the same time (at the start of every parallel job), with exactly the same arguments... I think this StackExchange question highlights the problem with the approach you are proposing.
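To illustrate the difference (paths, directory layout, and helper names are hypothetical, and this is only a sketch of the two sync scopes, not a proposal to shell out from the job controller as-is):

```python
# Hypothetical illustration: every parallel step-job syncing the whole
# workflow directory with identical arguments can overwrite files that a
# faster job just produced; syncing only the step's own subdirectory avoids
# that race.
import subprocess

WORKFLOW_SRC = "slurm-host:/scratch/workflow-123/"  # hypothetical remote path
WORKFLOW_DST = "/reana/workdir/workflow-123/"       # hypothetical local path


def sync_whole_workdir():
    # Problematic: same source/destination from every concurrent step-job.
    subprocess.run(["rsync", "-r", WORKFLOW_SRC, WORKFLOW_DST], check=True)


def sync_step_dir(step_name):
    # Safer: each step-job only touches its own subdirectory.
    subprocess.run(
        ["rsync", "-r", f"{WORKFLOW_SRC}{step_name}/", f"{WORKFLOW_DST}{step_name}/"],
        check=True,
    )
```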