Sinclert opened this issue 3 years ago
Hello, if I may add, this is quite time-sensitive for the SCAILFIN project, as it is tied to the scalability tests that we promised as a deliverable for the NSF grant. That grant ends at the end of this summer, so we were hoping to do the tests this spring/early summer. @tiborsimko
Hey there 👋🏻
With the aim of speeding things up a bit (and given that I got no response from Lukas), I created a minimal example workflow to debug the described problem. Check it out at Scailfin/reana-slurm-test.
Within the repo, you can find instructions on how to run the workflow on both Kubernetes and Slurm. Once you do, you will see that the Slurm runs always crash with Bravado HTTP errors, which are misleading, as they hide the real problem (described above).
Hi!
I am commenting on this issue because of two things:
First, @tiborsimko, could you confirm/comment whether the REANA Developer Team would be able to solve the issue? The issue is time-sensitive for us and high priority. Thank you. Also, Sinclert mentioned that this commit is important for the issue, so maybe @roksys can shed some light. Thanks!
Second, based on what @Sinclert said in the opening message of this issue:

> The workflow specification is written in Yadage, and it is totally functional on REANA `0.7.1`, when using Kubernetes as the computational backend.
I have tried to submit the madminer-workflow with Kubernetes as the backend and REANA version `0.7.1`, and the workflow fails at the `multistep-stage` step. Does version `0.7.1` refer to this line?
The same madminer-workflow has been successfully deployed at BNL and NYU with Kubernetes as the backend. Here are screenshots of the failing step:
Hey @irinaespejo, I no longer work for CERN/REANA, so I won't be able to provide much help, but I think that using `rsync` instead of `sftp` within the `_download_dir` method would solve the issue.
Hi @roksys,

I am unsure whether that alone would solve the problem. Replacing `sftp` with `rsync`, without reducing the scope of the command (from the workflow-level folder to the step-level folder), could still run into race condition issues.

If I am not mistaken, it would be like running `rsync -r <src_dir> <dst_dir>` at the same time (at the start of every parallel job), with exactly the same arguments... I think this StackExchange question highlights the problem with the approach you are proposing.
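To make the concern concrete, here is a small, purely illustrative Python sketch (not REANA code; the paths and names are made up) of what that amounts to: every parallel step-job runs the same whole-workdir `rsync` with identical arguments, so a later-starting job can overwrite files that an earlier one just produced.

```python
# Illustrative only: two parallel "step-jobs" each mirroring the *whole*
# workflow directory, instead of just their own step folder.
import subprocess
from concurrent.futures import ThreadPoolExecutor

WORKFLOW_WORKDIR = "/tmp/example-workflow-workdir"  # hypothetical path
SLURM_WORKDIR = "/tmp/example-slurm-workdir"        # hypothetical path


def sync_whole_workdir(step_name: str) -> None:
    """Run the exact same command on the exact same tree for every step."""
    print(f"{step_name}: syncing the whole workflow workdir")
    subprocess.run(
        ["rsync", "-r", f"{WORKFLOW_WORKDIR}/", f"{SLURM_WORKDIR}/"],
        check=True,
    )


with ThreadPoolExecutor(max_workers=2) as pool:
    # Both "step-jobs" start at the same time with identical arguments,
    # which is where the race condition comes from.
    list(pool.map(sync_whole_workdir, ["step_a", "step_b"]))
```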
This issue describes an undesirable behaviour found within the `SlurmJobManagerCERN` class, discovered by Carl Evans (NYU HPC) and myself (NYU CDS).

### Context
We are currently trying to run a complex workflow (see madminer-workflow for reference) on REANA `0.7.3`, using SLURM as the computational backend. The workflow specification is written in Yadage, and it is totally functional on REANA `0.7.1`, when using Kubernetes as the computational backend.

### Problem
The problem is found in any Yadage spec using the `multistep-stage` `scheduler_type` value (where multiple "step-jobs" are run in parallel), when those "step-jobs" depend on scattered files to perform their computations.

In those scenarios, the `SlurmJobManagerCERN._download_dir` function, in addition to being somewhat inefficient (it crawls through every file and directory in the SLURM workdir, making each step scan everything all previous steps created), overwrites the whole workflow WORKDIR at the start of each "step-job".

We have recently raised concerns about this behaviour on the REANA Mattermost channel (precisely here), where we thought the problem was due to the `publisher_type` within the Yadage specification. It turns out that was not the case; instead, it is due to the `multistep-stage` `scheduler_type` value.
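For reference, a minimal (hypothetical) Yadage fragment that exercises this code path could look roughly as follows; the stage, step and parameter names are made up, and the relevant part is only the `multistep-stage` `scheduler_type` combined with a `scatter` over file parameters:

```yaml
stages:
  - name: process_files
    dependencies: [generate_files]
    scheduler:
      scheduler_type: multistep-stage       # run one "step-job" per scattered input
      parameters:
        input_file: {step: generate_files, output: produced_files, flatten: true}
        output_file: '{workdir}/processed.txt'
      scatter:
        method: zip
        parameters: [input_file]            # scatter over the files from the previous stage
    step: {$ref: 'steps.yml#/process_files'}  # hypothetical step definition
```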
### Testing
We did some preliminary testing to properly identify the scope of the issue.

We are fairly sure the issue is located within the `SlurmJobManagerCERN._download_dir` function, as we have performed some testing on a custom `reana-job-controller` Docker image (where we have tuned this function and hardcoded some paths to our needs), and we were able to run the complete workflow successfully ✅

### Possible solution
We believe a good patch would involve reducing the scope of the WORKDIR copying procedure in `SlurmJobManagerCERN._download_dir`, from the "workflow" level to the "step-job" level. That way, there would not be any overwriting problems among parallel "step-jobs" within the same workflow stage.
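To illustrate the idea, here is a rough sketch (a sketch only, not the actual `reana-job-controller` code; the function name and arguments are made up, and plain `rsync` is used instead of the existing SFTP-based transfer) of restricting the copy to a single step-job's sub-directory:

```python
import os
import subprocess


def sync_step_dir(src_workdir: str, dst_workdir: str, step_subdir: str) -> None:
    """Sync only one step-job's sub-directory (hypothetical helper).

    Instead of mirroring the whole workflow WORKDIR at the start of every
    parallel "step-job" (which lets jobs overwrite each other's files), copy
    only the sub-directory belonging to the step-job being launched.
    """
    src = os.path.join(src_workdir, step_subdir)
    dst = os.path.join(dst_workdir, step_subdir)
    os.makedirs(dst, exist_ok=True)
    # Trailing slash on src: copy the directory's contents, not the directory itself.
    subprocess.run(["rsync", "-a", f"{src}/", f"{dst}/"], check=True)
```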
### Additional clarifications

This issue has not been detected in any of the workflows you guys use for testing, because none of them uses `multistep-stage` `scheduler_type` values involving files. See:

@lukasheinrich offered to create a dummy workflow to test this behaviour, but no progress has been made so far (message).