riga / law

Build large-scale task workflows: luigi + job submission + remote targets + environment sandboxing using Docker/Singularity
http://law.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Simple slurm workflow leads to `KeyError: 'output_files'` #171

Closed joschkabirk closed 8 months ago

joschkabirk commented 8 months ago

Bug description

Hi!

I was using law 0.1.12 in a previous project and am confused why my setup for submitting Slurm jobs on Maxwell with law doesn't work anymore in law 0.1.16. (I didn't test versions between those two tbh).

I tried to reduce my example quite a bit. The example below should just submit one Slurm job that creates a dummy file.

When running this example, I obtain a KeyError: 'output_files' when the Slurm job file is created in law.

Full output/error log:

```shell
(law-0.1.16) [birkjosc@max-wgse002:~/testing/law_slurm] % ls
law.cfg  setup.sh  tasks.py
(law-0.1.16) [birkjosc@max-wgse002:~/testing/law_slurm] % source setup.sh
indexing tasks in 1 module(s)
loading module 'tasks', done

module 'tasks', 1 task(s):
    - SlurmDummyWorkflow

written 1 task(s) to index file '/home/birkjosc/testing/law_slurm/.law/index'
(law-0.1.16) [birkjosc@max-wgse002:~/testing/law_slurm] % law run SlurmDummyWorkflow --print-status -1    130 ↵
print task status with max_depth -1 and target_depth 0

0 > SlurmDummyWorkflow(effective_workflow=slurm, branch=-1, slurm_partition=allcpu, max_runtime=01:00:00, workflow=slurm)
      jobs: LocalFileTarget(fs=local_fs, path=/home/birkjosc/testing/law_slurm/slurm_output/slurm_jobs_0To1.json, optional)
        absent
      collection: TargetCollection(len=1, threshold=1.0)
        absent (0/1)

(law-0.1.16) [birkjosc@max-wgse002:~/testing/law_slurm] % law run SlurmDummyWorkflow    130 ↵
INFO: luigi-interface - Informed scheduler that task   SlurmDummyWorkflow__1__False_5654b4651b   has status   PENDING
INFO: luigi-interface - Done scheduling tasks
INFO: luigi-interface - Running Worker with 1 processes
INFO: luigi-interface - [pid 65917] Worker Worker(salt=4707907419, workers=1, host=max-wgse002.desy.de, username=birkjosc, pid=65917) running   SlurmDummyWorkflow(effective_workflow=slurm, branch=-1, slurm_partition=allcpu, max_runtime=01:00:00, workflow=slurm)
going to submit 1 slurm job(s)
ERROR: luigi-interface - [pid 65917] Worker Worker(salt=4707907419, workers=1, host=max-wgse002.desy.de, username=birkjosc, pid=65917) failed    SlurmDummyWorkflow(effective_workflow=slurm, branch=-1, slurm_partition=allcpu, max_runtime=01:00:00, workflow=slurm)
Traceback (most recent call last):
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/luigi/worker.py", line 203, in run
    new_deps = self._run_get_new_deps()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/luigi/worker.py", line 138, in _run_get_new_deps
    task_gen = self.task.run()
               ^^^^^^^^^^^^^^^
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/law/workflow/remote.py", line 628, in run
    self.submit()
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/law/workflow/remote.py", line 812, in submit
    job_ids, submission_data = self._submit_batch(submit_jobs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/law/workflow/remote.py", line 860, in _submit_batch
    all_job_files[job_num] = self.create_job_file(job_num, branches)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/law/contrib/slurm/workflow.py", line 139, in create_job_file
    job_file, c = self.job_file_factory(postfix=postfix, **c.__dict__)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/law/job/base.py", line 777, in __call__
    return self.create(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/law/contrib/slurm/job.py", line 340, in create
    c.output_files = list(map(str, c.output_files))
                                   ^^^^^^^^^^^^^^
  File "/beegfs/desy/user/birkjosc/conda/envs/law-0.1.16/lib/python3.11/site-packages/law/job/base.py", line 722, in __getattr__
    return self.__dict__[attr]
           ~~~~~~~~~~~~~^^^^^^
KeyError: 'output_files'
INFO: luigi-interface - Informed scheduler that task   SlurmDummyWorkflow__1__False_5654b4651b   has status   FAILED
INFO: luigi-interface - Worker Worker(salt=4707907419, workers=1, host=max-wgse002.desy.de, username=birkjosc, pid=65917) was stopped. Shutting down Keep-Alive thread
INFO: luigi-interface -
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 failed:
    - 1 SlurmDummyWorkflow(...)

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====
```

When I comment out the failing line from the traceback (`c.output_files = list(map(str, c.output_files))`, line 340 of `law/contrib/slurm/job.py`), the job runs without problems.
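The last two frames of the traceback hint at why the error is a `KeyError` rather than an `AttributeError`: the config object's `__getattr__` does a plain lookup in `self.__dict__`. Below is a minimal sketch of that mechanism. The `JobConfig` class and the `input_files` field are hypothetical stand-ins for illustration and are not law's actual implementation; only the two statements copied from the traceback come from the report.

```python
class JobConfig:
    """Hypothetical stand-in for the config object in the traceback (not law's class)."""

    def __init__(self, **kwargs):
        # Store all given fields directly as instance attributes.
        self.__dict__.update(kwargs)

    def __getattr__(self, attr):
        # Only invoked when normal attribute lookup fails, so a plain dict
        # lookup raises KeyError (not AttributeError) for unknown names.
        return self.__dict__[attr]


# "input_files" is an assumed field name; "output_files" is deliberately missing.
c = JobConfig(input_files=["job.sh"])

error = None
try:
    # Same statement as the failing line in the traceback.
    c.output_files = list(map(str, c.output_files))
except KeyError as exc:
    error = exc
    print(f"KeyError: {exc}")
```

If this reading is right, any statement that reads a field the config was never given fails in the same way, which would fit with the job running once that line is commented out.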

I assume something I am doing is wrong? What confused me is that this works with an older version of law and I didn't see changes in the example for the Slurm workflow on Maxwell.

Full output after commenting the mentioned line out:

```shell
...
...
(law-0.1.16) [birkjosc@max-wgse002:~/testing/law_slurm] % law run SlurmDummyWorkflow    40 ↵
INFO: luigi-interface - Informed scheduler that task   SlurmDummyWorkflow__1__False_5654b4651b   has status   PENDING
INFO: luigi-interface - Done scheduling tasks
INFO: luigi-interface - Running Worker with 1 processes
INFO: luigi-interface - [pid 80038] Worker Worker(salt=896605782, workers=1, host=max-wgse002.desy.de, username=birkjosc, pid=80038) running   SlurmDummyWorkflow(effective_workflow=slurm, branch=-1, slurm_partition=allcpu, max_runtime=01:00:00, workflow=slurm)
going to submit 1 slurm job(s)
submitted 1/1 job(s)
submitted 1 slurm job(s)
17:36:50: all: 1, pending: 0 (+0), running: 0 (+0), finished: 1 (+1), retry: 0 (+0), failed: 0 (+0)
polling took 1 second
INFO: luigi-interface - [pid 80038] Worker Worker(salt=896605782, workers=1, host=max-wgse002.desy.de, username=birkjosc, pid=80038) done      SlurmDummyWorkflow(effective_workflow=slurm, branch=-1, slurm_partition=allcpu, max_runtime=01:00:00, workflow=slurm)
INFO: luigi-interface - Informed scheduler that task   SlurmDummyWorkflow__1__False_5654b4651b   has status   DONE
INFO: luigi-interface - Worker Worker(salt=896605782, workers=1, host=max-wgse002.desy.de, username=birkjosc, pid=80038) was stopped. Shutting down Keep-Alive thread
INFO: luigi-interface -
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 ran successfully:
    - 1 SlurmDummyWorkflow(...)

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====
```

## Used files to reproduce

I use a freshly created conda environment with subsequent `pip install law`.

```zsh
(law-0.1.16) [birkjosc@max-wgse002:~/testing/law_slurm] % ls
law.cfg  setup.sh  tasks.py
```

`tasks.py`:

```python
import law
import luigi

law.contrib.load("slurm")


class SlurmDummyWorkflow(law.slurm.SlurmWorkflow):
    slurm_partition = luigi.Parameter(default="allcpu")
    max_runtime = law.DurationParameter(default=1.0, unit="h")

    def slurm_output_directory(self):
        # the directory where submission meta data should be stored
        return law.LocalDirectoryTarget(
            law.util.rel_path(__file__, "slurm_output")
        )

    def create_branch_map(self):
        return {0: "dummy-branch"}

    def output(self):
        dummy_output = law.util.rel_path(__file__, "dummy_output.txt")
        return law.LocalFileTarget(dummy_output)

    def run(self):
        with self.output().open("w") as f:
            f.write("dummy output\n")
```

`law.cfg`:

```
[modules]

tasks

[job]

job_file_dir: /home/birkjosc/testing/law_slurm/slurm_logs
job_file_dir_cleanup: False

[logging]

luigi-interface: INFO

[luigi_core]

local_scheduler: True
no_lock: True

[luigi_worker]

keep_alive: True
ping_interval: 20
wait_interval: 20
max_reschedules: 0

[luigi_scheduler]

retry_count: 0
```

`setup.sh`:

```shell
#!/usr/bin/env bash

action() {
    local this_dir="$( cd "$( dirname "${this_file}" )" && pwd )"

    export PYTHONPATH="${PWD}:${PYTHONPATH}"
    export LAW_HOME="${this_dir}/.law"
    export LAW_CONFIG_FILE="${this_dir}/law.cfg"

    source "$( law completion )" ""
    law index --verbose
}
action
```

riga commented 8 months ago

Thanks for reporting @joschkabirk!

Sounds like a typo on my end in the slurm workflow implementation. I will look into this and get back to you.

joschkabirk commented 8 months ago

Thanks @riga !

riga commented 8 months ago

Ok, I think this was just a stray line that sneaked in during the last refactoring.

Could you give it another try with the latest master? Thanks!

(pip install git+ssh://git@github.com/riga/law.git@master)

joschkabirk commented 8 months ago

Yep, with the latest master everything is running as expected.

Thanks for looking into this! Solved from my side then 👍