riga / law

Build large-scale task workflows: luigi + job submission + remote targets + environment sandboxing using Docker/Singularity
http://law.readthedocs.io
BSD 3-Clause "New" or "Revised" License

`master` HTCondor workflow breaks submission of seemingly random jobs #183

Open HerrHorizontal opened 1 month ago

HerrHorizontal commented 1 month ago

Bug description

With the latest commit on the master branch, the HTCondorWorkflow execution breaks for seemingly random jobs, with output like:

running htcondor_wrapper_2701241706.sh for job number 211
empty htcondor job arguments for LAW_HTCONDOR_JOB_NUMBER 211

I noticed that the reported LAW_HTCONDOR_JOB_NUMBER 211 does not match the job number I would expect from the Error, Log, Output, and stdall files, which for this particular job all share the suffix _863To864.txt. This might be related.

HerrHorizontal commented 1 month ago

Checking out an earlier commit that doesn't include the HTCondor group submission introduced in PR #176, e.g. commit 1848c573f7b05e45299f749cf5f8da175026d416, seems to run fine. I suspect the bug was introduced in PR #176.

riga commented 1 month ago

Hi @HerrHorizontal,

Odd indeed. Are you sure your workflow didn't pick up a submission json file that was generated with the previous submission mode?

Btw, for the time being, you can also set job::htcondor_job_grouping_submit to False without the need to switch to an older version.

HerrHorizontal commented 1 month ago

I removed the submission json before running the test, so I am pretty sure that it didn't.

I will try this out. Where do I set the job::htcondor_job_grouping_submit configuration for a certain workflow?

riga commented 1 month ago

You can set this value globally in the config, or you can put this into your HTCondor workflow:

def htcondor_create_job_manager(self, **kwargs):
    job_manager = super().htcondor_create_job_manager(**kwargs)
    # True enables grouped submission; set to False to submit jobs individually
    job_manager.job_grouping_submit = True
    job_manager.chunk_size_submit = 0  # all in one
    return job_manager
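To disable grouped submission for a single workflow only, the same hook can be wrapped in a small mixin. The following is a minimal, self-contained sketch of that pattern: `StubJobManager` and `StubHTCondorWorkflow` are hypothetical stand-ins for law's classes so that the example runs without law installed, and the attribute names are taken from the snippet above rather than verified against a particular law version.

```python
class StubJobManager:
    """Hypothetical stand-in for the HTCondor job manager (not law's class)."""

    def __init__(self):
        # assumed defaults, mirroring the values in the snippet above
        self.job_grouping_submit = True
        self.chunk_size_submit = 0


class StubHTCondorWorkflow:
    """Hypothetical stand-in for the workflow base class providing the hook."""

    def htcondor_create_job_manager(self, **kwargs):
        return StubJobManager()


class UngroupedSubmissionMixin:
    """Turns off grouped submission for any workflow that inherits from it."""

    def htcondor_create_job_manager(self, **kwargs):
        # let the next class in the MRO build the manager, then adjust it
        job_manager = super().htcondor_create_job_manager(**kwargs)
        job_manager.job_grouping_submit = False
        return job_manager


class MyWorkflow(UngroupedSubmissionMixin, StubHTCondorWorkflow):
    pass


job_manager = MyWorkflow().htcondor_create_job_manager()
print(job_manager.job_grouping_submit)  # False
```

In a real setup the mixin would be listed before the HTCondor workflow base class, so that its hook runs first and only the workflows that inherit from it are affected.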

Regarding the issue you're seeing, I could not spot anything obviously wrong. Could you send me the content of the submission directory, including the main job files? This would help. Thank you!

harrypuuter commented 2 weeks ago

Hi all,

I am currently observing the same issue as reported by @HerrHorizontal. When I look at the submission jdl and the arguments in the htcondor_wrapper_xxxx.sh file, the first submission looks fine. Problems arise as soon as a single job fails and jobs have to be resubmitted.

I have not looked into the implementation in more detail, but I would suspect something like

Could this be the reason for the errors?

riga commented 1 week ago

@harrypuuter

Thanks for confirming and the suggestion! I think you're onto something. I'm going to create a reproducer this week to debug this further.