nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0

Remember already submitted htcondor jobs to avoid re-submitting #167

Open meliache opened 2 years ago

meliache commented 2 years ago

TODOs
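For illustration only, a minimal sketch of the idea behind this PR, not the actual implementation in b2luigi/batch/processes/htcondor.py: keep a small on-disk record of which tasks have already been handed to condor_submit and reuse the stored cluster id instead of submitting again. The file name, the submit_if_new helper and the task_key scheme below are hypothetical:

import json
import subprocess
from pathlib import Path

# Hypothetical location for the record of already submitted jobs.
SUBMITTED_JOBS_FILE = Path("submitted_htcondor_jobs.json")


def _load_submitted_jobs() -> dict:
    """Return the stored mapping task_key -> HTCondor cluster id."""
    if SUBMITTED_JOBS_FILE.exists():
        return json.loads(SUBMITTED_JOBS_FILE.read_text())
    return {}


def submit_if_new(task_key: str, submit_file: str) -> str:
    """Call condor_submit only if no cluster id is remembered for this task key."""
    submitted = _load_submitted_jobs()
    if task_key in submitted:
        # Job was already submitted in an earlier run; reuse its cluster id.
        return submitted[task_key]
    output = subprocess.check_output(["condor_submit", submit_file], encoding="utf-8")
    # condor_submit typically ends its output with "... submitted to cluster <id>."
    cluster_id = output.strip().rsplit("cluster", 1)[-1].strip(" .")
    submitted[task_key] = cluster_id
    SUBMITTED_JOBS_FILE.write_text(json.dumps(submitted, indent=2))
    return cluster_id

In the real batch process the remembered job would presumably also need to be checked against condor_q before being trusted, since a stored id can refer to a job that has long since finished or been removed.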

codecov-commenter commented 2 years ago

Codecov Report

Merging #167 (caba809) into main (bd14265) will decrease coverage by 0.52%. The diff coverage is 11.76%.

@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
- Coverage   59.73%   59.21%   -0.53%     
==========================================
  Files          23       23              
  Lines        1530     1547      +17     
==========================================
+ Hits          914      916       +2     
- Misses        616      631      +15     
Impacted Files                         Coverage Δ
b2luigi/batch/processes/htcondor.py    55.88% <11.76%> (-6.31%) ↓

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update bd14265...caba809.

meliache commented 2 years ago

While testing, I got the following error after a while, and I'm trying to find out how it's related:

INFO: Worker Worker(salt=572125441, workers=800, host=naf-belle11.desy.de, username=meliache, pid=28250) was stopped. Shutting down Keep-Alive thread
Traceback (most recent call last):
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
    success &= worker.run()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/worker.py", line 1208, in run
    self._run_task(get_work_response.task_id)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/worker.py", line 1012, in _run_task
    task_process.run()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/__init__.py", line 126, in run
    self.start_job()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/htcondor.py", line 201, in start_job
    output = subprocess.check_output(["condor_submit", submit_file], cwd=submit_file_dir)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 808, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 1484, in _get_handles
    c2pread, c2pwrite = os.pipe()
OSError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_naf_reconstruction.py", line 81, in <module>
    b2luigi.process(
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/cli/process.py", line 113, in process
    runner.run_local(task_list, cli_args, kwargs)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/cli/runner.py", line 46, in run_local
    run_luigi(task_list, cli_args, kwargs)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/cli/runner.py", line 62, in run_luigi
    luigi.build(task_list, **kwargs)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/interface.py", line 237, in build
    luigi_run_result = _schedule_and_run(tasks, worker_scheduler_factory, override_defaults=env_params)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
    success &= worker.run()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/worker.py", line 607, in __exit__
    if task.is_alive():
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/__init__.py", line 135, in is_alive
    job_status = self.get_job_status()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/htcondor.py", line 166, in get_job_status
    job_status = _batch_job_status_cache[self._batch_job_id]
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/site-packages/cachetools/__init__.py", line 371, in __getitem__
    return self.__missing__(key)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/cache.py", line 27, in __missing__
    self._ask_for_job_status(job_id=None)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/retry/api.py", line 80, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/retry/api.py", line 32, in __retry_internal
    return f()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/htcondor.py", line 51, in _ask_for_job_status
    output = subprocess.check_output(q_cmd)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 808, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 1484, in _get_handles
    c2pread, c2pwrite = os.pipe()
OSError: [Errno 24] Too many open files
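The OSError points at file-descriptor exhaustion while the worker (here running with workers=800) keeps spawning condor_submit/condor_q subprocesses. A quick way to see how close the process is to its limit, and to raise the soft limit as a stopgap, is sketched below; it assumes a Linux host with /proc and is a diagnostic aid rather than a fix for whatever is actually leaking descriptors:

import os
import resource

# Per-process file-descriptor limit (soft, hard), e.g. as set by `ulimit -n`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# Descriptors currently held by this process (Linux only, via /proc/self/fd).
print("open file descriptors:", len(os.listdir("/proc/self/fd")))

# Stopgap: raise the soft limit up to the hard limit before calling b2luigi.process().
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))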