lanl / BEE


Container_archive in a different location than bee_workdir #726

Closed pagrubel closed 7 months ago

pagrubel commented 9 months ago

If the user sets container_archive in the config file to a location different from bee_workdir, and that directory doesn't exist yet, the task manager fails to start. Here are the errors from the log:

    mod = importlib.import_module(module)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/vast/home/pagrubel/BEE/BEE/beeflow/task_manager.py", line 318, in <module>
    worker = WorkerInterface(worker_class, **worker_kwargs)
  File "/vast/home/pagrubel/BEE/BEE/beeflow/common/worker_interface.py", line 23, in __init__
    self._worker = worker(**kwargs)
  File "/vast/home/pagrubel/BEE/BEE/beeflow/common/worker/slurm_worker.py", line 224, in __init__
    super().__init__(**kwargs)
  File "/vast/home/pagrubel/BEE/BEE/beeflow/common/worker/worker.py", line 44, in __init__
    self.crt = ContainerRuntimeInterface(crt_driver)
  File "/vast/home/pagrubel/BEE/BEE/beeflow/common/crt_interface.py", line 25, in __init__
    self._crt_driver = crt_driver()
  File "/vast/home/pagrubel/BEE/BEE/beeflow/common/crt/charliecloud_driver.py", line 32, in __init__
    self.container_archive = bc.resolve_path(container_archive)
  File "/vast/home/pagrubel/BEE/BEE/beeflow/common/config_driver.py", line 153, in resolve_path
    os.chdir(os.path.dirname(relative_path))
FileNotFoundError: [Errno 2] No such file or directory: '/vast/home/pagrubel/.beeflow'
[2023-09-27 12:07:18 -0600] [2946053] [INFO] Worker exiting (pid: 2946053)
[2023-09-27 12:07:18 -0600] [2946025] [INFO] Shutting down: Master
[2023-09-27 12:07:18 -0600] [2946025] [INFO] Reason: Worker failed to boot.
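
The traceback ends in `os.chdir(...)` inside `resolve_path`, which raises `FileNotFoundError` when the target directory has not been created yet. One possible way to avoid this, as a minimal sketch only: resolve the path without changing directories and create the directory if it is missing. The function name `resolve_path_no_chdir` is hypothetical and is not part of BEE; this is not necessarily how the issue was actually fixed.

```python
import os


def resolve_path_no_chdir(path_str):
    """Return an absolute path and make sure the directory exists.

    Hypothetical helper for illustration; not BEE's actual resolve_path.
    """
    # Expand environment variables and a leading "~" without touching the cwd.
    expanded = os.path.expanduser(os.path.expandvars(path_str))
    # abspath only normalizes the string, so it works for missing directories.
    absolute = os.path.abspath(expanded)
    # Create the directory (and any missing parents) instead of failing later.
    os.makedirs(absolute, exist_ok=True)
    return absolute
```

With a helper like this, a container_archive value that points at a directory that does not exist yet would be created at startup rather than causing the worker to fail to boot.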

Also, when this happened there was no notification that the task manager wasn't working. I think we should display the states of the components automatically when starting beeflow (I usually check them manually):

 beeflow core start
Checking dependencies...
Found Charliecloud 0.34
Starting beeflow...
Check "/vast/home/pagrubel/workingdir/logs/beeflow.log" or run `beeflow core status` for more information.
(hpc-beeflow-py3.9) (base) pagrubel@darwin-fe1 beeworkdir$ beeflow core status
beeflow components:
scheduler ... RUNNING
slurmrestd ... RUNNING
wf_manager ... RUNNING
task_manager ... FAILED
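
As a rough sketch of what an automatic check could look like, the snippet below parses text in the `name ... STATE` format shown above and lists any component that is not RUNNING. The function name `failed_components` is hypothetical, and the sketch assumes the status output keeps this exact format; it does not call any beeflow API.

```python
def failed_components(status_output):
    """Return the names of components whose state is not RUNNING."""
    failed = []
    for line in status_output.splitlines():
        # Component lines look like "task_manager ... FAILED".
        name, sep, state = line.partition(" ... ")
        if sep and state.strip() != "RUNNING":
            failed.append(name.strip())
    return failed


# Sample text copied from the status output in this issue.
sample = """beeflow components:
scheduler ... RUNNING
slurmrestd ... RUNNING
wf_manager ... RUNNING
task_manager ... FAILED"""

print(failed_components(sample))  # ['task_manager']
```

A start command could run a check like this after launching the components and print a warning if the returned list is not empty.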

pagrubel commented 9 months ago

In addition, I tried submitting the workflow, and the tasks just wait with no message telling the user why. I suppose this is a separate issue, but I thought I'd record it here.

beeflow submit pennant-copy pennant-copy/ pennant-copy/pennant_wf.cwl pennant-copy/pennant.yml ~/beeworkdir
Detected directory instead of packaged workflow. Packaging Directory...
Package pennant-copy.tgz created successfully
Workflow submitted! Your workflow id is c8d2dc.
Started workflow!
(hpc-beeflow-py3.9) (base) pagrubel@darwin-fe1 beeworkdir$ beeflow query c8d2dc
Running
pen_1_node--WAITING
pen_2_node--WAITING
pen_4_node--WAITING
pen_8_node--WAITING
graph--WAITING

pagrubel commented 7 months ago

Closing this as it is fixed