lanl / BEE

Fix reset error: addresses issue #757 #793

Closed pagrubel closed 3 months ago

pagrubel commented 3 months ago
- check workflow state
- do not allow reset when there are Running or Initializing workflows
- add get_workflow_list method to eliminate duplicate code
- make response text message clearer to read in code

Resolves: issue #757
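
A minimal sketch of the check listed above, for illustration only. It borrows the name `get_workflow_list` from the list, but its signature, the `(name, id, status)` row layout, and the helpers `blocking_workflows` and `reset_allowed` are assumptions rather than the actual BEE code.

```python
# Hypothetical sketch: signatures and data layout are assumptions, not BEE's API.
ACTIVE_STATES = {"Running", "Initializing"}


def get_workflow_list(rows):
    """Turn raw (name, wf_id, status) rows into dictionaries in one place."""
    return [{"name": n, "id": i, "status": s} for n, i, s in rows]


def blocking_workflows(workflows):
    """Return the workflows whose state should prevent a reset."""
    return [wf for wf in workflows if wf["status"] in ACTIVE_STATES]


def reset_allowed(workflows):
    """Return (allowed, message) so the caller prints a single clear response."""
    blocking = blocking_workflows(workflows)
    if not blocking:
        return True, "No Running or Initializing workflows; reset may proceed."
    ids = ", ".join(wf["id"] for wf in blocking)
    return False, f"Reset refused: workflows still active ({ids}). Check 'beeflow list'."
```

Returning the message together with the decision keeps the response text in one place, which is one way to make it easier to read in code.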

rstyd commented 3 months ago

Instead of stopping a reset when there are initializing workflows, I think it'd be better to just kill them outright, but ask the user first.

pagrubel commented 3 months ago

> Instead of stopping a reset when there are initializing workflows, I think it'd be better to just kill them outright, but ask the user first.

How should I do that? What needs to be killed?

pagrubel commented 3 months ago

I will try to modify this to search for any Running or Initializing workflows and give the user a chance to stop the reset process. If they want to continue, I will attempt to cancel the workflows, then stop beeflow and delete the directory. I may also add a longer wait, just to get around the Initializing problem.
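
The flow described above could look roughly like the sketch below. Everything here is hypothetical: `reset_with_cancel`, its parameters, and the injected `confirm`, `cancel`, and `stop` callables are placeholders for whatever the client actually uses, and the retry loop around `shutil.rmtree` is just one possible way to implement the longer wait.

```python
import shutil
import time

# Hypothetical sketch: names and signatures are assumptions, not BEE's API.
ACTIVE_STATES = {"Running", "Initializing"}


def reset_with_cancel(workflows, confirm, cancel, stop, workdir,
                      wait_seconds=0.0, retries=3):
    """Confirm, cancel active workflows, stop components, remove workdir.

    Injected placeholders: confirm(prompt) -> bool, cancel(wf_id), stop().
    Returns True if the directory was removed, False if the user declined.
    """
    active = [wf for wf in workflows if wf["status"] in ACTIVE_STATES]
    if active:
        ids = ", ".join(wf["id"] for wf in active)
        if not confirm(f"Active workflows ({ids}) will be cancelled. Continue?"):
            return False  # the user chose to stop the reset
        for wf in active:
            cancel(wf["id"])
    stop()
    # Components may still be writing files while they shut down,
    # so wait and retry the removal instead of failing on the first error.
    for attempt in range(retries):
        time.sleep(wait_seconds)
        try:
            shutil.rmtree(workdir)
            return True
        except OSError:
            if attempt == retries - 1:
                raise
    return False
```

Passing the prompt and cancel operations in as callables keeps the sketch independent of how the client talks to the workflow manager, and makes the flow easy to exercise without a running beeflow instance.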

rstyd commented 3 months ago

Oh sorry, I missed this. We want to kill all the currently initializing or running workflows, exactly as you described.

kchilleri commented 3 months ago

This is what I get when I have an initializing workflow and try to run `beeflow core reset`:

(base) [kchilleri@darwin-fe3 BEE]$ git checkout issue757/fix_reset_error
branch 'issue757/fix_reset_error' set up to track 'origin/issue757/fix_reset_error'.
Switched to a new branch 'issue757/fix_reset_error'
(base) [kchilleri@darwin-fe3 BEE]$ git status
On branch issue757/fix_reset_error
Your branch is up to date with 'origin/issue757/fix_reset_error'.
(base) [kchilleri@darwin-fe3 BEE]$ cd workdir
(base) [kchilleri@darwin-fe3 workdir]$ cp /vast/home/kchilleri/BEE/examples/cat-grep-tar/lorem.txt .
(base) [kchilleri@darwin-fe3 workdir]$ poetry shell
Spawning shell within /vast/home/kchilleri/.cache/pypoetry/virtualenvs/hpc-beeflow-PIafEbRq-py3.9
. /vast/home/kchilleri/.cache/pypoetry/virtualenvs/hpc-beeflow-PIafEbRq-py3.9/bin/activate
(base) [kchilleri@darwin-fe3 workdir]$ . /vast/home/kchilleri/.cache/pypoetry/virtualenvs/hpc-beeflow-PIafEbRq-py3.9/bin/activate
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow core start
Checking dependencies...

Found Charliecloud 0.37
Starting beeflow...
Run `beeflow core status` for more information.
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow core status
beeflow components:
redis ... RUNNING
scheduler ... RUNNING
celery ... RUNNING
slurmrestd ... RUNNING
wf_manager ... RUNNING
task_manager ... RUNNING
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow list
There are currently no workflows.
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow package /vast/home/kchilleri/BEE/examples/cat-grep-tar .
Package cat-grep-tar.tgz created successfully
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow submit wf1 ./cat-grep-tar.tgz  workflow.cwl input.yml /vast/home/kchilleri/BEE/workdir
Package cat-grep-tar.tgz unpackaged successfully
Workflow submitted! Your workflow id is 67122d.
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow list
Name    ID  Status
wf1 67122d  Initializing
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow query 67122d
Initializing
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow list
Name    ID  Status
wf1 67122d  Initializing
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow core reset
There are 'Initializing' workflows. Reset may fail. Check 'beeflow list'i
A reset will remove this directory: /vast/home/kchilleri/.beeflow

Are you sure you want to reset?

Please ensure all workflows are complete before running a reset
Check the status of workflows by running 'beeflow list'

A reset will shutdown beeflow and its components.

A reset will delete the bee_workdir directory which results in:
Removing the archive of workflows executed.
Removing the archive of workflow containers.
Reset all databases associated with the beeflow app.
Removing all beeflow logs.

Beeflow configuration files from bee_cfg will remain.

Respond with yes(y)/no(n):  y
Beeflow has been shutdown.
Waiting for components to cleanly stop.
Unable to remove /vast/home/kchilleri/.beeflow.
 [Errno 39] Directory not empty: 'x86_64-linux-gnu'
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow core status
Cannot connect to the beeflow daemon, is it running? Check the log at "/vast/home/kchilleri/.beeflow/logs/beeflow.log".
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow core start
Checking dependencies...

Found Charliecloud 0.37
Starting beeflow...
Run `beeflow core status` for more information.
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow core status
beeflow components:
redis ... RUNNING
scheduler ... RUNNING
celery ... RUNNING
slurmrestd ... RUNNING
wf_manager ... RUNNING
task_manager ... RUNNING
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow list
Name    ID  Status
wf1 67122d  Initializing
(hpc-beeflow-py3.9) (base) [kchilleri@darwin-fe3 workdir]$ beeflow cancel 67122d
Workflow is Initializing cannot cancel.

pagrubel commented 3 months ago

I am going to close this pull request and open another; apparently I still have some rebasing problems with it. I will add some information about how I handle Running and Initializing workflows, as well as other active workflows.