aquan9 closed this 8 months ago
@aquan9 As I was reviewing, I found a few places where `.beeflow` was still used and will commit minor fixes for them. However, I'm still testing; I believe I found an error that occurs when someone has a workflow running. I'll post it soon.
This is an error that occurred when a reset was done while workflows were still running. I'm thinking we should check for running workflows using `beeflow list` and advise the user to either let them finish or cancel them via `beeflow cancel <wf_id>`.
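For illustration, a minimal sketch of that pre-reset check, shelling out to `beeflow list`; the output format parsed here (a "Running" state string per line) is an assumption, not the actual CLI contract:

```python
# Hedged sketch: refuse to reset while workflows appear to be running.
import subprocess
import sys


def running_workflows():
    """Return lines of `beeflow list` output that look like running workflows."""
    out = subprocess.run(['beeflow', 'list'], capture_output=True, text=True)
    return [line for line in out.stdout.splitlines() if 'Running' in line]


if __name__ == '__main__':
    running = running_workflows()
    if running:
        print('Running workflows found; let them finish or cancel them with '
              '`beeflow cancel <wf_id>` before resetting:')
        print('\n'.join(running))
        sys.exit(1)
```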
Oops, forgot to post the error:
```
Waiting for components to cleanly stop.
Traceback (most recent call last):
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/bin/beeflow", line 6, in <module>
    sys.exit(main())
  File "/vast/home/pagrubel/BEE/BEE/beeflow/client/bee_client.py", line 554, in main
    app()
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 289, in __call__
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 280, in __call__
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1078, in main
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 783, in invoke
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 607, in wrapper
  File "/vast/home/pagrubel/BEE/BEE/beeflow/client/core.py", line 428, in reset
    shutil.rmtree(directory_to_delete)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 732, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 671, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 669, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'x86_64-linux-gnu'
```
So if I had a workflow running when I did the `beeflow core reset`, it left a neo4j process running:

```
$ ps aux | grep pagrubel | grep -v grep | grep -E 'bee|slurmrest|neo4j'
pagrubel 3228289 6.9 1.0 46490656 2892124 ? Sl 13:41 0:29 /usr/local/openjdk-8/bin/java -cp /var/lib/neo4j/plugins:/var/lib/neo4j/conf:/var/lib/neo4j/lib/*:/var/lib/neo4j/plugins/* -server -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+TrustFinalNonStaticFields -XX:+DisableExplicitGC -Djdk.tls.ephemeralDHKeySize=2048 -Djdk.tls.rejectClientInitiatedRenegotiation=true -Dunsupported.dbms.udc.source=tarball -Dfile.encoding=UTF-8 org.neo4j.server.CommunityEntryPoint --home-dir=/var/lib/neo4j --config-dir=/var/lib/neo4j/conf
```
And more were left if there was more than one workflow running. I think we should check for running workflows and inform the user that they will be cancelled if they continue with the reset; then we will need to kill the GDB instances for that user.
I'm wondering if the changes to fix this need to happen at the level of the "quit" call, because as it stands the `beeflow stop` command should have the same problem. Both `beeflow stop` and `beeflow reset` are calling:

```python
resp = cli_connection.send(paths.beeflow_socket(), {'type': 'quit'})
```
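One way to express that idea, as a sketch only: a shared guard in front of the 'quit' message. `get_running_workflows` is a hypothetical helper, and the imports for `cli_connection` and `paths` are whatever `core.py` already uses:

```python
# Hedged sketch of a guard shared by `beeflow stop` and `beeflow reset`.
# `get_running_workflows` is hypothetical; `cli_connection` and `paths`
# are the project modules used on the line above.
def safe_quit(force=False):
    """Refuse to send 'quit' while workflows are still running."""
    running = get_running_workflows()  # hypothetical wf_manager query
    if running and not force:
        raise RuntimeError(f'{len(running)} workflow(s) still running; '
                           'cancel them first or pass force=True')
    return cli_connection.send(paths.beeflow_socket(), {'type': 'quit'})
```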
Discussion during Oct 10 meeting:

- Orphaned neo4j processes hold files under `~/.beeflow` (in `~/.beeflow/workflows/`), so `beeflow core stop` works but `beeflow core reset` fails since it deletes `~/.beeflow`.
- The pid for each neo4j instance is in the wf_manager database, so we could kill those (see the sketch after this list).
- We also need to evaluate `beeflow cancel`.
- We still need to look at using a different database system, but fix this now.
- For now, should we search for any running workflows and, if there are any, print a message telling the user they need to either wait or cancel the workflows?
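A sketch of that pid cleanup; the list of pids would come from the wf_manager database lookup mentioned above, which is not shown here:

```python
# Hedged sketch: terminate orphaned neo4j (GDB) processes by pid.
import os
import signal


def kill_orphaned_gdbs(pids):
    """Send SIGTERM to each orphaned neo4j pid, ignoring ones already gone."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)  # ask neo4j to shut down cleanly
        except ProcessLookupError:
            pass  # process already exited
```

`reset` could then call this with the pids read from the wf_manager database before deleting `~/.beeflow`.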
1. I get this error if `-a` is used and
2. Maybe we should only be archiving the archives directory and the logs. I get this error when I try to archive (when the above doesn't apply); I think it has to do with some of the active sockets and processes, so I'm thinking we should only copy those.

If I don't care to keep anything, everything works fine.
@aquan9 I think if you just copy the logs and archives, the `-a` option will work. You may want to ask whether they want to copy the `container_archive` directory if it exists, since the user can change that to another location in the configuration file and the files can be quite large.
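A sketch of that selective copy; the `~/.beeflow/logs` and `~/.beeflow/archives` layout and the prompt wording are assumptions:

```python
# Hedged sketch: archive only logs and archives, prompting for the
# (possibly relocated, possibly large) container_archive directory.
import shutil
from pathlib import Path


def archive_bee_dirs(dest):
    """Copy logs/archives, and container_archive only if the user agrees."""
    bee = Path.home() / '.beeflow'
    for name in ('logs', 'archives'):
        src = bee / name
        if src.is_dir():
            shutil.copytree(src, Path(dest) / name, dirs_exist_ok=True)
    container = bee / 'container_archive'  # location may be overridden in config
    if container.is_dir():
        if input(f'Copy {container} (can be large)? [y/N] ').lower() == 'y':
            shutil.copytree(container, Path(dest) / 'container_archive',
                            dirs_exist_ok=True)
```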
@jtronge Since I made the last changes, would you please review them?
This seems to work for me. If I tried to submit a workflow with the `--no-start` option, then I ended up with the `OSError: [Errno 39] Directory not empty: 'x86_64-linux-gnu'` error on calling reset, but maybe this is expected for that case.
Make a `beeflow reset` command with a warning message. The command just finds and removes the `.beeflow` directory.
This should hopefully resolve https://github.com/lanl/BEE/issues/708
This is a continuation of PR #712
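A minimal sketch of what such a command might look like, using Typer (which the project already uses, per the traceback above); the option name, confirmation text, and archive hook are illustrative assumptions, not the actual implementation:

```python
# Hedged sketch of a `reset` subcommand; everything here is illustrative.
import shutil
from pathlib import Path

import typer

app = typer.Typer()


@app.command()
def reset(archive: bool = typer.Option(False, '-a',
                                       help='Save logs and archives first')):
    """Warn the user, then delete ~/.beeflow."""
    bee_dir = Path.home() / '.beeflow'
    typer.echo(f'This will delete {bee_dir} and all workflow state.')
    if not typer.confirm('Continue?'):
        raise typer.Abort()
    if archive:
        pass  # copy logs/archives first, as discussed above
    shutil.rmtree(bee_dir, ignore_errors=True)


if __name__ == '__main__':
    app()
```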