lanl / BEE

Add reset command from previous branch #724

Closed: aquan9 closed this 8 months ago

aquan9 commented 9 months ago

Make a beeflow reset command with warning message. The command just finds and removes the .beeflow directory.
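
A minimal sketch of what such a command could look like; the directory location, prompt wording, and function shape are illustrative assumptions, not the PR's exact code:

import shutil
import sys
from pathlib import Path

def reset():
    """Remove the .beeflow state directory after a confirmation prompt."""
    beeflow_dir = Path.home() / '.beeflow'  # assumed location
    if not beeflow_dir.is_dir():
        print('Nothing to reset: no .beeflow directory found.')
        return
    answer = input(f'WARNING: this deletes {beeflow_dir} and all workflow '
                   'state. Continue? [y/N] ')
    if answer.lower() != 'y':
        sys.exit('Reset aborted.')
    shutil.rmtree(beeflow_dir)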

This should hopefully resolve https://github.com/lanl/BEE/issues/708

This is a continuation of PR #712

pagrubel commented 9 months ago

@aquan9 As I was reviewing, I found a few minor places where .beeflow was still used and will commit fixes for them. However, I'm still testing. I believe I found an error that occurs if someone has a workflow running. I'll post it soon.

pagrubel commented 9 months ago

This is an error that occurred when a reset was done while workflows were still running. I'm thinking we should check for running workflows using beeflow list and advise the user to either let them finish or cancel them via beeflow cancel <wf_id>.
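
A rough sketch of that guard; the list-of-dicts shape, key names, and status values stand in for whatever actually backs beeflow list:

import sys

def check_no_running_workflows(workflows):
    """Abort a reset while any workflow is still active."""
    # The 'status' and 'wf_id' keys and the status values are hypothetical.
    running = [wf for wf in workflows
               if wf['status'] not in ('Archived', 'Cancelled')]
    if running:
        ids = ', '.join(wf['wf_id'] for wf in running)
        sys.exit(f'Workflows still active ({ids}): let them finish '
                 'or run beeflow cancel <wf_id> first.')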

pagrubel commented 9 months ago

Oops, I forgot to post the error:

Waiting for components to cleanly stop.
Traceback (most recent call last):
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/bin/beeflow", line 6, in <module>
    sys.exit(main())
  File "/vast/home/pagrubel/BEE/BEE/beeflow/client/bee_client.py", line 554, in main
    app()
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 289, in __call__
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 280, in __call__
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1078, in main
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 783, in invoke
  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 607, in wrapper
  File "/vast/home/pagrubel/BEE/BEE/beeflow/client/core.py", line 428, in reset
    shutil.rmtree(directory_to_delete)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 732, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 671, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 669, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'x86_64-linux-gnu'

pagrubel commented 9 months ago

So if I had a workflow running when I did the beeflow core reset, it left a neo4j process running:

ps aux | grep pagrubel | grep -v grep | grep -E 'bee|slurmrest|neo4j'
pagrubel 3228289 6.9 1.0 46490656 2892124 ? Sl 13:41 0:29 /usr/local/openjdk-8/bin/java -cp /var/lib/neo4j/plugins:/var/lib/neo4j/conf:/var/lib/neo4j/lib/*:/var/lib/neo4j/plugins/* -server -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+TrustFinalNonStaticFields -XX:+DisableExplicitGC -Djdk.tls.ephemeralDHKeySize=2048 -Djdk.tls.rejectClientInitiatedRenegotiation=true -Dunsupported.dbms.udc.source=tarball -Dfile.encoding=UTF-8 org.neo4j.server.CommunityEntryPoint --home-dir=/var/lib/neo4j --config-dir=/var/lib/neo4j/conf

And more were left when more than one workflow was running. I think we should check for running workflows and inform the user that they will be cancelled if they continue with the reset; then we will need to kill the GDB instances for that user.

aquan9 commented 9 months ago

I'm wondering if the changes to fix this need to happen at the level of the "quit" call, because as it stands, the "beeflow stop" command should have the same problem.

Both beeflow stop and beeflow reset call:

resp = cli_connection.send(paths.beeflow_socket(), {'type': 'quit'})
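
If the fix does belong at the quit level, one possible shape is a shared shutdown helper that both commands route through. Everything past the send() line below is hypothetical, including the wait helper and the hard-coded ~/.beeflow path:

import shutil
import sys
from pathlib import Path

# Assumed import path for the modules bee_client.py already uses.
from beeflow.common import cli_connection, paths

def shutdown(reset=False):
    """One shutdown path shared by `beeflow stop` and `beeflow reset`."""
    resp = cli_connection.send(paths.beeflow_socket(), {'type': 'quit'})
    if resp is None:
        sys.exit('Could not reach the beeflow daemon.')
    if reset:
        # rmtree must wait until the daemon and its neo4j children exit,
        # since the bind mount keeps ~/.beeflow busy until then.
        wait_for_shutdown()                      # hypothetical helper
        shutil.rmtree(Path.home() / '.beeflow')  # assumed location
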
pagrubel commented 9 months ago

Discussion during Oct 10 meeting:

Orphaned neo4j processes keep files open under ~/.beeflow, so beeflow core stop works, but beeflow core reset fails because it deletes ~/.beeflow. ~/.beeflow/workflows/ is bind mounted into neo4j under /tmp, so as long as an instance is running, ~/.beeflow can't be deleted.

The PID for each neo4j instance is in the wf_manager database, so we could kill those.
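
A sketch of that cleanup, assuming the PIDs can be read out of a SQLite file; the table and column names are guesses, not the real wf_manager schema:

import os
import signal
import sqlite3

def kill_orphaned_gdb_processes(db_path):
    """Terminate neo4j (GDB) processes recorded by the workflow manager."""
    conn = sqlite3.connect(db_path)
    # 'workflows' and 'gdb_pid' are assumed names for this illustration.
    for (pid,) in conn.execute('SELECT gdb_pid FROM workflows'):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # process already gone
    conn.close()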

We also need to evaluate beeflow cancel, which leaves orphaned neo4j instances around.

We still need to look at using a different database system, but fix this now.

For now, should we search for any running workflows and, if there are any, print a message telling the user to either wait or cancel them?

pagrubel commented 8 months ago

1.) I get this error if -a is used and .backup already exists: error.txt. If the -a/--archive flag is set, check for the file before doing anything else, then give a warning and exit.
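
An early check along those lines might look like this; the ~/.beeflow.backup location is only a guess at where the .backup directory lives:

import sys
from pathlib import Path

def check_backup_path(archive):
    """Bail out early when -a/--archive would clobber an existing backup."""
    backup_path = Path.home() / '.beeflow.backup'  # guessed location
    if archive and backup_path.exists():
        sys.exit(f'{backup_path} already exists; move or remove it '
                 'before rerunning with -a.')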

2.) Maybe we should only be archiving the archives directory and the logs. I get this error when I try to archive (when the above doesn't apply): error-archive.txt. I think it has to do with some of the active sockets and processes. I'm thinking we should only copy archives/ and the logs, and maybe the db files. Would that help?

If I don't care to keep anything, everything works fine.

pagrubel commented 8 months ago

@aquan9 I think the -a option will work if you just copy the logs and archives. You may want to ask whether the user wants to copy the container_archive directory if it exists, since the user can relocate it in the configuration file and the files can be quite large.
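
A sketch of that selective copy, assuming logs, archives, and container_archive sit directly under the .beeflow directory (the names and layout are assumptions):

import shutil
from pathlib import Path

def archive_state(beeflow_dir: Path, backup_dir: Path):
    """Copy only the data worth keeping; skip live sockets and process files."""
    for name in ('logs', 'archives'):
        src = beeflow_dir / name
        if src.is_dir():
            shutil.copytree(src, backup_dir / name)
    container_archive = beeflow_dir / 'container_archive'
    if container_archive.is_dir():
        # User-relocatable and potentially large, so ask before copying.
        if input('Also copy container_archive? [y/N] ').lower() == 'y':
            shutil.copytree(container_archive, backup_dir / 'container_archive')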

pagrubel commented 8 months ago

@jtronge Since I made the last changes, would you please review them?

jtronge commented 8 months ago

This seems to work for me. If I submit a workflow with the --no-start option, I end up with the OSError: [Errno 39] Directory not empty: 'x86_64-linux-gnu' error when calling reset, but maybe that is expected in that case.