lanl / BEE

Add Check for Different Front End #883

Open rstyd opened 2 months ago

rstyd commented 2 months ago

Currently, if a user starts BEE on one front end of an HPC cluster (call it clusterfe1) and then tries to use beeflow commands on another front end of the same cluster, the commands fail, because processes on different front ends usually can't communicate with each other.

If the user tries to run a beeflow core command, they'll get the error: Cannot connect to the beeflow daemon, is it running? Check the log at ".beeflow/logs/beeflow.log".

If the user tries to use any other beeflow command, they'll get a message like Submit: Could not reach WF Manager.

We should add a check in core.py and client.py to make sure the host the user is running on is the same as the host beeflow is currently running on.
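A minimal sketch of what such a check could look like, assuming the host beeflow was started on has already been obtained somehow (how to obtain it is the open question in this issue). The function name `check_same_front_end` and the message wording are hypothetical, not existing BEE code.

```python
import socket


def check_same_front_end(recorded_host, current_host=None):
    """Return an error message if the hosts differ, otherwise None.

    recorded_host: host beeflow was started on (assumed to be known).
    current_host: host the command is running on; defaults to this machine.
    """
    if current_host is None:
        current_host = socket.gethostname()
    # Compare short names so "cluster-fe1" matches "cluster-fe1.example.org".
    if recorded_host.split('.')[0] != current_host.split('.')[0]:
        return (f'beeflow was started on "{recorded_host}", but this command '
                f'is running on "{current_host}". Log in to '
                f'"{recorded_host}" to use beeflow.')
    return None
```

Both core.py and client.py could call something like this before trying to reach the daemon or the workflow manager, and print the returned message instead of the generic connection error.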

One big issue is there isn't currently a clean way to get this information.

We have several options:

  1. The beeflow log at .beeflow/logs/beeflow.log records the front end on which beeflow was last started, in the format:

    Running on cluster-fe1
    Launching components in order: ['redis', 'scheduler', 'celery', 'slurmrestd', 'wf_manager', 'task_manager']

    We could grep the last Running message out of the log (and verify there wasn't a Kill operation afterwards) to get this info. This is brittle, though, and could break if we ever change the beeflow log format.

  2. Alternatively, we could add the hostname where beeflow is running to the workflow DB and read it in the beeflow client. Currently, only the wf_manager uses the workflow DB, so this would add another piece of code that depends on it, which breaks our modularity somewhat. Another issue is that this check won't work if we later allow a client to run on a separate system from the one where the workflow manager is running. That setup wouldn't be affected by the front-end problem, though, so we would just need to skip the check when connecting to the workflow manager from another machine.
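For reference, option 1 could be sketched roughly as follows. The "Running on" line format is taken from the log excerpt above; the text that marks a Kill operation in the log is not shown in this issue, so the `kill_marker` default below is an assumption, and `last_running_host` is a hypothetical helper name.

```python
import re

# Matches the "Running on <host>" line shown in the log excerpt above.
RUNNING_RE = re.compile(r'Running on (\S+)')


def last_running_host(log_text, kill_marker='Kill'):
    """Return the host from the last "Running on" line, or None.

    kill_marker is an ASSUMED substring identifying a kill/stop entry;
    the real log wording may differ.
    """
    host = None
    for line in log_text.splitlines():
        match = RUNNING_RE.search(line)
        if match:
            host = match.group(1)
        elif kill_marker in line:
            # beeflow was stopped after the last start, so no host is active.
            host = None
    return host
```

This shows why the approach is brittle: any change to either log message silently breaks the lookup.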

I think option 2 is the best solution at the moment.
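A rough sketch of option 2, using an in-memory SQLite database purely as a stand-in for the workflow DB. The table name `bee_host`, its columns, and the helper names are all hypothetical; the real workflow DB interface may look quite different.

```python
import socket
import sqlite3


def _ensure_table(conn):
    # Single-row table: the CHECK constraint keeps at most one recorded host.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS bee_host ('
        'id INTEGER PRIMARY KEY CHECK (id = 1), hostname TEXT NOT NULL)'
    )


def record_host(conn, host=None):
    """Store the host beeflow is starting on (defaults to this machine)."""
    host = host or socket.gethostname()
    _ensure_table(conn)
    conn.execute(
        'INSERT OR REPLACE INTO bee_host (id, hostname) VALUES (1, ?)',
        (host,),
    )
    conn.commit()


def recorded_host(conn):
    """Return the stored hostname, or None if nothing has been recorded."""
    _ensure_table(conn)
    row = conn.execute('SELECT hostname FROM bee_host WHERE id = 1').fetchone()
    return row[0] if row else None
```

The start-up code would call something like `record_host` once, and the client would compare `recorded_host` against `socket.gethostname()` before contacting the workflow manager, skipping the comparison for remote clients.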

pagrubel commented 3 weeks ago

@kchilleri When you check the location that beeflow is starting from, can you also check whether the environment variable SLURMD_NODENAME exists? If it does, print its value with a warning that the user is on a compute node, and don't allow beeflow to start.
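Something along these lines might work for the compute-node check. The function name `compute_node_warning` and the message text are hypothetical; only the SLURMD_NODENAME variable comes from the request above.

```python
import os


def compute_node_warning(environ=None):
    """Return a warning string if SLURMD_NODENAME is set, otherwise None.

    environ defaults to os.environ; passing a dict makes this easy to test.
    """
    environ = os.environ if environ is None else environ
    node = environ.get('SLURMD_NODENAME')
    if node:
        return (f'Warning: SLURMD_NODENAME is set to "{node}", so this looks '
                'like a compute node. beeflow must be started from a front end.')
    return None
```

The start command could print the returned warning and exit with a non-zero status instead of launching the components.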