ubccr / hpc-toolset-tutorial

Tutorial for installing Open XDMoD, OnDemand, & ColdFront
GNU General Public License v3.0

OnDemand offline after docker-compose stop/start #134

Closed aebruno closed 2 years ago

aebruno commented 2 years ago

It would be nice if users could stop the containers and restart them later without losing their state. For example, a user completes the first half of the tutorial, stops the containers, and goes to lunch. When they come back and start the containers again, they should be able to pick up where they left off. This flow currently works:

$ ./hpcts start
$ docker-compose down
$ docker-compose up

OnDemand restarts just fine. However, docker-compose down stops and removes the containers (and any networks), so the containers are recreated rather than resumed.

This flow causes OnDemand to come back up in "offline mode":

$ ./hpcts start
$ docker-compose stop
$ docker-compose start

[Screenshot: ood-offline]

@johrstrom Any thoughts? Seems like we should be able to support the stop/start of the containers.

johrstrom commented 2 years ago

It's something similar to the slurmd issue. Something's being cached that apache doesn't like, like a stale PID file or something.
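If the culprit did turn out to be a stale PID file, a container entrypoint could clear it before launching the service. The following is only a minimal sketch of that idea, not what the tutorial's entrypoint does: the function name `remove_stale_pidfile` is hypothetical, and the PID file path is whatever the service in question actually uses.

```shell
#!/bin/sh
# Hypothetical helper: remove a PID file only when the process it names
# is no longer running. The path is supplied by the caller (an assumption;
# this thread does not confirm which file, if any, is at fault).
remove_stale_pidfile() {
    pidfile="$1"
    [ -f "$pidfile" ] || return 0          # nothing to clean up
    pid="$(cat "$pidfile")"
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "process $pid still running; keeping $pidfile"
        return 1
    fi
    echo "removing stale $pidfile"
    rm -f "$pidfile"
}
```

One caveat: after a container restart, an unrelated process may have reused the recorded PID, in which case this check would keep the file. An entrypoint that runs before any service has started could simply delete the file unconditionally.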

aebruno commented 2 years ago

This is still not working for me. Steps to reproduce:

  1. Start fresh, then stop:

    $ ./hpcts start
    ...
    $ ./hpcts stop

  2. Start the containers again:

    $ ./hpcts start

ColdFront and XDMoD start fine. OnDemand and DEX fail to come back up. Here are the logs:

ondemand  | nc: connect to frontend (172.19.0.7) port 22 (tcp) failed: Connection refused
ondemand  | -- Waiting for frontend ssh to become active ...
ondemand  | Connection to frontend (172.19.0.7) 22 port [tcp/ssh] succeeded!
ondemand  | ---> Cleaning NGINX ...
ondemand  | can't find user for hpcadmin
ondemand  | Run 'nginx_stage --help' to see a full list of available command line options.

@johrstrom I know it's getting down to the wire here, but it would be great if we could sort this out. I'm happy to rebuild the OOD containers again.

aebruno commented 2 years ago

I tried this a few times and it seems like a race condition. Unless we sort this out, we'll just have to tell users who hit this to try again or run:

./hpcts destroy
./hpcts start

The above should always bring everything back up fresh without having to re-download the images.
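The "try again, otherwise destroy and start" workaround could be wrapped in a small helper. This is a rough sketch under stated assumptions: the function name `start_or_rebuild` and the `HPCTS` variable are hypothetical, `./hpcts start` is assumed to return once the containers have been launched, and the health check is whatever command the caller supplies (it should exit 0 when OnDemand is reachable).

```shell
#!/bin/sh
# Hypothetical wrapper around the tutorial's ./hpcts script.
# HPCTS can be overridden to point at the script's location.
HPCTS="${HPCTS:-./hpcts}"

start_or_rebuild() {
    # "$@" is a caller-supplied check that exits 0 when the stack is healthy.
    "$HPCTS" start || true
    if "$@"; then
        echo "existing containers came back up"
        return 0
    fi
    echo "health check failed; recreating containers"
    # destroy removes the containers, so the next start creates fresh ones.
    "$HPCTS" destroy && "$HPCTS" start
}
```

It would be called as `start_or_rebuild my_check`, where `my_check` stands in for a user-defined command or function.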

johrstrom commented 2 years ago

:facepalm: I'm sorry, I thought this was settled. Yes, we likely need to run nginx_stage after we start SSSD. My sincere apologies.
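One way to enforce that ordering, given the "can't find user for hpcadmin" error in the logs, is to block until the user name resolves before running the nginx_stage step. The sketch below is an assumption about how this could be done, not the fix that was merged: `wait_for_user` is a hypothetical name, and the retry count and delay are arbitrary defaults.

```shell
#!/bin/sh
# Hypothetical helper: wait until a user name resolves (for example once
# SSSD is answering lookups) before running a command that depends on it.
wait_for_user() {
    user="$1"
    tries="${2:-30}"     # assumption: give up after 30 attempts
    delay="${3:-1}"      # assumption: wait 1 second between attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        # id resolves names through NSS, so it sees SSSD-provided users.
        if id "$user" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "user $user still not resolvable after $tries attempts" >&2
    return 1
}
```

In an entrypoint, a call such as `wait_for_user hpcadmin` would go immediately before the nginx_stage invocation, so the cleanup only runs once the lookup succeeds.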

> without having to re-download the images.

It's not about downloading the images; it's about starting a container that had previously been started. Users don't need a new image, they need a new/fresh container, not a restart of an older one.

I believe docker-compose down stops and removes containers whereas docker-compose stop just stops existing containers so that it can start them up again later.