isamplesorg / isamples_modelserver

REST API for invoking iSamples ML models
MIT License
0 stars 0 forks source link

Crashing on hyde.cyverse.org #2

Open dannymandel opened 9 months ago

dannymandel commented 9 months ago

There have been a couple times where the Docker container has mysteriously crashed on hyde.cyverse.org. It is unclear what is causing the issue. On log inspection, there isn't really any smoking gun.

dannymandel commented 9 months ago

For the moment, have added

Restart=on-failure

to the service definition on hyde.

dannymandel commented 9 months ago

That seemed to cause it to get killed immediately. Had to remove it.

dannymandel commented 9 months ago

Actually, it seems to still happen regardless of whether the Restart statement is there or not.

dannymandel commented 9 months ago
dannymandel@hyde:~$ sudo service isamples_modelserver status
○ isamples_modelserver.service - Docker Compose iSamples Central Application Service
     Loaded: loaded (/etc/systemd/system/isamples_modelserver.service; disabled; vendor preset: enabled)
     Active: inactive (dead)

Nov 28 13:11:26 hyde.cyverse.org docker[970989]: Container isamples_modelserver-isamples_modelserver-1  Started
Nov 28 13:11:26 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Stopping
Nov 28 13:11:26 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Stopping
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Stopped
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Removing
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Removed
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Network isamples_modelserver_default  Removing
Nov 28 13:11:39 hyde.cyverse.org docker[971280]: Network isamples_modelserver_default  Removed
Nov 28 13:11:39 hyde.cyverse.org systemd[1]: isamples_modelserver.service: Deactivated successfully.
Nov 28 13:11:39 hyde.cyverse.org systemd[1]: Finished Docker Compose iSamples Central Application Service.
dannymandel commented 9 months ago

So it is a problem with running isamples_modelserver as a linux service. Works fine when I run it manually. When I run it as a service it spontaneously stops itself.

dannymandel commented 9 months ago

Going to table the investigation for now, there is an easy workaround.

dannymandel commented 8 months ago

Note that the crash happened again over the weekend, I still don't know what is up with it.

dannymandel commented 8 months ago

I've been experimenting with some signal handlers and shutdown events, and can see this locally with the additional debug logging in place:

/Users/mandeld/.virtualenvs/isamples_modelserver/bin/python /Users/mandeld/iSamples/isamples_modelserver/main.py 
ERROR:root:Unable to locate fasttext model at path UNSET.  All predictions will return NOT_PROVIDED.
INFO:     Will watch for changes in these directories: ['/Users/mandeld/iSamples/isamples_modelserver']
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
INFO:     Started reloader process [15382] using StatReload
ERROR:root:Unable to locate fasttext model at path UNSET.  All predictions will return NOT_PROVIDED.
Loading OpenContext sample model…
INFO:     Started server process [15386]
INFO:     Waiting for application startup.
Loading OpenContext material model…
Loading SESAR material model…
INFO:     Application startup complete.
Loading done.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
  File "<string>", line 1, in <module>
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
    target(sockets=sockets)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/server.py", line 61, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
    self.run_forever()
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once
    handle._run()
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/lifespan/on.py", line 86, in main
    await app(scope, self.receive, self.send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/middleware/errors.py", line 149, in __call__
    await self.app(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 55, in __call__
    await self.app(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 707, in __call__
    await self.lifespan(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 569, in __aexit__
    await self._router.shutdown()
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 666, in shutdown
    handler()
  File "/Users/mandeld/iSamples/isamples_modelserver/main.py", line 79, in on_shutdown
    traceback.print_stack()
INFO:     Application shutdown complete.
INFO:     Finished server process [15386]
Dumping stack trace because the python interpreter was killed.
INFO:     Stopping reloader process [15382]
  File "/Users/mandeld/iSamples/isamples_modelserver/main.py", line 56, in dump_stack_trace
    traceback.print_stack()

Process finished with exit code 0

I'm tempted to see if we can get any more information using something like this deployed out to hyde.

dannymandel commented 8 months ago

Unfortunately it is only the last step in the shutdown process, but it at least registers that something is happening.

dannymandel commented 8 months ago

On thinking about this a bit more (and remembering that the service wasn't working to begin with), I decided to correlate the outage times with hyde restart times and see if they are happening at the same time.

Last outage reported time (check happens once an hour): Sun, Dec 10, 2:47 AM (3 days ago)

Last hyde reboot time: system boot 2023-12-10 02:01

So I think rather than a crash this just might be a symptom of the linux service not coming up after reboot.

dannymandel commented 8 months ago

I manually killed the container then restarted and it came up. This was the service file that seemed to work:

dannymandel@hyde:~$ cat /etc/systemd/system/isamples_modelserver.service 
[Unit]
Description=Docker Compose iSamples Central Modelserver Application Service
Requires=docker.service
After=isamples_central.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/home/isamples/isamples_modelserver
ExecStart=/usr/bin/docker compose up -d --build 
ExecStop=/usr/bin/docker compose down 
TimeoutStartSec=90

[Install]
WantedBy=multi-user.target
dannymandel commented 8 months ago

The two changes I made recently were:

(1) Updated description to include modelserver to differentiate from the core container (2) Changed the after=isamples_central.service (used to be docker)

Will keep an eye on it, but it is working for now.

dannymandel commented 8 months ago

So I think the reason there were no logs when inspecting the crashed Docker container was that it just never started.