Open dannymandel opened 9 months ago
For the moment, have added
Restart=on-failure
to the service definition on hyde.
That seemed to cause it to get killed immediately. Had to remove it.
Actually, it seems to still happen regardless of whether the Restart
statement is there or not.
dannymandel@hyde:~$ sudo service isamples_modelserver status
○ isamples_modelserver.service - Docker Compose iSamples Central Application Service
Loaded: loaded (/etc/systemd/system/isamples_modelserver.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Nov 28 13:11:26 hyde.cyverse.org docker[970989]: Container isamples_modelserver-isamples_modelserver-1 Started
Nov 28 13:11:26 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1 Stopping
Nov 28 13:11:26 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1 Stopping
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1 Stopped
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1 Removing
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1 Removed
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Network isamples_modelserver_default Removing
Nov 28 13:11:39 hyde.cyverse.org docker[971280]: Network isamples_modelserver_default Removed
Nov 28 13:11:39 hyde.cyverse.org systemd[1]: isamples_modelserver.service: Deactivated successfully.
Nov 28 13:11:39 hyde.cyverse.org systemd[1]: Finished Docker Compose iSamples Central Application Service.
So it is a problem with running isamples_modelserver
as a linux service. Works fine when I run it manually. When I run it as a service it spontaneously stops itself.
Going to table the investigation for now, there is an easy workaround.
Note that the crash happened again over the weekend, I still don't know what is up with it.
I've been experimenting with some signal handlers and shutdown events, and can see this locally with the additional debug logging in place:
/Users/mandeld/.virtualenvs/isamples_modelserver/bin/python /Users/mandeld/iSamples/isamples_modelserver/main.py
ERROR:root:Unable to locate fasttext model at path UNSET. All predictions will return NOT_PROVIDED.
INFO: Will watch for changes in these directories: ['/Users/mandeld/iSamples/isamples_modelserver']
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
INFO: Started reloader process [15382] using StatReload
ERROR:root:Unable to locate fasttext model at path UNSET. All predictions will return NOT_PROVIDED.
Loading OpenContext sample model…
INFO: Started server process [15386]
INFO: Waiting for application startup.
Loading OpenContext material model…
Loading SESAR material model…
INFO: Application startup complete.
Loading done.
INFO: Shutting down
INFO: Waiting for application shutdown.
File "<string>", line 1, in <module>
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/spawn.py", line 135, in _main
return self._bootstrap(parent_sentinel)
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
target(sockets=sockets)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/server.py", line 61, in run
return asyncio.run(self.serve(sockets=sockets))
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
self.run_forever()
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
self._run_once()
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once
handle._run()
File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/lifespan/on.py", line 86, in main
await app(scope, self.receive, self.send)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/fastapi/applications.py", line 289, in __call__
await super().__call__(scope, receive, send)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/middleware/errors.py", line 149, in __call__
await self.app(scope, receive, send)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 55, in __call__
await self.app(scope, receive, send)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 707, in __call__
await self.lifespan(scope, receive, send)
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 677, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 569, in __aexit__
await self._router.shutdown()
File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 666, in shutdown
handler()
File "/Users/mandeld/iSamples/isamples_modelserver/main.py", line 79, in on_shutdown
traceback.print_stack()
INFO: Application shutdown complete.
INFO: Finished server process [15386]
Dumping stack trace because the python interpreter was killed.
INFO: Stopping reloader process [15382]
File "/Users/mandeld/iSamples/isamples_modelserver/main.py", line 56, in dump_stack_trace
traceback.print_stack()
Process finished with exit code 0
I'm tempted to see if we can get any more information using something like this deployed out to hyde.
Unfortunately it is only the last step in the shutdown process, but it at least registers that something is happening.
On thinking about this a bit more (and remembering that the service wasn't working to begin with), I decided to correlate the outage times with hyde restart times and see if they are happening at the same time.
Last outage reported time (check happens once an hour): Sun, Dec 10, 2:47 AM (3 days ago)
Last hyde reboot time: system boot 2023-12-10 02:01
So I think rather than a crash this just might be a symptom of the linux service not coming up after reboot.
I manually killed the container then restarted and it came up. This was the service file that seemed to work:
dannymandel@hyde:~$ cat /etc/systemd/system/isamples_modelserver.service
[Unit]
Description=Docker Compose iSamples Central Modelserver Application Service
Requires=docker.service
After=isamples_central.service
[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/home/isamples/isamples_modelserver
ExecStart=/usr/bin/docker compose up -d --build
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=90
[Install]
WantedBy=multi-user.target
The two changes I made recently were:
(1) Updated description to include modelserver
to differentiate from the core container
(2) Changed the after=isamples_central.service
(used to be docker
)
Will keep an eye on it, but it is working for now.
So I think the reason there were no logs when inspecting the crashed Docker container was that it just never started.
There have been a couple times where the Docker container has mysteriously crashed on hyde.cyverse.org. It is unclear what is causing the issue. On log inspection, there isn't really any smoking gun.