replicate / cog-triton

A cog implementation of Nvidia's Triton server
Apache License 2.0

restart triton during setup if it crashes or doesn't start within 3 minutes #22

Closed by technillogue 5 months ago

technillogue commented 5 months ago

sporadically, triton will segfault during startup. because launch_triton_server.py uses Popen and exits, leaving mpirun as an orphan, our setup just keeps waiting forever for triton to come up and never restarts it. if simply running triton again fixes it and we can avoid re-downloading the weights, that's a win; if not, we should still exit promptly instead of hitting the (very long) setup timeout for replicate-internal/replicate. however, if we figure out what's causing this, actually fixing that would obviously be better than patching over it.

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0xb0

https://replicatehq.slack.com/archives/C0617RF4HHP/p1710458336169009?thread_ts=1710458110.755169&cid=C0617RF4HHP
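
a rough sketch of the restart-with-deadline behaviour described above, not the actual fix. it assumes the server is started as a direct child process (so a crash is visible through `Popen.poll()`), which is not how launch_triton_server.py behaves today. the names `wait_until_ready`, `start_with_retries` and `http_ready` are hypothetical, and the 3 attempts are an arbitrary choice; the 180 second default matches the 3 minutes in the title.

```python
import subprocess
import time
import urllib.request
from typing import Callable, Sequence


def http_ready(url: str) -> bool:
    """Hypothetical readiness probe: True if a GET on url returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, HTTP error status, ...
        return False


def wait_until_ready(
    proc: subprocess.Popen,
    is_ready: Callable[[], bool],
    timeout: float,
    poll_interval: float = 1.0,
) -> bool:
    """Return True once is_ready() succeeds; False if proc exits or the timeout passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:  # process already exited, e.g. after a segfault
            return False
        if is_ready():
            return True
        time.sleep(poll_interval)
    return False


def start_with_retries(
    cmd: Sequence[str],
    is_ready: Callable[[], bool],
    attempts: int = 3,
    timeout: float = 180.0,
    poll_interval: float = 1.0,
) -> subprocess.Popen:
    """Start cmd, restarting it if it dies or is not ready in time.

    Raises RuntimeError after the last failed attempt so that setup exits
    promptly instead of waiting for an outer timeout.
    """
    for attempt in range(1, attempts + 1):
        proc = subprocess.Popen(cmd)
        if wait_until_ready(proc, is_ready, timeout, poll_interval):
            return proc
        # Clean up the failed attempt before retrying.
        if proc.poll() is None:
            proc.kill()
        proc.wait()
        print(f"attempt {attempt}/{attempts} failed (exit code {proc.returncode})")
    raise RuntimeError(f"server did not become ready after {attempts} attempts")
```

since the weights are downloaded before the server is launched, retrying only the launch step would avoid re-downloading them. for the readiness check, something like `lambda: http_ready("http://localhost:8000/v2/health/ready")` might work, assuming triton's HTTP endpoint is on its default port. note that `proc.kill()` only kills the direct child; if the command is mpirun, its children may need separate cleanup.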