FST is going down abruptly.

k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.

https://k2-fsa.github.io/k2

Apache License 2.0

FST is going down abruptly. #1264

Closed kbramhendra closed 3 months ago

kbramhendra commented 7 months ago

Hi, I am using the FST in a production-like setup. I built it from the #1218 branch with torch 1.14. The FST process goes down abruptly for no apparent reason. It is not caused by an OOM issue, nor is any particular utterance triggering it. @pkufool, can you please suggest any ways to mitigate this?

danpovey commented 7 months ago

Would need much more information. Presumably a process is dying: what code is it running? How is it terminating, e.g. what signal? If it's Python, there should be ways to catch the signal with a try/except at the outer level of the code and report it before dying.
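A minimal sketch of that suggestion, assuming the dying process is a plain Python worker. The names `install_diagnostics`, `run_guarded`, and `fst_worker` are hypothetical and are not part of k2 or Triton. The idea is to log catchable signals and uncaught exceptions at the outermost level, and to let the standard-library `faulthandler` module dump a traceback on fatal signals such as SIGSEGV. SIGKILL cannot be caught from inside the process, so this would not reveal a kill issued from outside.

```python
import faulthandler
import logging
import signal
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger("fst_worker")  # hypothetical logger name


def install_diagnostics():
    """Dump tracebacks on fatal signals and log catchable ones.

    Must be called from the main thread, because signal.signal() requires it.
    """
    # Writes a Python traceback on SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=sys.stderr, all_threads=True)

    def on_signal(signum, frame):
        log.error("received signal %s", signal.Signals(signum).name)
        raise SystemExit(128 + signum)

    # SIGKILL is deliberately absent: it cannot be caught or handled.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, on_signal)


def run_guarded(run_worker):
    """Call `run_worker` (a hypothetical entry point) and log why it stopped."""
    try:
        result = run_worker()
    except SystemExit as exc:
        log.error("worker exiting with code %s", exc.code)
        raise
    except BaseException:
        # Logs the full traceback before the exception propagates.
        log.exception("worker died with an uncaught exception")
        raise
    log.info("worker returned normally")
    return result
```

A worker script would call `install_diagnostics()` once at startup and then `run_guarded(main)`, where `main` stands in for whatever function runs the decoding loop.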

kbramhendra commented 7 months ago

Hi, thanks for replying. It's running on Triton Inference Server with Python. The overall setup runs on Kubernetes. How can I stop this process from dying? There aren't any signals per se. Memory is fine on both GPU and CPU, and the process is in an idle state.

danpovey commented 7 months ago

Inference-server stuff would normally be a sherpa issue; did you build that with sherpa? If so you should probably open an issue on the associated repo. IDK why you think this is specifically about the FST. But when a process dies, either it exits or it dies by a signal. I'm not an expert on how to debug such things, and haven't used Triton, but there should always be a way to track it down, e.g. get a stack trace. Perhaps some debug setting.
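To illustrate the "exits or dies by a signal" distinction, here is a small sketch that assumes a POSIX system and a process that can be launched as a child from Python. The helpers `describe_exit` and `run_and_report` are hypothetical names. With the standard-library `subprocess` module, a negative return code means the child was terminated by that signal number, while a non-negative code is an ordinary exit status.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable cause."""
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        return f"killed by signal {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run `cmd` to completion and describe how it terminated."""
    proc = subprocess.run(cmd)
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    # Example: wrap the real worker command passed on the command line.
    print(run_and_report(sys.argv[1:] or [sys.executable, "-c", "pass"]))
```

Inside containers the same information is usually reported as a shell-style exit code, where values above 128 conventionally mean 128 plus the signal number (for example, 137 for SIGKILL).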

kbramhendra commented 7 months ago

It's not built with sherpa. I have 3 processes running on the GPU: the encoder, CTC, and FST modules. The encoder and CTC are ONNX modules, and those processes are still running; only the FST process is dying. All of these run in a Docker setup, so it's becoming difficult for me to track it down. It exits suddenly. Can we prevent this from happening?

danpovey commented 7 months ago

Mm, that's tricky, but in principle it should be possible to reproduce it without Docker for debugging purposes.

kbramhendra commented 7 months ago

Yeah, I have been trying to reproduce it but haven't succeeded. I will try to share logs if I find any. If you come across any such cases or a solution in the future, please let me know. Thank you.

kbramhendra commented 3 months ago

Hi, the issue was found to be in Triton's memory management. Thanks for helping.