nuclio / nuclio

High-Performance Serverless event and data processing platform
https://nuclio.io
Apache License 2.0
5.3k stars 533 forks

Long running tasks can't finish #2556

Open kierzniak opened 2 years ago

kierzniak commented 2 years ago

Hi, I'm using Nuclio with CVAT and I have a problem with automatic annotation using nuclio + yolo-v3-tf. Nuclio is deployed to a dedicated server using Docker.

The problem is that annotation of my 1.5h video can't finish. A video that long should be processed in about 9h, but after 6h I get an error that the task can't be finished.

After investigating the logs, this message caught my attention: "Failed to read functions from a local store". Apparently the "nuclio-local-storage-reader" container restarts after 6h, Nuclio returns an error to CVAT, and CVAT stops the process. How can I reconfigure "nuclio-local-storage-reader" to restart after e.g. 24h? Or maybe the problem is with CVAT, which should handle this error differently?

liranbg commented 2 years ago

Interesting edge case @kierzniak. I am not sure why CVAT returns an error once nuclio-local-storage-reader is down, because nuclio-local-storage-reader is used for function deployment and status tracking. The 6h is hard-coded, but again, I'm not sure how it affects CVAT in terms of cancelling the procedure entirely.

kierzniak commented 2 years ago

In the CVAT logs I'm just getting "500 Server Error: Internal Server Error for url: http://nuclio:8070/api/function_invocations". Maybe CVAT should try to execute the function again instead of stopping the process. I will open an issue in the CVAT repo.
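To illustrate the retry idea, here is a minimal sketch in Python. It is not CVAT's or Nuclio's actual code: `invoke_with_retry` and its `invoke` argument are hypothetical names, and `invoke` is assumed to be a zero-argument callable that performs the request and returns a `(status_code, body)` pair. The sketch retries on any 5xx status with exponential backoff and gives up after a fixed number of attempts.

```python
import time


def invoke_with_retry(invoke, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `invoke()` until it returns a non-5xx status or attempts run out.

    `invoke` is a hypothetical zero-argument callable that returns
    (status_code, body), e.g. a wrapper around the HTTP call to Nuclio.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = invoke()
        if status < 500:
            # Success or a client-side error: retrying would not help.
            return status, body
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"invocation still failing with status {status} after {max_attempts} attempts"
    )
```

In practice `invoke` might wrap the POST to the `function_invocations` URL from the log above. A retry like this would only help if the 500 is transient, for example while the container restarts, which is an assumption that has not been confirmed in this thread.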

liranbg commented 2 years ago

Interesting. I've been trying to reproduce this, but with no success yet. The Nuclio dashboard ensures the container is running before executing any commands over nuclio-local-storage-reader.

Perhaps it has been fixed in the latest versions. I'd ask you to check against 1.8.8 and see. Anyway, introducing features on 1.5.x is something we will not do in the foreseeable future.