microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Can not receive job status change message #5809

Closed HaoLiuHust closed 1 year ago

HaoLiuHust commented 1 year ago

Organization Name:

Short summary about the issue/question:

Recently I stopped receiving job status change messages. The alert-manager error log looks like this:

2022-10-25T13:18:52.018Z [ERROR] Failed when handle job status change for job test-code-server_18304946: meta = {
  "message": "read ECONNRESET",
  "stack": "Error: read ECONNRESET\n at TCP.onStreamRead (internal/stream_base_commons.js:111:27)",
  "config": {
    "url": "http://10.1.9.53:80/alert-manager/api/v1/alerts",
    "method": "post",
    "data": "[]",
    "headers": {
      "Accept": "application/json, text/plain, */*",
      "Content-Type": "application/json",
      "User-Agent": "axios/0.21.1",
      "Content-Length": 2
    },
    "transformRequest": [ null ],
    "transformResponse": [ null ],
    "timeout": 0,
    "xsrfCookieName": "XSRF-TOKEN",
    "xsrfHeaderName": "X-XSRF-TOKEN",
    "maxContentLength": -1,
    "maxBodyLength": -1
  },
  "code": "ECONNRESET"
}

Brief what process you are following:

How to reproduce it:

OpenPAI Environment:

Anything else we need to know:

HaoLiuHust commented 1 year ago

Fixed it by modifying the job status notification source code to skip empty alerts.
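For anyone hitting the same error, here is a minimal sketch of what "skip empty alerts" could look like. The log above shows a POST with `"data": "[]"`, so the idea is to return early when the alert list is empty instead of sending the request. The function name `sendAlerts` and the injected `post` callback are hypothetical and only for illustration; this is not the actual OpenPAI source, and the real change may differ.

```javascript
// Hypothetical helper, not OpenPAI's actual code.
// `alerts` is the list of alerts built for a job status change.
// `post` is any async function that sends the list to alert-manager,
// for example: (body) => axios.post(alertManagerUrl, body)
async function sendAlerts(alerts, post) {
  // Nothing to report: skip the HTTP request entirely so that an
  // empty "[]" body is never POSTed.
  if (!Array.isArray(alerts) || alerts.length === 0) {
    return false;
  }
  await post(alerts);
  return true;
}
```

The function returns `false` when the request was skipped and `true` when it was sent, which makes it easy to log the skipped cases while confirming the fix.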