microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Forever running job caused by out of sync between api server and database #5065

Open hzy46 opened 3 years ago

hzy46 commented 3 years ago

Issue Description:

Sometimes etcd data can be broken, e.g. a DRI deletes some frameworks manually, or all the data are lost. Most running jobs have requestSynced=true, so the database controller assumes they are synchronized with the API server and stops checking them. If their records have actually been deleted from the API server, these jobs stay in Running status forever, and the API server and database remain out of sync for them.

Workaround & suggestion:

In most cases, the admin/DRI should not touch framework data in the API server directly. Any add/update/delete should go through rest-server. If the admin leaves the etcd data untouched, this issue can be avoided.

However, if this issue does happen, the workaround is:

  1. Connect to the database manually:
    apt update
    apt install postgresql-client
    # default user/password is root/rootpass
    psql -h <PAI-master-ip> -U <user> -W openpai
  2. Set requestSynced to false for the affected jobs:
    UPDATE frameworks SET "requestSynced"=false WHERE <please select the jobs>;

If all the data in etcd are lost, use the following SQL statement:

UPDATE frameworks SET "requestSynced"=false WHERE "requestSynced"=true and "apiServerDeleted"=false and "subState" != 'Completed';
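
To see which rows the statement above would touch, here is a minimal sketch that runs an equivalent UPDATE against an in-memory SQLite table. It is only an illustration: the real database is PostgreSQL, the table here contains just the three columns named in the statement plus a hypothetical "name" column, booleans are stored as 0/1 because SQLite has no boolean type, and the sample rows and the 'Running' value are made up.

```python
import sqlite3

# SQLite stand-in for the statement in this issue (true/false written as 1/0).
RESET_SQL = """
UPDATE frameworks SET "requestSynced" = 0
WHERE "requestSynced" = 1
  AND "apiServerDeleted" = 0
  AND "subState" != 'Completed'
"""


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table with made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE frameworks ('
        'name TEXT PRIMARY KEY, "requestSynced" INTEGER, '
        '"apiServerDeleted" INTEGER, "subState" TEXT)'
    )
    conn.executemany(
        "INSERT INTO frameworks VALUES (?, ?, ?, ?)",
        [
            ("job-running", 1, 0, "Running"),    # should be reset
            ("job-done", 1, 0, "Completed"),     # completed: untouched
            ("job-deleted", 1, 1, "Running"),    # already deleted: untouched
            ("job-pending", 0, 0, "Running"),    # already unsynced: untouched
        ],
    )
    conn.commit()
    return conn


def reset_request_synced(conn: sqlite3.Connection) -> int:
    """Run the reset and return the number of rows that were changed."""
    cur = conn.execute(RESET_SQL)
    conn.commit()
    return cur.rowcount


def unsynced_names(conn: sqlite3.Connection) -> list:
    """Return the names of rows whose requestSynced flag is now 0."""
    rows = conn.execute(
        'SELECT name FROM frameworks WHERE "requestSynced" = 0 ORDER BY name'
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    demo = make_demo_db()
    print("rows reset:", reset_request_synced(demo))
    print("unsynced now:", unsynced_names(demo))
```

Only the running, not-deleted, previously synced row is flipped; completed and already-deleted jobs are left alone, which matches the intent of the WHERE clause.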

Possible solutions for this problem:

  1. Provide a recover-from-database mode. If the admin loses all data, he/she can manually turn this mode on. In this mode, we run UPDATE frameworks SET "requestSynced"=false WHERE "requestSynced"=true and "apiServerDeleted"=false and "subState" != 'Completed' on the admin's behalf.

  2. When the framework watcher starts, it lists all framework objects from the API server. We can compare them with the frameworks in the database. If any framework satisfies all of the following: 1. apiServerDeleted=false 2. requestSynced=true 3. state!=Completed 4. the records in the database and API server differ, or the API server record is missing, we can set its requestSynced=false.

  3. Do 2 periodically in the database poller. Pro: we can handle this issue during normal operation, not only at startup. Con: it adds overhead.
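
The comparison described in solution 2 can be sketched as below. This is not the database controller's code: `DbFramework`, `find_out_of_sync`, and the use of plain dicts for specs are hypothetical names and shapes chosen for illustration, and "different" is assumed to mean simple inequality of the stored spec and the API server object.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DbFramework:
    """Hypothetical view of one row in the frameworks table."""
    name: str
    request_synced: bool
    api_server_deleted: bool
    state: str
    spec: dict


def find_out_of_sync(
    db_rows: List[DbFramework], api_objects: Dict[str, dict]
) -> List[str]:
    """Return names of frameworks whose requestSynced should be set to false.

    api_objects maps framework name to the object listed from the API server.
    """
    stale = []
    for row in db_rows:
        # Conditions 1-3: only consider live, supposedly synced, unfinished jobs.
        if row.api_server_deleted or not row.request_synced:
            continue
        if row.state == "Completed":
            continue
        # Condition 4: API server record is missing or differs from the database.
        api_spec = api_objects.get(row.name)
        if api_spec is None or api_spec != row.spec:
            stale.append(row.name)
    return stale


if __name__ == "__main__":
    rows = [
        DbFramework("job-a", True, False, "Running", {"v": 1}),    # missing in API server
        DbFramework("job-b", True, False, "Running", {"v": 1}),    # in sync
        DbFramework("job-c", True, False, "Running", {"v": 1}),    # differs
        DbFramework("job-d", True, False, "Completed", {"v": 1}),  # completed
    ]
    listed = {"job-b": {"v": 1}, "job-c": {"v": 2}}
    print(find_out_of_sync(rows, listed))
```

Running the same function from the database poller would correspond to solution 3; the cost is one full list from the API server plus one pass over the database rows per run.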

hzy46 commented 3 years ago

Another problem related to out of sync:

  1. If someone updates the spec of a completed job, the job is re-created in the API server.

  2. This is caused by the shortcut in the merge writer.

  3. Currently, this problem is minor, because rest-server can only update one field in the job spec: it sets spec.executionType = 'Stop'. This only causes the job to be stopped and deleted in the API server.

We can either: 1. Reject job spec modification requests after a job is completed, or 2. Accept the request but not sync it to the API server.
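
The two options can be sketched as a single decision function. The names here (`SpecUpdateAction`, `decide_spec_update`, the `reject_when_completed` switch) are hypothetical and do not come from rest-server or the merge writer; the sketch only shows where the check would sit.

```python
from enum import Enum


class SpecUpdateAction(Enum):
    SYNC = "sync"              # store the new spec and sync it to the API server
    REJECT = "reject"          # option 1: refuse the request
    STORE_ONLY = "store_only"  # option 2: store it, but do not sync


def decide_spec_update(
    job_state: str, reject_when_completed: bool = True
) -> SpecUpdateAction:
    """Decide how to treat a job spec update based on the job's state."""
    if job_state != "Completed":
        # Unfinished jobs keep today's behavior.
        return SpecUpdateAction.SYNC
    if reject_when_completed:
        return SpecUpdateAction.REJECT
    return SpecUpdateAction.STORE_ONLY


if __name__ == "__main__":
    print(decide_spec_update("Running"))
    print(decide_spec_update("Completed"))
    print(decide_spec_update("Completed", reject_when_completed=False))
```

Either branch for completed jobs avoids the re-creation in the API server; rejecting is more visible to the caller, while store-only keeps the request from failing.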