microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

How to backup and restore user data stored by rest-server. #5786

Closed siaimes closed 2 years ago

siaimes commented 2 years ago

Short summary about the issue/question: My certificate expired, and the cluster crashed while I was renewing it. I now need to reset and reinstall the cluster, but all user data disappears after that operation, and I could not find a backup and recovery procedure on GitHub.

https://github.com/microsoft/pai/blob/b50f29169d8d9832e06b1878a437ebbad58ccdfe/src/rest-server/deploy/rest-server.yaml.template#L164-L188

It seems that rest-server does not mount any directory, so where is its data stored? How can I back it up and restore it?

Binyang2014 commented 2 years ago

If the db file has not been deleted, you can recover the data. Here is a guide for this: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/troubleshooting.html#how-to-solve-the-problem @hzy46 Can you help take a look?

siaimes commented 2 years ago

> If the db file has not been deleted, you can recover the data. Here is a guide for this: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/troubleshooting.html#how-to-solve-the-problem @hzy46 Can you help take a look?

User data doesn't seem to be stored there; only job data is.

After I reset and reinstalled the cluster, the job data still existed, but the user data was gone, including usernames, passwords, e-mail addresses, SSH public keys, etc.

siaimes commented 2 years ago

I see that user and group information is stored in Kubernetes Secrets, so the problem now seems to be how to back up and restore those Secrets.

Binyang2014 commented 2 years ago

You are right. If you delete the data file for etcd, the user/group info will be lost. We need to dump the Secrets first and then apply them to the new cluster.

siaimes commented 2 years ago

So running the following command will reset the cluster, but all etcd data, including the user and group Secrets, will be lost. Please be careful.

ansible-playbook -i inventory/pai/hosts.yml -e "ansible_python_interpreter=/usr/bin/python3" reset.yml --become --become-user=root -e "@inventory/pai/openpai.yml"
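
For anyone landing here later: the "dump Secrets first, then apply them to the new cluster" idea discussed above could look roughly like the sketch below. It is not an official OpenPAI procedure. The namespace names (pai-user-v2, pai-group), the backup directory, and the script name pai-secrets.sh are assumptions for illustration; check `kubectl get namespaces` on your own cluster before relying on them. The dump may also contain Secrets that Kubernetes generates itself (for example service-account tokens), which you may want to remove from the files before restoring.

```shell
#!/bin/sh
# Sketch only: dump Secrets before a reset and re-apply them after reinstalling.
# ASSUMPTION: the namespaces below are where rest-server keeps user/group
# Secrets; verify with `kubectl get namespaces` and override if they differ.
set -eu

NAMESPACES="${NAMESPACES:-pai-user-v2 pai-group}"
BACKUP_DIR="${BACKUP_DIR:-./pai-secret-backup}"

# Write every Secret of each namespace into one YAML file per namespace.
backup_secrets() {
  mkdir -p "$BACKUP_DIR"
  for ns in $NAMESPACES; do
    kubectl get secrets -n "$ns" -o yaml > "$BACKUP_DIR/$ns.yaml"
  done
}

# Recreate missing namespaces, then apply the dumped Secrets.
restore_secrets() {
  for ns in $NAMESPACES; do
    kubectl get namespace "$ns" >/dev/null 2>&1 || kubectl create namespace "$ns"
    kubectl apply -n "$ns" -f "$BACKUP_DIR/$ns.yaml"
  done
}

# Usage: ./pai-secrets.sh backup    (before running reset.yml)
#        ./pai-secrets.sh restore   (after the cluster is reinstalled)
case "${1:-}" in
  backup)  backup_secrets ;;
  restore) restore_secrets ;;
esac
```

Keep the backup directory outside the cluster nodes that will be reset, and treat the files as sensitive, since Secret values are only base64-encoded.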