microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

[Feedback v0.14.0]-Can't submit the job since submit.bind.js:85 errors #4811

Open victorming opened 4 years ago

victorming commented 4 years ago

I deployed OpenPAI v0.14.0 on the cluster, and every service shows as OK in the k8s console. But when I try to submit a job in the web portal, the Submit button stays disabled. I checked the browser console on that page and found the following errors:

GET "http://MASTER_IP:9186/api/v1/kubernetes/api/v1/namespaces/pai-storage/secrets/storage-config" 404 (not found)
GET "http://MASTER_IP:9186/api/v1/kubernetes/api/v1/namespaces/pai-storage/secrets/storage-server" 404 (not found)

Both errors occur at submit.bind.js:85. I checked the source of submit.bind.js and found that the script fails when it cannot fetch these resources. Can anybody help resolve this issue?
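To reproduce the two requests outside the browser, a small probe can help. This is only a sketch: the URL pattern and port 9186 are copied from the console errors above, while the helper names `secret_url` and `probe` are hypothetical. It also assumes the endpoint answers without extra headers; the web portal may attach an authorization token, in which case the status codes seen here could differ.

```python
import urllib.error
import urllib.request


def secret_url(master: str, namespace: str, name: str, port: int = 9186) -> str:
    """Build the proxied Kubernetes secret URL seen in the browser console."""
    return (
        f"http://{master}:{port}/api/v1/kubernetes/api/v1"
        f"/namespaces/{namespace}/secrets/{name}"
    )


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET on url, e.g. 200 or 404."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we want here.
        return err.code


if __name__ == "__main__":
    # MASTER_IP is the placeholder from the report; replace it with a real
    # address and call probe(url) to see whether the secret is reachable.
    for name in ("storage-config", "storage-server"):
        print(secret_url("MASTER_IP", "pai-storage", name))
```

A 404 from `probe` for either name would match the errors reported at submit.bind.js:85.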

victorming commented 4 years ago

I created the 'pai-storage' namespace and manually added the two secrets, 'storage-config' and 'storage-server', to it. The 404 errors have now disappeared, but the Submit button is still disabled. Is it possible that the change has not taken effect for the admin user?
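For anyone repeating the manual step, the namespace and the two secrets can be described as standard Kubernetes manifests. The sketch below is an assumption about what is minimally needed: it creates empty Opaque secrets, which is enough to stop a 404 but may not match what the OpenPAI deployment script actually writes. The function names are hypothetical.

```python
import json


def namespace_manifest(name: str) -> dict:
    """Manifest for a plain Kubernetes namespace."""
    return {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": name}}


def empty_secret_manifest(name: str, namespace: str) -> dict:
    """Manifest for an Opaque secret with no data (an assumption, see above)."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
    }


def storage_bootstrap(
    namespace: str = "pai-storage",
    secrets: tuple = ("storage-config", "storage-server"),
) -> dict:
    """Bundle the namespace and its secrets into one List object."""
    items = [namespace_manifest(namespace)]
    items += [empty_secret_manifest(s, namespace) for s in secrets]
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    # The JSON output can be piped to `kubectl apply -f -`.
    print(json.dumps(storage_bootstrap(), indent=2))
```

Saved under a name of your choosing, the output can be applied with `kubectl apply -f -` reading from standard input.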

Binyang2014 commented 4 years ago

Please refer to the following lines: https://github.com/microsoft/pai/blob/5428c17e1d0d55325368ad85a3ab6b236e8de31a/src/rest-server/deploy/secret-create.sh#L21-L34

I think you should restart the rest-server: run ./paictl service stop -n rest-server, then ./paictl service start -n rest-server

victorming commented 4 years ago

I couldn't find the bash file pai/src/rest-server/deploy/secret-create.sh in the openpai@0.14.0 code. Are you sure this file needs to be executed when the container is being created?

[Screenshot from 2020-08-26 09-08-09]

victorming commented 4 years ago

I manually created the 'pai-storage' namespace and created the two secrets following the script file. Then I restarted the webportal, but I still can't submit even a simple job. See below:

[Screenshot from 2020-08-26 09-22-14]

The "Submit" button stays disabled. Any comments?

Binyang2014 commented 4 years ago

Please click the Edit Yaml button and check whether it reports any errors. You may be missing some fields. Please also provide your whole job config; you can see it after clicking Edit Yaml.

victorming commented 4 years ago

Thanks! The job can now be submitted successfully. It seems the YAML format is stricter, which makes submitting a job much harder. It would be better if the Job Submit page highlighted the missing or invalid input fields.

Unfortunately, the job still failed, for another reason:

Launcher AM failed to heartbeat with YARN RM due to YarnException, maybe App is non-compliant
Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=4096, maxMemory=1024 

I remember lowering some memory request parameters when I deployed the services, but I don't know which parameter causes this issue (the "maxMemory=1024"). I adjusted the parameters of yarn-frameworklauncher, hadoop-resource-manager and other services, but it still doesn't work. Do you know where this is set?

Binyang2014 commented 4 years ago

@yqwang-ms Any comments about this?

yqwang-ms commented 4 years ago

Please increase yarn.scheduler.maximum-allocation-mb for hadoop-resource-manager. It is too small now; it is the source of maxMemory=1024.


yarn.scheduler.maximum-allocation-mb | The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.
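To see why the request was rejected, the condition in the error message can be modeled in a few lines. This is a simplified sketch of the rule as the message states it, not Hadoop's actual code; the function names are hypothetical, and the numbers are the ones from the log above.

```python
def validate_memory_request(requested_mb: int, max_allocation_mb: int) -> None:
    """Raise if the request breaks the rule quoted in the error message."""
    if requested_mb < 0 or requested_mb > max_allocation_mb:
        raise ValueError(
            "Invalid resource request, requested memory < 0, or requested "
            "memory > max configured, "
            f"requestedMemory={requested_mb}, maxMemory={max_allocation_mb}"
        )


def is_valid(requested_mb: int, max_allocation_mb: int) -> bool:
    """Boolean wrapper around validate_memory_request."""
    try:
        validate_memory_request(requested_mb, max_allocation_mb)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # Values from the failed job: the request exceeds the configured maximum.
    print(is_valid(4096, 1024))
    # After raising the maximum to at least the requested amount.
    print(is_valid(4096, 4096))
```

Under this model, a 4096 MB request only passes once yarn.scheduler.maximum-allocation-mb is at least 4096.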