Closed: vidit-bhatia closed this issue 6 years ago.
Thanks for reporting the problem. Can you please provide stdout.txt and stderr.txt from /mnt/batch/tasks/startup/ for investigation? You can work around the problem by resizing the cluster to 0 and back to 2:
az batchai cluster resize -n
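For example, a sketch of the full commands (cluster name and resource group are placeholders; exact option names may differ between CLI versions):

```sh
# Scale the cluster down to zero nodes, then back up to two.
az batchai cluster resize -n <cluster-name> -g <resource-group> -t 0
az batchai cluster resize -n <cluster-name> -g <resource-group> -t 2
```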
Thanks, Alex
vidit-bhatia, I see you still have one node in an unusable state. You probably want to delete it if you are not using it via ssh, because it is still allocated and considered in use by you (so it will be included in the bill). You can simply set the min size of your cluster to 0 to delete nodes when you are not using them.
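For example, something along these lines (names are placeholders; exact option names may differ between CLI versions):

```sh
# Enable auto-scale so BatchAI removes idle nodes when no jobs are queued.
az batchai cluster auto-scale -n <cluster-name> -g <resource-group> --min 0 --max 2
```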
Please note that the system checks whether it needs to resize the cluster every 5 minutes, so it can take up to 5 minutes for BatchAI to start node allocation after you submit a job.
vidit-bhatia, can you please recreate your cluster? The issue is that your cluster was created before the recent Ubuntu Meltdown patch and kernel update. Now, when your cluster tries to allocate nodes, it gets the new kernel but the old drivers.
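If you go that route, a rough sketch (all names, the VM size, and node counts are placeholders; reuse the parameters you originally created the cluster with):

```sh
# Delete the existing cluster, then create a fresh one so new nodes
# come up with a matching kernel and drivers.
az batchai cluster delete -n <cluster-name> -g <resource-group>
az batchai cluster create -n <cluster-name> -g <resource-group> \
    -s Standard_NC6 --min 0 --max 2 -u <admin-user> -k ~/.ssh/id_rsa.pub
```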
@AlekseiPolkovnikov I will look into it on Monday and see how that can be done, as the cluster is already in use by some people.
We have implemented a workaround on our side so that nodes pick up the new drivers after a resize. So you may keep the cluster; just make sure that all your unusable nodes are removed.
@AlexanderYukhanov Seems like the workaround does not work
what is happening?
Same issue: as soon as the node starts, it becomes unusable.
taking a look
Now it's a different issue: "Blob fuse mounting failed". Can you please check the account name, key, and container name?
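For example, the values can be sanity-checked with something like the following (all names are placeholders):

```sh
# List the storage account keys and confirm the blob container exists.
az storage account keys list -n <storage-account> -g <resource-group>
az storage container exists --name <container-name> \
    --account-name <storage-account> --account-key <storage-key>
```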
Looking into it
@AlexanderYukhanov The Python API does not allow me to update mount settings? Do I need to delete and recreate the cluster again?
Yes, it's not possible to change mount settings after the cluster has been created.
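If it helps, a rough sketch of deleting and recreating the cluster with the corrected blob container mount (all names are placeholders; the mount-related options vary between CLI versions, so check az batchai cluster create -h):

```sh
# Recreate the cluster, supplying the corrected storage account, key, and container.
az batchai cluster delete -n <cluster-name> -g <resource-group>
az batchai cluster create -n <cluster-name> -g <resource-group> \
    -s Standard_NC6 --min 0 --max 2 \
    --storage-account-name <storage-account> \
    --storage-account-key <storage-key> \
    --container-name <blob-container> \
    -u <admin-user> -k ~/.ssh/id_rsa.pub
```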
My server is showing the nodes as unusable. The server used to work just a few days back. I tried going onto a node as described in https://github.com/Azure/BatchAI/issues/9 .
There was nothing mentioned other than that there is an internal error. Is there any way I can start my server or fix these nodes? {"Code":"InternalError","Message":"","Category":"InternalError","ExitCode":1,"Details":null}
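(For reference, node connection details can be listed with something like the following; names are placeholders and the output format may vary.)

```sh
# List each node's IP address and SSH port, then connect to inspect the startup logs.
az batchai cluster list-nodes -n <cluster-name> -g <resource-group> -o table
ssh <admin-user>@<node-ip> -p <node-ssh-port>
```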