microsoftarchive / BatchAI

Repo for publishing code Samples and CLI samples for BatchAI service
MIT License

Server showing node as unusable #19

Closed vidit-bhatia closed 6 years ago

vidit-bhatia commented 6 years ago

My server is showing its nodes as unusable. The server used to work just a few days ago. I tried logging on to the node as described in https://github.com/Azure/BatchAI/issues/9.

image

Nothing was reported other than an internal error. Is there any way I can start my server or fix these nodes?

{"Code":"InternalError","Message":"","Category":"InternalError","ExitCode":1,"Details":null}
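For readers hitting the same payload, here is a minimal sketch of how it can be parsed to confirm that the message field really is empty, which is why the startup logs are needed to diagnose it. The function name `summarize_node_error` is hypothetical and is not part of any BatchAI tooling; the payload is the one quoted in the report.

```python
import json

# Error payload copied verbatim from the report in this thread.
RAW_ERROR = (
    '{"Code":"InternalError","Message":"","Category":"InternalError",'
    '"ExitCode":1,"Details":null}'
)


def summarize_node_error(payload: str) -> str:
    """Return a one-line summary of a node error payload (hypothetical helper)."""
    err = json.loads(payload)
    # An empty message carries no diagnostic detail, so make that explicit.
    message = err.get("Message") or "<no message>"
    return f'{err["Category"]}/{err["Code"]} (exit code {err["ExitCode"]}): {message}'


print(summarize_node_error(RAW_ERROR))
```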

AlexanderYukhanov commented 6 years ago

Thanks for reporting the problem. Can you please provide stdout.txt and stderr.txt from /mnt/batch/tasks/startup/ for investigation? You can solve the problem by resizing the cluster to 0 and back to 2:

az batchai cluster resize -n -g -t 0
az batchai cluster resize -n -g -t 1

Thanks, Alex
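The resize-to-zero-and-back step can be scripted. Below is a rough sketch that only builds the two command lines, using just the `-n`, `-g` and `-t` flags that appear in the comment above. The cluster and resource group names are placeholders, and the helper `resize_commands` is hypothetical.

```python
import shlex
from typing import List


def resize_commands(cluster: str, group: str, target: int) -> List[List[str]]:
    """Build argv lists that scale a cluster to 0 nodes and then back to `target`."""
    base = ["az", "batchai", "cluster", "resize", "-n", cluster, "-g", group, "-t"]
    return [base + ["0"], base + [str(target)]]


# Placeholder names; substitute your own cluster and resource group.
for argv in resize_commands("mycluster", "myresourcegroup", 1):
    print(shlex.join(argv))
```

To actually run the commands, each list could be passed to `subprocess.run(argv, check=True)`. It is probably safest to wait until the unusable nodes are gone before issuing the second command, although the thread does not say whether that is required.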

AlexanderYukhanov commented 6 years ago

vidit-bhatia, I see you still have one node in an unusable state. You will probably want to delete it if you are not using it via ssh, because it is still allocated and is considered to be in use by you (so it will be included in the bill). You can set the min size for your cluster to 0 so that nodes are deleted when you are not using them.

AlexanderYukhanov commented 6 years ago

Please note that the system checks every 5 minutes whether it needs to resize the cluster. So it can take up to 5 minutes for BatchAI to start allocating nodes after you submit a job.
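As a rough illustration of the delay described above, the sketch below computes how long a newly submitted job waits for the next resize check. It assumes the checks run on a fixed 5-minute period, which the thread implies but does not state exactly; the function name is hypothetical.

```python
CHECK_INTERVAL_S = 5 * 60  # resize check period quoted in the comment above


def wait_until_next_check(seconds_since_last_check: float,
                          interval: float = CHECK_INTERVAL_S) -> float:
    """Seconds until the next resize check, assuming a fixed check period."""
    return interval - (seconds_since_last_check % interval)


# A job submitted one minute after a check waits four more minutes.
print(wait_until_next_check(60))
# A job submitted right after a check waits the full interval.
print(wait_until_next_check(0))
```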

AlexanderYukhanov commented 6 years ago

vidit-bhatia, can you please recreate your cluster? The issue is that your cluster was created before the recent Ubuntu Meltdown patch and kernel update. Now, when your cluster tries to allocate nodes, they get the new kernel but the old drivers.

vidit-bhatia commented 6 years ago

@AlekseiPolkovnikov I will look into it on Monday and see how that can be done, as the cluster is already being used by some people.

AlexanderYukhanov commented 6 years ago

We have implemented a workaround on our side so that nodes pick up the new drivers after a resize. So you may keep the cluster; just make sure that all your unusable nodes are removed.

vidit-bhatia commented 6 years ago

@AlexanderYukhanov It seems the workaround does not work.

AlexanderYukhanov commented 6 years ago

What is happening?

vidit-bhatia commented 6 years ago

image

Same as before: as soon as the node starts, it becomes unusable.

AlexanderYukhanov commented 6 years ago

Taking a look.

AlexanderYukhanov commented 6 years ago

Now it's a different issue: "Blob fuse mounting failed". Can you please check the account name, key, and container name?
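Before recreating anything, the three values can be sanity-checked locally. The sketch below assumes the mount targets an Azure Storage blob container and applies the published naming rules (account names are 3 to 24 lowercase letters or digits; container names are 3 to 63 lowercase letters, digits, or single hyphens, starting and ending with a letter or digit). It only catches formatting mistakes such as typos or stray whitespace; it cannot tell whether the account, key, or container actually exists. The helper `check_mount_settings` is hypothetical.

```python
import base64
import binascii
import re
from typing import List

ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# Starts and ends with a letter or digit; hyphens never appear twice in a row.
CONTAINER_RE = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def check_mount_settings(account: str, key: str, container: str) -> List[str]:
    """Return a list of formatting problems; an empty list means none were found."""
    problems = []
    if not ACCOUNT_RE.fullmatch(account):
        problems.append("account name: expected 3-24 lowercase letters or digits")
    if not (3 <= len(container) <= 63 and CONTAINER_RE.fullmatch(container)):
        problems.append("container name: expected 3-63 lowercase letters, digits, single hyphens")
    try:
        # Storage account keys are base64 text; reject anything that does not decode.
        if not key:
            raise ValueError("empty key")
        base64.b64decode(key, validate=True)
    except (binascii.Error, ValueError):
        problems.append("account key: not valid base64")
    return problems


# Made-up values for illustration only.
fake_key = base64.b64encode(b"x" * 64).decode()
print(check_mount_settings("mystorage", fake_key, "my-container"))
print(check_mount_settings("MyStorage", "not base64!", "bad--name"))
```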

vidit-bhatia commented 6 years ago

Looking into it

vidit-bhatia commented 6 years ago

@AlexanderYukhanov The Python API does not seem to allow me to update the mount settings. Do I need to delete and recreate the server again?

AlexanderYukhanov commented 6 years ago

Yes, you need to recreate it. It's not possible to change mount settings after the cluster has been created.