harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

Cannot start VMs, I've had a long journey to get to where I am. #5513

Closed: prhiam closed this issue 3 days ago

prhiam commented 4 months ago

I've had a tough time getting a Harvester node up. I know I've got only two nodes in the cluster, and I don't have a third to add at this time. Right now I cannot start Harvester VMs: the volumes always say degraded and replica scheduling failed. I really want to use and learn Harvester and Rancher, but this is starting to get frustrating after over two weeks just to stand the machines up. Can anyone give me an idea as to why it would generate this error even though there is plenty of disk space for what I'm trying to do? I have 600 GB free on the machine and I'm asking for a disk size of 40 GB.

[Screenshot 2024-04-05 at 8 43 31 AM]

RegisHubelia commented 4 months ago

Hi there. Can you go directly into the k8s cluster, in Longhorn, and see what is going on there? Once logged in, you need to modify the URL to get access to the actual underlying k8s cluster. As an example, take https://x.x.x.x/dashboard/harvester/c/local/kubevirt.io.virtualmachine and remove the harvester and the kubevirt.io.... parts, so the URL looks like https://x.x.x.x/dashboard/c/local/ - you should then see a Longhorn tab in the left menu, which will give you a better idea of what is going on. If you can, provide screenshots of the volumes, and also a screenshot of one of the nodes' "Storage" sections (in Harvester: Hosts, click the host, and you will see a "Storage" section).
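
If you're comfortable with kubectl, you can also check the same thing from a shell; a rough sketch, assuming the usual Longhorn CRDs in the longhorn-system namespace (pvc-xxxx is just a placeholder):

```bash
# List Longhorn volumes with their state and robustness (Healthy/Degraded/Faulted)
kubectl -n longhorn-system get volumes.longhorn.io

# Inspect one volume's conditions and replica scheduling details
# (replace pvc-xxxx with a real name from the list above)
kubectl -n longhorn-system describe volumes.longhorn.io pvc-xxxx

# See every replica and which node it was scheduled on, if any
kubectl -n longhorn-system get replicas.longhorn.io -o wide
```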

prhiam commented 4 months ago

I guess I'm not sure how to troubleshoot properly, which is my fault. All I see is the disks there, then degraded. I tried attaching two volumes to a node and it says there are no scheduled replicas. I haven't edited anything since my install, so I'm not sure how to get scheduled replicas. Attached are all the screenshots. Thank you for showing me how to get into Longhorn, though; I never knew how to get into the k8s cluster dashboard before, and it's very beneficial.

Any information you can give me would be great.

Thank you, Paul Hiam

RegisHubelia commented 4 months ago

Hey - I don't see your screenshots...

prhiam commented 4 months ago

Now?

RegisHubelia commented 4 months ago

Nope. Ideally, go on GitHub and add the screenshots there, as I can see that you replied via email.

prhiam commented 4 months ago

Sorry about that, here they are.

[Screenshots 2024-04-05 at 9 17 26 AM, 9 56 56 AM, 9 18 57 AM, 9 18 44 AM, 9 17 15 AM, 9 16 47 AM, and 9 14 11 AM]

RegisHubelia commented 4 months ago

Can you please click on one of the volumes in Longhorn? It will show the nodes and replicas: [image]

And send a screenshot? Also, on your volumes in Longhorn, change the replica count from 3 to 2, since you have only 2 nodes. If you are going to use only 2 nodes, create a new storage class and set the replica count to 2, then use that storage class for your new VMs; for now, setting the replica count to 2 should be fine for the existing ones.
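
As an aside, the default replica count is also a Longhorn setting you can read and change with kubectl if the UI won't cooperate; a quick sketch, assuming standard Longhorn CRD names (note this only affects volumes created after the change):

```bash
# Show the cluster-wide default replica count used for new volumes
kubectl -n longhorn-system get settings.longhorn.io default-replica-count

# Lower it to 2; existing volumes keep the count they were created with
kubectl -n longhorn-system patch settings.longhorn.io default-replica-count \
  --type merge -p '{"value":"2"}'
```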

prhiam commented 4 months ago

I've changed the replicas to 2 in the settings, but that hasn't affected the main Storage Classes already created during install. Should I re-create the VM and see if it goes through with the new settings? Also, I had already used a new Storage Class, and that one fails as well.

[Screenshot 2024-04-05 at 10 58 18 AM]

RegisHubelia commented 4 months ago

From your screenshot it seems the replicas are still 3. Did you change the setting to 2 for this volume?

prhiam commented 4 months ago

I changed it in the Longhorn settings; I'm not sure how to set it for the Storage Class on the Harvester cluster VIP.

RegisHubelia commented 4 months ago

Basically, in the Longhorn dashboard, on the Volumes tab, there is a hamburger menu on the right of each volume. Click on it, choose "Update Replicas Count", and change it to 2; do that for all your current volumes. You can also do it in bulk by selecting all volumes and clicking the hamburger menu at the top left of the screen (next to Detach and Create Backup), where you'll find Update Replicas Count so you can do them all in one go.
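
If the UI bulk action gives you trouble, the same change can be scripted; a rough sketch, assuming kubectl access (sanity-check the volume list before running it):

```bash
# Set numberOfReplicas to 2 on every Longhorn volume in one pass
for v in $(kubectl -n longhorn-system get volumes.longhorn.io -o name); do
  kubectl -n longhorn-system patch "$v" --type merge \
    -p '{"spec":{"numberOfReplicas":2}}'
done
```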

prhiam commented 4 months ago

OK, so that's done, but I still can't create VMs... In Longhorn, one volume is still detached and can't be attached to a node without the "no replicas available" error. The other defaulted to 3 replicas again; I had to manually change it to get it to work. Should I reboot the nodes for the default replica counts to go to 2?

[Screenshot 2024-04-05 at 11 31 12 AM]

prhiam commented 4 months ago

The Storage Classes are still reading 3.

[Screenshot 2024-04-05 at 11 34 10 AM]

RegisHubelia commented 4 months ago

No. Just clone the harvester-longhorn storage class to something like harvester-2-replicas (or whatever) and set the replica count to 2. When you create a VM, make sure you select that new storage class you just created, and do so for all other VMs going forward. For the sake of simplicity, if your VMs don't contain anything important, just delete them and recreate them using the new storage class. Let me know if it all works after this. If not, please include new screenshots of the current state, including volume details (clicking on the volume in the LH UI) and the events on the dashboard of the k8s cluster interface.
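
For reference, the clone boils down to a StorageClass like the one below; harvester-2-replicas is just the example name from above, and the parameters are what a stock harvester-longhorn class typically carries, so compare against yours before applying:

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: harvester-2-replicas
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"       # the only change from the default class
  staleReplicaTimeout: "30"
  migratable: "true"          # keeps VM live migration working in Harvester
EOF
```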

prhiam commented 4 months ago

Both volumes failed; both go into detached in Harvester. This is the frustration I've lived with for 2 weeks. I've also tried much of this before (from the storage class in the Harvester UI), and I've rebuilt the cluster nodes 3 times as well.

[Screenshots 2024-04-05 at 12 09 47 PM, 12 09 36 PM, 12 09 28 PM, 12 08 26 PM, and 12 08 15 PM]

RegisHubelia commented 4 months ago

At this point, maybe it's better if someone from Harvester hops in, and you should probably provide a support bundle for them to analyse... One thing you could try is to delete the instance manager pods and the scheduler pods in the longhorn-system namespace and see if that helps. If not, then a support bundle would contain the information needed to troubleshoot.
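
If you try the pod deletion, the instance managers are easy to target by label; a hedged sketch (the label below is what recent Longhorn releases use, so verify with --show-labels first; the pods are recreated automatically):

```bash
# Confirm which pods match before deleting anything
kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager

# Delete them; Longhorn respawns instance manager pods on its own
kubectl -n longhorn-system delete pods -l longhorn.io/component=instance-manager
```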

github-actions[bot] commented 2 months ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

Vicente-Cheng commented 2 months ago

Hi @prhiam, sorry for the late reply.

Is the situation still happening? If yes, could you generate a support bundle (REF: https://docs.harvesterhci.io/v1.3/troubleshooting/harvester/#generate-a-support-bundle) for investigation?

Thanks!

prhiam commented 2 months ago

As much as I wanted to use Harvester for my hypervisor, I finally just gave up on it. There were too many issues I kept running into, from this one (which I resolved through a new install) to provisioning new hosts with cloud-init images. It was non-stop issue after issue. It took way too much of my time, so I just installed Proxmox and haven't looked back.

Thank you

github-actions[bot] commented 2 weeks ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.