rancher / os

Tiny Linux distro that runs the entire OS as Docker containers
https://rancher.com/docs/os/v1.x/en/
Apache License 2.0

Latest rancherOS ISO freeze at boot time on virtualbox #389

Closed ArKam closed 9 years ago

ArKam commented 9 years ago

Hi guys,

I'm currently setting up a RancherOS host on VBox v4.3.28 from the latest RancherOS iso available on your website https://releases.rancher.com/os/latest/rancheros.iso (v0.3.1) at this time.

I do want to setup this host for a demo of what RancherOS can achieve and how it could help us to renew our production infrastructure (around 100 hosts from AWS and BareMetal).

However, I'm facing a rather strange error involving the 9pnet device. After installing the ISO to /dev/sda everything seems OK and RancherOS asks for a reboot. Once rebooted, the host hangs indefinitely with the following message:

9pnet: Could not find request transport: virtio.

The cloud-config.yml I used is available at the following URL: https://github.com/ArKam/seed/blob/master/cloud-config.yml

So, could you help me regarding this error?

imikushin commented 9 years ago

I'll look into this.

ArKam commented 9 years ago

Thanks a lot. For information, I've disabled sound and the USB ports/channels. The storage controller is a SATA controller; I also tested with SAS and got the same error.

My network adapter is an Intel PRO/1000 T Server in bridge mode.

EDIT: I also tried using virtio interface in NAT mode with the exact same behavior.

ArKam commented 9 years ago

I tried to create a new VM on VirtualBox using the script provided at https://github.com/rancherio/os/blob/master/scripts/build-vbox-vm and got the exact same issue.

For information, I'm running VBox on Mac OS X 10.10.

Xe commented 9 years ago

I ran into this last night. Wait about 30 seconds and it will resolve itself.

ArKam commented 9 years ago

Nope, still hanging after 15 minutes of waiting in front of the VM.

imikushin commented 9 years ago

@ArKam sorry for the delay with this issue. What are the exact steps to reproduce? I'm using the latest OS X (10.10.3) and VirtualBox (4.3.28) and can't see any problem running RancherOS v0.3.1.

Also, you might want to use "virtio" network adapters (instead of "Intel PRO/1000 T Server") in VirtualBox: they are faster because there's no hardware to emulate.

ArKam commented 9 years ago

Hi @imikushin, OK, so I'm running VBox 4.3.28 on Mac OS X 10.10.4 and I've migrated to the virtio driver.

The process is pretty simple:

1. Download the v0.3.1 ISO (the latest at this time).
2. Create a VM in VBox following your recommendations (USB ports disabled, sound disabled, virtio driver for the network interface).
3. Start the VM with the ISO as the boot device.
4. Execute: sudo rancheros-install -c ./cloud-init.yml -d /dev/sda
5. Wait for completion.
6. Eject the ISO.
7. sudo reboot
8. Wait, forever alone :D

imikushin commented 9 years ago

Okay. I repeated these steps and actually saw the message about 9pnet. But my VM isn't blocked. Can you ping the VM IP address? And try to ssh into it. Use hostonly network with DHCP.
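
The "ping it, then try SSH" check can also be scripted from the host. The following is only a minimal sketch: the function name `tcp_port_open` and its defaults are made up for illustration, and it tests TCP reachability of port 22, which is not the same as a successful SSH login.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `tcp_port_open("<VM address>")` with the VM's host-only address returns False when nothing answers on port 22, which would match a VM that responds to ping but never starts its SSH service.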

ArKam commented 9 years ago

@imikushin, I'm really sorry, but even with the host-only network selected, the VM still hangs indefinitely on the 9pnet message. Do you want me to send you my VM?

imikushin commented 9 years ago

Sure. Just give me a URL where I can get it. You can export the VM in virtualbox to an archived format.

ArKam commented 9 years ago

OK, many thanks. Here is the OVA: https://github.com/ArKam/issues/blob/master/rancheros.ova

faisyl commented 9 years ago

Is this because it's waiting for the ubuntu console to download ? I ran into the same problem when I first tried it on a bandwidth constrained VM install.

ArKam commented 9 years ago

Well, I haven't set any bandwidth restriction and I'm using a 10GBps/10GBps link, as we have a large internet connection at work, so it doesn't seem to be the same issue as yours.

imikushin commented 9 years ago

@ArKam We've released RancherOS v0.3.3 recently. Can you check if you're still experiencing the issue with RancherOS v0.3.3? If this is still the case, can you attach your exported VM for me to check?

ArKam commented 9 years ago

No problems, I'll do it tomorrow morning ;-)

jest commented 9 years ago

I have the same problem as OP, but running 0.3.3 under ESXi 6.0. Interestingly, the machine is reachable on the network (it can be pinged), but SSH is not available.

deniseschannon commented 9 years ago

@jest Did you run rancheros-install before trying to ssh in?

http://os.docs.rancher.com/docs/running-rancheros/server/install-to-disk/

jest commented 9 years ago

Of course, everything as in the manual.

In the end I managed to run the installed RancherOS, so it is not related to VMWare: 9p warnings are still reported, but everything works.

I'll investigate more; the next suspect is cloud-config

jest commented 9 years ago

I'm sorry, I cannot recreate the problem. I can only assure you that there was some problem; at the end I suspect some strange combination of parameters of virtual machine (unfortunately I have removed the original problematic VM...)

ArKam commented 9 years ago

OK, sorry for the late reply, but I'm quite busy these days :D - You'll find below an updated version of the VM (completely recreated using the latest version), hanging at the exact same point.

https://github.com/ArKam/issues/blob/master/rancheros-0.3.3.ova

Bhlowe commented 9 years ago

I'm seeing this on my ESXI 5.5.0. Booting from the rancheros.iso with an unformatted drive worked fine. I ran rancheros-install using a simple cloud-config (1 pub key only, no network settings as I wanted dhcp.)

Once the new (bad) install is on the disk, I can't boot back into the virtual CD based rancher.iso file. So I can't change the install options to try to reinstall a different version or network setup.

jest commented 9 years ago

@Bhlowe If you want to boot from CD while having a bootable HD, you have to change the boot device order in the virtual BIOS. Depending on the VMware client, you should see an option like "Enter BIOS after boot" somewhere.

Bhlowe commented 9 years ago

Thanks @jest, I figured that out (boot options menu in ESXi), but I'm still unable to get RancherOS to boot past the "9pnet: Could not find request transport: virtio" message. I have installed CoreOS from an ISO, so my machine isn't completely weird. I can let anyone see this via TeamViewer if anyone wants to debug.

The error message is printed twice: the first time, boot continues after 5 seconds; the second time, it hangs forever.

[Screenshot: rancher-boot]

FYI, I have two ethernet ports on this ESXI but only one should be accessible to the image. (E1000)

pleegor commented 9 years ago

Having the same issue while running everything on bare metal...

5m1l3 commented 9 years ago

+1, I have the same issue on bare metal after installation, but the live CD ISO runs fine.

Any ideas?

5m1l3 commented 9 years ago

I found this thread: https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg01330.html. I unpacked the initramfs from the current RancherOS, and I think adding 9pnet_virtio to it could solve this problem.
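
One rough way to check whether an initramfs image ships that module is to decompress it and look for the module name, since cpio archives store file names uncompressed. This is a sketch under assumptions: the helper name `image_mentions` is invented here, the image is assumed to be gzip- or xz-compressed (or uncompressed), and a miss does not prove the driver is absent, because it could be built into the kernel itself.

```python
import gzip
import lzma
import zlib


def image_mentions(image: bytes, needle: bytes = b"9pnet_virtio") -> bool:
    """Decompress an initramfs image if possible and search it for needle."""
    for decompress in (gzip.decompress, lzma.decompress):
        try:
            image = decompress(image)
            break
        except (OSError, EOFError, zlib.error, lzma.LZMAError):
            continue  # not this compression format; try the next one
    # If no decompressor matched, the raw bytes are searched as-is.
    return needle in image
```

With the image read into memory (for example `open(path, "rb").read()`, where `path` points at the unpacked initrd file), `image_mentions(data)` reports whether any file name or content in the archive contains `9pnet_virtio`.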

valqk commented 9 years ago

I've had the same problem with VirtualBox, and I found where it came from in my case. It turned out that if I add a user in the cloud-config file, I get the virtio error. More precisely, booting hangs right after that error is printed. When the install works I still get the error, but afterwards I see the login prompt. If I have only the hostname and ssh keys, the installed OS works OK.

deniseschannon commented 9 years ago

@valqk Currently RancherOS doesn't support adding in users into the cloud config file. I've tried to add some notes into our documentation to make it more apparent that it's not supported. We are looking to add in users and have an issue open to track it: https://github.com/rancher/os/issues/263

Bhlowe commented 9 years ago

I tried again with a simple cloud-config.yml that just had ssh keys and hostname and it failed the same way. I think @5m1l3 is on the right track about adding 9pnet_virtio to the initramfs, repack, and retry.

ArKam commented 9 years ago

This is not an issue with users, as I don't have any in my config, but as you pointed out, it MAY be a problem with cloud-init waiting indefinitely after its latest error.

I'll check to install it without cloud-config file and see if the problem is back.

valqk commented 9 years ago

@deniseschannon - yeah, I noticed that afterwards and that's why I removed the users from it. I just wanted to report how I fixed my issue. I'm still getting the virtio error, but afterwards I get the login prompt.

ArKam commented 9 years ago

Any news regarding this bug? Given that customers can reproduce it on bare-metal installations, it should be one of your top priorities, as it deeply affects whether new potential customers will advocate for your product.

Bhlowe commented 9 years ago

I'm happy to test if someone knows how to add 9pnet_virtio and repackage. That should do the trick.

Brad

ghost commented 9 years ago

This is still an issue as I have tried a ton of different installs with no dice. I am on ESXi 6 running hardware version 9 with minimal cloud-config.yml file.

ibuildthecloud commented 9 years ago

Sorry if it seems like we aren't working on this issue. We are about to release RancherOS v0.4 which is a major change to address a lot of the usability concerns of RancherOS. One big problem is that when startup fails there is very little information printed such that it's quite hard to troubleshoot. There is a plethora of root causes that could be causing all of the issues listed and in most situations the 9p error message is completely unrelated.

@kpelt You said you have seen this issue on tons of different installations. What is the easiest way to reproduce it? Do you see this in VMware Desktop/Fusion by chance?

ghost commented 9 years ago

I mean basically like I am insane, "doing the same things over and over again expecting a different result", with minor changes each time. I am only doing this in a lab with ESXi on HP blades running vSphere 6. I have not tried it on other installations because my end goal is to have it inside my lab where I can build out much more functionality in the future.

ghost commented 9 years ago

I'll have to use another OS for the time being because I don't want to build on top of something that on reboot could wipe me out. I'll just wait for the next version.

imikushin commented 9 years ago

I think I might know what was causing the freeze: I've been playing with @ArKam's OVA (thanks @ArKam and sorry for the unthinkable delay with this) and tried to use it with rancheros.iso "inserted": this way, the OS boots from the ISO, but uses the virtual HDD as the state drive.

One thing that's lost after installing to HDD is the log messages of services going up. Looks like the error messages are gone for good, too. While booting with the ISO installed I can see that 'cloud-init-pre' and 'cloud-init' fail to run (exit with status code 1).

imikushin commented 9 years ago

Also, what cloud-config are you using to install RancherOS? That might help find the root cause of this.

ArKam commented 9 years ago

@ibuildthecloud, I'm really pleased to hear you back on this topic as I personally really put a large amount of optimism on your platform !

How can we help you to solve this issue ? I've provided my VM twice and can do it again if needed.

imikushin commented 9 years ago

@ArKam Can you please provide the cloud-config that you used with rancheros-install?

ArKam commented 9 years ago

@imikushin \o/ Glad I could help, and glad to hear back from you too!! I know that you guys are working hard and that we are in the middle of summer, which is why I didn't ping earlier. Anyway, thanks a lot for this information. I'll give you my cloud-config file as soon as I'm back home.

ArKam commented 9 years ago

@imikushin Hi, here is the cloud-config file: https://github.com/ArKam/seed/blob/master/cloud-config-rancheros.yml

imikushin commented 9 years ago

Sorry for not commenting on this thread for so long.

The issue was caused by failing to boot with incorrect cloud-config. RancherOS should have at least ignored the incorrect cloud-config, but continue to boot anyway, so that behaviour was clearly wrong.
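
Since the root cause was an invalid cloud-config, a quick sanity check before running rancheros-install may catch the most obvious mistakes. The sketch below is illustrative only: `cloud_config_problems` is a made-up helper, it checks just the `#cloud-config` header line and tab indentation (which YAML does not allow), and it cannot detect unsupported keys or other semantic errors.

```python
from typing import List


def cloud_config_problems(text: str) -> List[str]:
    """Return a list of obvious problems; an empty list means none were spotted."""
    problems = []
    lines = text.splitlines()
    # A cloud-config file is expected to start with the '#cloud-config' header.
    if not lines or lines[0].strip() != "#cloud-config":
        problems.append("first line is not '#cloud-config'")
    for number, line in enumerate(lines, start=1):
        # Leading whitespace of the line; YAML forbids tabs for indentation.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {number}: tab used for indentation")
    return problems
```

An empty result only means these two checks passed; a full YAML parser would still be needed to confirm that the file is well-formed.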

I'm pleased to finally announce that this issue is fixed in v0.4.0, and the release is going to be out very soon - hopefully this week.

Thanks everyone and personally @ArKam for providing feedback and for your patience.

ArKam commented 9 years ago

Thanks a lot to the RancherOS team for your dedication and the way you worked on this issue. It may have taken a little long at times, but the result was worth the wait ;-)

mbettan commented 9 years ago

I have exactly the same issue with ESXi 5.5 and rancherOS 0.3.3

Below is my cloud init file, any idea why?

#cloud-config
ssh_authorized_keys:
  - ssh-rsa 
AAAAB3NzaC1yc2EAAAABJQAAAQEAkly0yQzYjKaq8QpMgR9vq+zn2ibmiT55DONn
HSKxqLhSgWa0r4/xLm5Hb/KsbnPYzXejiuRD7Wkn9qSVrI/b/D1xDcBy6CZ9qifG
Gr724pIuuL/9OmKYT+WJNUTdRR2iPczvwwOyH5eXf8A4wLc9jaAF6cYlRanCkgYf
XlBXCmcVmctzdyH5aw/aqr4Rqy0K5MdEferc7RlYE25WAg9oYv6NWhBn2VFvyZ2N
0+OfDU1rQE7ZxIZLTMKYi/xUIjenkLC8BtEkxqW1zCSmU/LUC0jGBFAtJINBbCAa
OJbx8WKk+QBxxTOl9jp9dLp5xpKdQemrkp7nBdEd/WGMKBaf7Q

rancher:
  network:
    interfaces:
      eth*:
        dhcp: false
      eth0:
        address: 172.16.1.200/16
        gateway: 172.16.1.1
        mtu: 1500
    dns:
      nameservers:
        - 8.8.8.8
        - 8.8.4.4

hostname: rancher

ibuildthecloud commented 9 years ago

@mbettan We are right on the heels of releasing v0.4. Do you want to try v0.4-rc11 and see if it works?

mbettan commented 9 years ago

[Screenshot: error]

Not working with 0.4-rc11 @ibuildthecloud

ibuildthecloud commented 9 years ago

okay. Have you by chance tried vmware desktop? just curious if that fails too. It will take a bit longer to validate ESXi because we don't have an environment handy. But I believe I can deploy ESXi in VMware Desktop

mbettan commented 9 years ago

No, I didn't try it, as we would like to run a POC of RancherOS within ESXi clusters for this project.

Most of your future users will use ESXi hypervisors (or Hyper-V / KVM / Xen) or bare metal as an enterprise-grade solution. People use VMware Workstation only for local testing, so it would be good to add a hypervisor testing phase before releasing.