raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

Coding ip=169.254.3.14 hangs during boot if there's no cable connected #589

Closed DougieLawson closed 10 years ago

DougieLawson commented 10 years ago

I quite often connect my RPi to my Windows system with a direct cable using the 169.254.xxx.xxx address scheme. By assigning 169.254.3.14 I can easily find my RPi. It's a convenient way to get connected when working away from home, before the WiFi (which often needs a password or web page interaction) is running.

If I bring the machine home and don't connect a cat5 cable to my home router (or connect to my laptop) then the kernel hangs during boot. I get the splash screen and the Raspberry logo and nothing more.

Looking in the code and /proc/config.gz I think I've found the cause.

The kernel config has CONFIG_ROOT_NFS=y, so when we run the code in net/ipv4/ipconfig.c the retries count is ignored and we loop round:

#ifdef CONFIG_ROOT_NFS
                        if (ROOT_DEV ==  Root_NFS) {
                                pr_err("IP-Config: Retrying forever (NFS root)...\n");
                                goto try_try_again;
                        }
#endif

which causes the boot to hang.

The quick resolution is simple; either of these works:

  1. Pull the card, edit cmdline.txt to drop the ip=169.254.3.14 and reboot
  2. Wire the ethernet to my laptop

The permanent fix is to unset CONFIG_ROOT_NFS and rebuild the kernel.

asb commented 10 years ago

But all the ip setting stuff is part of CONFIG_ROOT_NFS, e.g. the ip kernel command line parameter is part of nfsroot: https://www.kernel.org/doc/Documentation/filesystems/nfs/nfsroot.txt. It is (ab)used for other purposes though, e.g. 9P2000 root.

Ultimately, you should be assigning an IP with userspace configuration (e.g. /etc/network/interfaces). I personally wouldn't consider the behaviour you've encountered a bug.

DougieLawson commented 10 years ago

It's a bug because it's well documented that you can assign an IP using the cmdline.txt ip=vvv.xxx.yyy.zzz and there's no checking that it's being used for an NFS rootfs. If the parm is only to be used with NFS then it should barf earlier or (probably not a good idea) silently ignore it.

I agree if the rootfs was on an NFS device that there's no point continuing, if the connection doesn't come active, but that should bail out with an oops after a few more turns round the loop (with retries set to a higher value). It's never sensible to solidly hang the boot in an endless loop.

It's a convenience thing to be able to set an IP address for the ethernet interface when you can't access the ext4 filesystem. My RPis normally run with fixed addresses from the 10.1.1.0/24 block, so I could fiddle with the Windows side to fix the IP address there, but popping the SD card and updating cmdline.txt is easier (and if it didn't hang I could set it and forget it).

asb commented 10 years ago

Wait a minute, the code you post doesn't really explain the issue you're seeing. Why would ROOT_DEV == Root_NFS be true?

DougieLawson commented 10 years ago

That appears to be the only place in the code where we loop back to try_try_again without decrementing retries.
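To make the control flow under discussion concrete, here is a minimal user-space sketch of it, not the kernel code itself. The names try_configure, ip_config_model, succeeds_on and max_spins are hypothetical and exist only for this illustration; max_spins is a safety cap so the model terminates, and the real loop has no such cap.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one autoconfiguration attempt (not a kernel API).
 * It reports success once attempt number `succeeds_on` is reached;
 * pass 0 to model a link that never comes up. */
static bool try_configure(int attempt, int succeeds_on)
{
    return succeeds_on != 0 && attempt >= succeeds_on;
}

/* Simplified model of the retry flow: returns the number of attempts made
 * on success, or -1 when the model gives up. */
int ip_config_model(int retries, bool root_is_nfs, int succeeds_on, int max_spins)
{
    int attempt = 0;

    for (;;) {
        attempt++;
        if (try_configure(attempt, succeeds_on))
            return attempt;
        if (attempt >= max_spins)
            return -1;      /* safety cap, present in this model only */
        if (root_is_nfs)
            continue;       /* "Retrying forever": retries is not decremented */
        if (--retries <= 0)
            return -1;      /* non-NFS root: stop retrying so boot can carry on */
    }
}
```

With root_is_nfs false the model stops once the retry budget is spent; with root_is_nfs true only success (or the artificial cap) ends the loop, which is the behaviour the quoted snippet describes.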

asb commented 10 years ago

Sure, though if ROOT_DEV is set to Root_NFS without passing root=/dev/nfs on the kernel command line it seems that's where the bug really is (and it seems it would be an upstream bug).

popcornmix commented 10 years ago

Removing CONFIG_ROOT_NFS is not an option. I do all my development with an nfs mounted rootfs, and it appears a very common configuration.

I assume (from your description) you are not seeing the message "IP-Config: Retrying forever (NFS root)..."? Are you sure that is where it gets stuck?

DougieLawson commented 10 years ago

I'm going to build a new kernel with debugging set (I may even put some extra messages in). I thought I'd found the hang from reading the code (which handles the ip=vvv.xxx.yyy.zzz parm).

I'd have thought giving it five minutes (rather than endless) before doing something to tell the user the boot isn't going to complete would have been a better design. Perhaps I'm too used to seeing IBM mainframe operating systems set disabled wait states when their initial program load can't continue.

DougieLawson commented 10 years ago

It's amazing what you see when you add some debugging with

#define IPCONFIG_DEBUG

There's a delay loop (120 seconds) before ipconfig.c gives up the ghost and carries on. I guess I've never been patient enough to wait that long before pulling the power and giving up. I could petition for

#define CONF_CARRIER_TIMEOUT    120000  /* Wait for carrier timeout */

to be made smaller or I could just accept that boot is going to hang for two minutes when I'm stupid enough to define ip=vvv.xxx.yyy.zzz but not connect a wire.
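
The carrier wait described above can be sketched roughly as follows. This is a user-space model with a simulated clock, assuming a simple poll-until-deadline structure; carrier_present, wait_for_carrier_model and the poll_ms parameter are hypothetical names for illustration, and only the CONF_CARRIER_TIMEOUT value comes from the source.

```c
#include <stdbool.h>

#define CONF_CARRIER_TIMEOUT 120000 /* milliseconds, as quoted above */

/* Hypothetical link model: the carrier appears at `carrier_up_at_ms`,
 * or never if that value is negative (no cable connected). */
static bool carrier_present(long now_ms, long carrier_up_at_ms)
{
    return carrier_up_at_ms >= 0 && now_ms >= carrier_up_at_ms;
}

/* Polls for a carrier until the timeout expires, then proceeds regardless.
 * Returns how many simulated milliseconds passed before boot would continue.
 * The clock is simulated, so nothing actually sleeps. */
long wait_for_carrier_model(long carrier_up_at_ms, long timeout_ms, long poll_ms)
{
    long now_ms = 0;

    while (now_ms < timeout_ms) {
        if (carrier_present(now_ms, carrier_up_at_ms))
            return now_ms;  /* link came up: carry on immediately */
        now_ms += poll_ms;
    }
    return timeout_ms;      /* no carrier: give up waiting and carry on */
}
```

In this model, wait_for_carrier_model(-1, CONF_CARRIER_TIMEOUT, 1000) returns the full 120000 ms, matching the two-minute stall with no cable, while a link that comes up after three seconds returns 3000.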