nix-community / nixos-anywhere

install nixos everywhere via ssh [maintainer=@numtide]
https://nix-community.github.io/nixos-anywhere/
MIT License

Cannot SSH into Hetzner bare metal after install #346

Open szethh opened 1 month ago

szethh commented 1 month ago

I am trying to bootstrap a Hetzner bare metal server with NixOS. The installation process seems to go fine (no errors), but after the machine reboots it becomes unreachable (no response to pings, ssh, etc.).

I then have to set up the rescue system and send a ctrl+alt+del to get access to it again.

My config is here https://github.com/szethh/nixie/blob/main/hosts/htz/default.nix

The command I use to run it is: nix -v run github:nix-community/nixos-anywhere -- --debug -L root@ip --flake .#htz --build-on-remote

I need to use --build-on-remote since I am on macOS.

The machine is an Intel, with 2 sata HDDs, if that matters.

I am quite new to this project, to disko, and nixos in general, so I might be overlooking something obvious here 😅

johanot commented 1 month ago

Same here. Tried everything today:

nothing works. Seems almost like some network issue at Hetzner, but that's just weird.

The only strange thing I could see while debugging the kexec-installer was that e1000e seems to have negotiated 10Mb/s full-duplex.
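For anyone wanting to check the same thing, the negotiated speed and duplex can be read from sysfs inside the installer environment. This is a minimal sketch, assuming the standard Linux `speed` and `duplex` attributes under /sys/class/net; `SYSFS_NET` and `link_report` are hypothetical names introduced only so the function can be pointed at another directory.

```shell
#!/bin/sh
# Sketch: print negotiated speed (Mbit/s) and duplex for each network interface.
# SYSFS_NET is a hypothetical override; on a real system the data lives in
# /sys/class/net.

link_report() {
  net_dir="${SYSFS_NET:-/sys/class/net}"
  for dev in "$net_dir"/*; do
    # Skip anything that is not an interface directory (or an unmatched glob).
    [ -d "$dev" ] || continue
    name="${dev##*/}"
    # Reading these files fails on interfaces that do not report a link
    # speed (e.g. lo), so fall back to "unknown".
    speed="$(cat "$dev/speed" 2>/dev/null || echo unknown)"
    duplex="$(cat "$dev/duplex" 2>/dev/null || echo unknown)"
    printf '%s speed=%s duplex=%s\n' "$name" "$speed" "$duplex"
  done
}

link_report
```

On a gigabit NIC, a reported speed of 10 would match the symptom described above.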

szethh commented 1 month ago

I tried the suggestions from @Mic92 on #110 but no luck.

Maybe something changed recently on Hetzner's side?

In any case do let me know if you find a fix, this is quite puzzling

johanot commented 1 month ago

I've been compiling kernels most of the day, but alas.

The only breakthrough I had is that everything works with non-intel nics, i.e. Realtek (which is generally in the AMD-series).

It looks like e1000e (intel nics) fails to negotiate link properly after kexec. After reboot everything is fine. It only happens in the kexec environment. I also tried unloading and loading the driver again inside the kexec env - no luck.

johanot commented 1 month ago

@szethh The final test I wanted to make for today actually made it work! (for me at least)

--kexec-extra-flags "--no-ifdown"
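Combined with the invocation from the original post, the flag would be passed roughly as in the sketch below. `root@ip` and `.#htz` are the placeholders from that post; `TARGET`, `FLAKE`, `DRY_RUN` and the function name are hypothetical helpers, not nixos-anywhere options. By default the script only prints the command so it can be inspected first.

```shell
#!/bin/sh
# Sketch only: the invocation from the original post plus the extra kexec flag.
# TARGET, FLAKE and DRY_RUN are hypothetical helper variables.

nixos_anywhere_cmd() {
  # Defaults are the placeholders used in the original post.
  target="${TARGET:-root@ip}"
  flake="${FLAKE:-.#htz}"
  set -- nix -v run github:nix-community/nixos-anywhere -- \
    --debug -L "$target" \
    --flake "$flake" \
    --build-on-remote \
    --kexec-extra-flags "--no-ifdown"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Print the command (space-joined, quoting not preserved) for inspection.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

nixos_anywhere_cmd
```

Setting `DRY_RUN=0` runs the command instead of printing it.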

szethh commented 1 month ago

did you make any other changes to your config?

i tried that flag but i still can't ssh or ping the machine in any way... these are the last logs i get before it reboots and goes dark:

+ kexecSyscallFlags=--kexec-syscall-auto
+ sh -c '/root/kexec/kexec/kexec' --load '/root/kexec/kexec/bzImage'   --kexec-syscall-auto   --no-ifdown   --initrd='/root/kexec/kexec/initrd' --no-checks   --command-line 'init=/nix/store/w5967zp4vrgi8hhsyzb6xv6pv02182j2-nixos-system-nixos-installer-24.05pre-git/init console=tty0 console=ttyS0,115200 root=fstab loglevel=4'
machine will boot into nixos in 6s...
+ echo machine will boot into nixos in 6s...
+ test -e /dev/kmsg
+ exec
+ timeout_ssh_ -- exit 0
+ timeout 10 ssh -i /tmp/tmp.D2PJmr8Y50/nixos-anywhere -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@ip -- exit 0
Warning: Permanently added 'ip' (ED25519) to the list of known hosts.
+ ssh_connection=root@ip
+ ssh_ -o ConnectTimeout=10 -- exit 0
+ ssh -t -i /tmp/tmp.D2PJmr8Y50/nixos-anywhere -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@ip -o ConnectTimeout=10 -- exit 0
ssh: connect to host ip port 22: Operation timed out
+ sleep 5
...
# repeated ssh attempts, all timed out

johanot commented 1 month ago

It's crazy because I might have actually triggered a race-condition yesterday. Today nothing works again. :(

johanot commented 1 month ago

Update:

I spent an entire day debugging. Tried unloading and loading the e1000e driver multiple times, even tried asking the driver to reset the NIC and many other things, but I have limited knowledge - and time, especially when the virtual console (KVM) is difficult to get for longer periods at Hetzner. I also asked Hetzner directly for help, but they said it smells like software, which they don't provide support for.

So instead of fighting the kexec+intel+kexec problem even more, I went with a modified version of nixos-anywhere: https://github.com/johanot/nixos-anywhere/commit/2c804da6062055d52fab08340de76a1ae3578bf1 that uses the rescue system itself as the partitioning/install environment, instead of kexec. It's not very far from just rewriting my own install script from scratch, but maybe with the right cleanup and parametrization, it can become a PR to nixos-anywhere some day, e.g. --no-kexec + --prepare-installer-script :shrug:

Mic92 commented 1 month ago

I once used this script on Hetzner on an Intel machine that did not support kexec: https://github.com/nix-community/nixos-anywhere/issues/136 It modifies their rescue system to pretend to be a NixOS installer.

Mic92 commented 1 month ago

Apparently there is also hardware (usually PCI devices) that doesn't support kexec because it cannot be restarted. I believe the Linux port on ARM MacBooks, for example, has this issue.

johanot commented 1 month ago

@Mic92 Thanks! The funny thing here is that I'm 100% certain that my 4-5 Intel machines at Hetzner worked with kexec without issues like 3-4 weeks ago... So I wonder what changed... Trying different kernels doesn't help, so I lean towards either new firmware or changed switch configuration at Hetzner... But yeah, we can work around it for sure.

szethh commented 1 month ago

johanot@2c804da

I tried this and the ssh issue went away, thanks @johanot! Now I'm having trouble with setting up a RAID array, but I think that's more of an issue relating to disko, not nixos-anywhere (nix-community/disko#705)