ltsp / ltsp

LTSP code, issues and discussions
https://ltsp.org
GNU General Public License v3.0

dnsmasq error with Debian Buster 64bit with 2 nics (solved) #11

Closed rkwesk closed 5 years ago

rkwesk commented 5 years ago

Older server with BIOS and two NICs: enp1s2 for WAN and enp1s5 for LAN.

Clean install of Debian Buster 64-bit, using the Debian installer in expert mode with the basic system only. I rebooted and installed xorg, mate, wget, lightdm, network-manager-gnome, firefox-esr and rsync. I commented out the ethernet interface lines in /etc/network/interfaces so that network manager controls both wired connections, then rebooted again.

From previous LTSP5 experience I thought I needed to use nm-connection-editor to set enp1s5 to share with other computers, with static IP 192.168.67.1. However, this caused an error later.

I then strictly followed the install steps:

wget https://ltsp.github.io/misc/ltsp-ubuntu-ppa-bionic.list -O /etc/apt/sources.list.d/ltsp-ubuntu-ppa-bionic.list
wget https://ltsp.github.io/misc/ltsp_ubuntu_ppa.gpg -O /etc/apt/trusted.gpg.d/ltsp_ubuntu_ppa.gpg
apt update
apt install ltsp dnsmasq nfs-kernel-server openssh-server squashfs-tools epoptes
gpasswd -a rkwesk epoptes

Now to the issue:

root@buster64dualnicltsp19:~# ltsp dnsmasq --proxy-dhcp=0
Failed to get global data: Unit dbus-org.freedesktop.resolve1.service not found.
LTSP command failed: systemd-resolve --status
Installed /usr/share/ltsp/server/dnsmasq/ltsp-dnsmasq.conf in /etc/dnsmasq.d/ltsp-dnsmasq.conf
Job for dnsmasq.service failed because the control process exited with error code.
See "systemctl status dnsmasq.service" and "journalctl -xe" for details.
LTSP command failed: systemctl restart dnsmasq
Aborting ltsp
root@buster64dualnicltsp19:~# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 192.168.1.1
nameserver 10.72.251.10
nameserver fe80::b167:ea72:3350:81b8%enp1s2

root@buster64dualnicltsp19:~# systemctl restart dnsmasq.service
Job for dnsmasq.service failed because the control process exited with error code.
See "systemctl status dnsmasq.service" and "journalctl -xe" for details.
root@buster64dualnicltsp19:~# systemctl status dnsmasq.service |tail -20
● dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
   Loaded: loaded (/lib/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2019-08-24 19:43:36 EEST; 6s ago
  Process: 9859 ExecStartPre=/usr/sbin/dnsmasq --test (code=exited, status=0/SUCCESS)
  Process: 9860 ExecStart=/etc/init.d/dnsmasq systemd-exec (code=exited, status=1/FAILURE)

Aug 24 19:43:36 buster64dualnicltsp19 systemd[1]: Starting dnsmasq - A lightweight DHCP and caching DNS server...
Aug 24 19:43:36 buster64dualnicltsp19 dnsmasq[9859]: dnsmasq: syntax check OK.
Aug 24 19:43:36 buster64dualnicltsp19 dnsmasq[9860]: dnsmasq: bad IP address at line 33 of /etc/dnsmasq.d/ltsp-dnsmasq.conf
Aug 24 19:43:36 buster64dualnicltsp19 dnsmasq[9860]: bad IP address at line 33 of /etc/dnsmasq.d/ltsp-dnsmasq.conf
Aug 24 19:43:36 buster64dualnicltsp19 dnsmasq[9860]: FAILED to start up
Aug 24 19:43:36 buster64dualnicltsp19 systemd[1]: dnsmasq.service: Control process exited, code=exited, status=1/FAILURE
Aug 24 19:43:36 buster64dualnicltsp19 systemd[1]: dnsmasq.service: Failed with result 'exit-code'.
Aug 24 19:43:36 buster64dualnicltsp19 systemd[1]: Failed to start dnsmasq - A lightweight DHCP and caching DNS server.

As already discussed, I edited line 33 (the dhcp-option=option:dns-server line):

root@buster64dualnicltsp19:~# cat /etc/dnsmasq.d/ltsp-dnsmasq.conf
# This file is part of LTSP, https://ltsp.github.io
# Copyright 2019 the LTSP team, see AUTHORS
# SPDX-License-Identifier: GPL-3.0-or-later

# Configure dnsmasq for LTSP
# Documentation=man:ltsp-dnsmasq(8)

# For additional local dnsmasq configuration like DNS blacklisting, it's
# recommended to use separate /etc/dnsmasq.d/your-configuration.conf files,
# so that they're not lost if you ever (re)run `ltsp --overwrite dnsmasq`.

# port=0 disables the DNS service of dnsmasq
port=0

# enable-tftp enables the TFTP service of dnsmasq
enable-tftp

# FHS 2.3+ recommends /srv for tftp (debian #477109, LP #84615)
tftp-root=/srv/tftp

# Log lots of extra information about DHCP transactions
#log-dhcp

# IP ranges to hand out, usually on the internal LTSP subnet of 2-NIC setups
dhcp-range=192.168.67.20,192.168.67.250,12h

# If another DHCP server is present on the network, a proxy range may be used
# instead. This makes dnsmasq provide boot information but not IP leases.
#dhcp-range=set:proxy,192.168.0.0,proxy,255.255.255.0

# Specify the DNS server. 0.0.0.0 means the machine running dnsmasq.
# DNS_SERVER in ltsp.conf is preferred as it reaches proxy DHCP clients.
dhcp-option=option:dns-server,10.72.251.10

# Set some tags to be able to separate client settings later on.
# "39" means "recent iPXE with menu support": http://ipxe.org/howto/dhcpd
dhcp-match=set:iPXE,175,39
dhcp-match=set:X86PC,option:client-arch,0
dhcp-match=set:X86-64_EFI,option:client-arch,7
# Due to rfc4578 errata, sometimes BC_EFI=9 is misused instead of X86-64_EFI=7:
dhcp-match=set:X86-64_EFI,option:client-arch,9
dhcp-mac=set:rpi,b8:27:eb:*:*:*
dhcp-mac=set:rpi,dc:a6:32:*:*:*

# In proxy DHCP mode, the server ONLY sends its IP and the following filename.
# Service types: man dnsmasq or https://tools.ietf.org/html/rfc4578#section-2.1
# PXE services in non proxy subnets sometimes break UEFI netboot, so tag:proxy.
pxe-service=tag:proxy,tag:!iPXE,X86PC,"undionly.kpxe",ltsp/undionly.kpxe
pxe-service=tag:proxy,tag:!iPXE,X86-64_EFI,"snponly.efi",ltsp/snponly.efi
pxe-service=tag:proxy,tag:iPXE,X86PC,"ltsp.ipxe",ltsp/ltsp.ipxe
pxe-service=tag:proxy,tag:iPXE,X86-64_EFI,"ltsp.ipxe",ltsp/ltsp.ipxe
pxe-service=tag:rpi,X86PC,"Raspberry Pi",unused

# Specify the boot filename for each tag, relative to tftp-root.
# If multiple lines with tags match, the last one is used.
# See: https://www.syslinux.org/wiki/index.php?title=PXELINUX#UEFI
dhcp-boot=tag:!iPXE,tag:X86PC,ltsp/undionly.kpxe
dhcp-boot=tag:!iPXE,tag:X86-64_EFI,ltsp/snponly.efi
dhcp-boot=tag:iPXE,ltsp/ltsp.ipxe

# Proxy DHCP clients don't receive any DHCP options like root-path.
# So we set root-path in the kernel cmdline from ltsp.ipxe.
#dhcp-option=option:root-path,ipxe-menu-item
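
The thread does not show what line 33 contained before the edit. Assuming the generated dns-server value picked up the scoped IPv6 entry (fe80::...%enp1s2) from /etc/resolv.conf, a minimal sketch for collecting only the IPv4 nameservers could look like this. The function name `ipv4_nameservers` is made up for illustration and is not part of LTSP.

```shell
#!/bin/sh
# Print the IPv4 nameservers of a resolv.conf-style file as a comma-separated
# list. Scoped IPv6 entries such as fe80::...%enp1s2 are skipped.
# Hypothetical helper, not an LTSP function.
ipv4_nameservers() {
    awk '$1 == "nameserver" && $2 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
             printf "%s%s", sep, $2
             sep = ","
         }
         END { if (sep) print "" }' "${1:-/etc/resolv.conf}"
}

# Example: build a candidate line for /etc/dnsmasq.d/ltsp-dnsmasq.conf
# echo "dhcp-option=option:dns-server,$(ipv4_nameservers)"
```

The result still has to be reviewed by hand; the first nameserver listed is not necessarily the one the clients should use.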

But this time I ran up against a new error:

root@buster64dualnicltsp19:~# systemctl restart dnsmasq.service
Job for dnsmasq.service failed because the control process exited with error code.
See "systemctl status dnsmasq.service" and "journalctl -xe" for details.
root@buster64dualnicltsp19:~# systemctl status dnsmasq.service |tail -20
● dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
   Loaded: loaded (/lib/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2019-08-24 19:47:22 EEST; 7s ago
  Process: 9886 ExecStartPre=/usr/sbin/dnsmasq --test (code=exited, status=0/SUCCESS)
  Process: 9887 ExecStart=/etc/init.d/dnsmasq systemd-exec (code=exited, status=2)

Aug 24 19:47:22 buster64dualnicltsp19 systemd[1]: Starting dnsmasq - A lightweight DHCP and caching DNS server...
Aug 24 19:47:22 buster64dualnicltsp19 dnsmasq[9886]: dnsmasq: syntax check OK.
Aug 24 19:47:22 buster64dualnicltsp19 dnsmasq[9887]: dnsmasq: failed to bind DHCP server socket: Address already in use
Aug 24 19:47:22 buster64dualnicltsp19 dnsmasq[9887]: failed to bind DHCP server socket: Address already in use
Aug 24 19:47:22 buster64dualnicltsp19 dnsmasq[9887]: FAILED to start up
Aug 24 19:47:22 buster64dualnicltsp19 systemd[1]: dnsmasq.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
Aug 24 19:47:22 buster64dualnicltsp19 systemd[1]: dnsmasq.service: Failed with result 'exit-code'.
Aug 24 19:47:22 buster64dualnicltsp19 systemd[1]: Failed to start dnsmasq - A lightweight DHCP and caching DNS server.
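
"Address already in use" means another process already holds UDP port 67, the DHCP server port. A minimal sketch for finding that process, assuming the `ss` tool from iproute2 is installed; the helper name `dhcp_port_owners` is hypothetical:

```shell
#!/bin/sh
# Read `ss -lunp` output on stdin and print the process column of every
# socket whose local address ends in :67 (the DHCP server port).
# Hypothetical helper for illustration.
dhcp_port_owners() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /:67$/) { print $NF; next }
    }'
}

# Typical use, as root so that process names are visible:
# ss -lunp | dhcp_port_owners
```

If the owner turns out to be a dnsmasq that was not started by dnsmasq.service, the network manager connection sharing discussed later in this thread is a likely candidate.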

I solved this by going back to nm-connection-editor and changing the LAN connection from shared to manual.

After I rebooted the server, dnsmasq.service started without error.

The remaining install steps did not report any errors.

Richard

alkisg commented 5 years ago

Richard, AFAIK we can't use network manager's "connection sharing" feature, as it starts a DHCP server that doesn't offer a "boot filename". That's why in LTSP5 we provided an /etc/network/if-up.d/sch-scripts file that set up IP forwarding.

So, AFAIK this is still necessary, right? Or did you find a way around it where the internal NIC clients actually netboot and also have Internet access?

rkwesk commented 5 years ago

Not yet Alki. The client not finding an ip is my stumbling block now. Richard


rkwesk commented 5 years ago

I changed the LAN IP with network manager, but the client still cannot get an IP.

I even reran ltsp initrd and then rebooted the server, but the log still shows that the client cannot get an IP.

root@buster64dualnicltsp19:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp1s5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:1a:4d:25:fc:65 brd ff:ff:ff:ff:ff:ff
    inet 192.198.67.1/24 brd 192.198.67.255 scope global noprefixroute enp1s5
       valid_lft forever preferred_lft forever
    inet6 fe80::9a15:7128:7759:3581/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: enp1s2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 00:01:02:05:e7:43 brd ff:ff:ff:ff:ff:ff
    inet 10.72.251.201/24 brd 10.72.251.255 scope global dynamic noprefixroute enp1s2
       valid_lft 1813991sec preferred_lft 1813991sec
    inet6 2a02:587:2d24:1900:81b7:1ea7:c531:bbf7/64 scope global dynamic noprefixroute
       valid_lft 58855sec preferred_lft 58855sec
    inet6 fe80::11bd:1d2f:bf11:d9cf/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

and

Aug 24 22:45:32 buster64dualnicltsp19 dnsmasq-dhcp[458]: no address range available for DHCP request via enp1s5
Aug 24 22:45:34 buster64dualnicltsp19 dnsmasq-dhcp[458]: no address range available for DHCP request via enp1s5
Aug 24 22:45:38 buster64dualnicltsp19 dnsmasq-dhcp[458]: no address range available for DHCP request via enp1s5
Aug 24 22:45:46 buster64dualnicltsp19 dnsmasq-dhcp[458]: no address range available for DHCP request via enp1s5
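
dnsmasq logs "no address range available" when none of its dhcp-range lines matches the subnet of the interface the request arrived on. One thing worth checking here is that the `ip a` output shows enp1s5 with 192.198.67.1/24, while the dhcp-range in ltsp-dnsmasq.conf starts at 192.168.67.20. The thread does not confirm that this was the cause. A minimal sketch of the comparison, with a hypothetical helper name:

```shell
#!/bin/sh
# Print "yes" if two IPv4 addresses share the same /24 network, else "no".
# Works by comparing everything before the last dot, so a trailing /24
# prefix length on either argument is ignored as well.
same_net24() {
    if [ "${1%.*}" = "${2%.*}" ]; then
        echo yes
    else
        echo no
    fi
}

# Interface address from `ip a` versus the start of the configured dhcp-range:
same_net24 192.198.67.1/24 192.168.67.20
```

This only covers /24 networks, which is what both the interface and the LTSP default range use in this thread.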

Richard

rkwesk commented 5 years ago

Sorry, I put this in the wrong issue; I have rewritten it in issue 12.

However, in answer to your remark here in issue 11, I want to say that with LTSP5 the 2-NIC scenario works in Buster without using sch-scripts.

If you say that network manager cannot do a share with other computers in LTSP19, what about manually enabling IP forwarding in systemd and manually adding iptables rules? How else will fat clients get to the Internet from another LAN?

Richard

alkisg commented 5 years ago

Nothing changed from LTSP5 to LTSP19 wrt NAT etc. So if you think you had it working in LTSP5, it should work the same way in LTSP19.

What I'm saying is that AFAIK the method you describe for LTSP5 with network manager, without sch-scripts etc, shouldn't work, as network manager doesn't send a boot filename for the clients to boot with. So some combination of the things you did might have made it work, but not just network manager.

rkwesk commented 5 years ago

The relevant steps I followed specifically for the two nic scenario are:

Steps 1 through 5 and then 6(dual) through 9(dual) and finally 12 through 20 on https://wiki.debian.org/LTSP/Howto

I tested the server/client scenario 4 times and noticed that it worked 3 times. The one time it did not work the client could ping an external numerical ip address but not a named url so dns was not working. The workaround then was to restart network manager which allowed dns to work. My guess is that systemd (which defaults to ipforward disabled) was somehow activated after network manager whereas on the other times network manager was activated afterwards (race condition.)

The scenario is broken down into so many steps to ensure that a novice user can accomplish it. However, if you want to look, I think you will agree that nothing else except network manager is dealing with IP forwarding and NAT (iptables).

Yes, network manager does not send a boot filename for the clients to boot with. NBD and NFS do that.

Richard

alkisg commented 5 years ago

Richard, I mean this:

In dual NIC mode there's no proxyDHCP. The LTSP server provides the clients with an IP address and a boot filename. The boot filename in LTSP5 was "pxelinux.0"; now it's more complicated, as it involves 3 different files for BIOS/UEFI/iPXE.

The boot filename is not related to NFS or NBD. It's the first thing that the clients get via TFTP when they boot. The LTSP file that sends that filename is /etc/dnsmasq.d/ltsp-server-dnsmasq.conf in LTSP5, and /etc/dnsmasq.d/ltsp-dnsmasq.conf in newer versions.

Now to the problem. When you enable connection sharing in network manager, you instruct network manager to run a child dnsmasq process of its own! This process is then yet another DHCP server, which gives IP addresses to the clients, but without any special dnsmasq.conf configuration to give them the pxelinux.0 etc. filename. This is why the clients sometimes fail to boot in your tests.

So why is it working most of the time? Because that child dnsmasq fails to run, as the LTSP dnsmasq is already running on that interface.

That means that using network manager connection sharing isn't an appropriate solution for LTSP. It only works if LTSP dnsmasq is started before network-manager dnsmasq (i.e. race condition), and even then, it only works because network-manager has a bug and doesn't check if its child dnsmasq started successfully or not.

Network manager connection sharing could only be used for LTSP if it had an option to "not run dnsmasq as the user already runs dnsmasq there"; but I don't think it has such an option.

alkisg commented 5 years ago

If we agree that we should avoid relying on network manager and its additional dnsmasq service for connection sharing, since we already have one dnsmasq running that may conflict with the other one,

then I think we can move on and file a new issue and start the implementation of NAT inside the new LTSP code base?

I.e. ltsp dnsmasq --real-dhcp=1 (which is the default) would symlink a script in /etc/NetworkManager/dispatcher.d/ that would do NAT on 192.168.67.1, similar to what sch-scripts do.

Unfortunately this will only work when users use network manager and not e.g. ifupdown (/etc/network/interfaces), but ifupdown is getting deprecated and systemd-networkd doesn't (yet?) provide a method to run code when a network connection is established.
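
As a rough illustration of the dispatcher idea above, and not the sch-scripts code or whatever LTSP ends up shipping: NetworkManager calls each script in /etc/NetworkManager/dispatcher.d/ with the interface name as the first argument and the action (for example "up") as the second. The sketch below only prints the commands it would run. The file name, the `ltsp_nat` function and the `LAN_NET` variable are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical dispatcher script, e.g. /etc/NetworkManager/dispatcher.d/50-ltsp-nat
# NetworkManager passes the interface as $1 and the action as $2.
LAN_NET="192.168.67.0/24"   # assumed internal LTSP subnet

# Print the commands that enable IP forwarding and masquerade traffic
# leaving the LTSP subnet. Nothing is printed for actions other than "up".
ltsp_nat() {
    [ "$1" = "up" ] || return 0
    echo "sysctl -w net.ipv4.ip_forward=1"
    echo "iptables -t nat -A POSTROUTING -s $LAN_NET ! -d $LAN_NET -j MASQUERADE"
}

# In a real dispatcher script the last line would be:
# ltsp_nat "$2" | sh
```

Printing instead of executing makes the sketch safe to try as a normal user. A real script would also need to avoid adding the same iptables rule again on every "up" event.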

alkisg commented 5 years ago

I filed issue #13 about this; Richard I'll close this issue now as I think it's resolved, but we can still chat in it even if it's closed and of course you can reopen it if you think it still needs resolution.

rkwesk commented 5 years ago

Just to clarify: using LTSP5 with a server with two NICs, dnsmasq.conf was running dhcp-proxy on the WAN but proper DHCP on the LAN. In contrast, using LTSP19 with a server with two NICs, dnsmasq does not run dhcp-proxy on the WAN?

rkwesk commented 5 years ago

AFAIK in LTSP5 we looked at /etc/NetworkManager/NetworkManager.conf to see if there was a line with the dns= key, and if so we commented it out. That was enough, because the default was for network manager not to use dnsmasq.basic. Now I see you saying network manager will use dnsmasq.basic anyway.

alkisg commented 5 years ago

Just to clarify: using LTSP5 with a server with two NICs, dnsmasq.conf was running dhcp-proxy on the WAN but proper DHCP on the LAN. In contrast, using LTSP19 with a server with two NICs, dnsmasq does not run dhcp-proxy on the WAN?

From man ltsp dnsmasq:

-p, --proxy-dhcp=0|1
Enable or disable the proxy DHCP service. Defaults to 1. Proxy DHCP means that the LTSP
server sends the boot filename, but it leaves the IP leasing to an external DHCP
server, for example a router or pfsense or a Windows DHCP server. It's the easiest way
to set up LTSP, as it only requires a single NIC with no static IP, no need to rewire
switches etc.

-r, --real-dhcp=0|1
Enable or disable the real DHCP service. Defaults to 1. In dual NIC setups, you only
need to configure the internal NIC to a static IP of 192.168.67.1; LTSP will try to
autodetect everything else. The real DHCP service doesn't take effect if your IP isn't
192.168.67.x, so there's no need to disable it in single NIC setups unless you want to
run isc-dhcp-server on the LTSP server.

I.e. both of them are enabled by default, and there are options to disable them.
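
Since the man page says the real DHCP service only takes effect on a 192.168.67.x address, a quick check is to list the interfaces that actually carry one. This is a minimal sketch that parses the one-line output of `ip -o -4 addr show`; the helper name is hypothetical and not an LTSP command.

```shell
#!/bin/sh
# Read `ip -o -4 addr show` output on stdin and print the name of every
# interface whose IPv4 address is in 192.168.67.x.
# In that output, field 2 is the interface and field 4 is address/prefix.
ltsp_real_dhcp_ifaces() {
    awk '$3 == "inet" && $4 ~ /^192\.168\.67\./ { print $2 }'
}

# Typical use:
# ip -o -4 addr show | ltsp_real_dhcp_ifaces
```

If this prints nothing, then according to the excerpt above the real DHCP range will not be served on any interface.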

alkisg commented 5 years ago

AFAIK in LTSP5 we looked at /etc/NetworkManager/NetworkManager.conf to see if there was a line with the dns= key, and if so we commented it out. That was enough, because the default was for network manager not to use dnsmasq.basic. Now I see you saying network manager will use dnsmasq.basic anyway.

There are 3 dnsmasq instances involved:

  1. The main dnsmasq that LTSP configures
  2. The one that network-manager spawns when you do connection sharing
  3. The one you just mentioned, that Ubuntu (but not Debian) spawned because they wanted to optimize local DNS; they aborted that when they adopted systemd-resolved

I was talking about "2" conflicting with "1" there, which we don't want.

alkisg commented 5 years ago

Just for clarity, these are the command lines of "1" and "2" as they show up in ps faux | grep dnsmasq. E.g. in the current Debian how-to, when "1" runs first, things are almost OK, but when "2" runs first, clients won't be able to boot. We don't care about "3", Ubuntu stopped shipping that.

This is the main dnsmasq that LTSP configures, "1":

dnsmasq 1340 0.0 0.0 18012 340 ? S 23:17 0:00 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,20326,8,2,e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d

This is the auxiliary dnsmasq for connection sharing, "2", that conflicts with the main one:

nobody 1236 0.0 0.1 18124 3848 ? S 23:13 0:00 /usr/sbin/dnsmasq --conf-file=/dev/null --no-hosts --keep-in-foreground --bind-interfaces --except-interface=lo --clear-on-reload --strict-order --listen-address=192.168.67.1 --dhcp-range=192.168.67.10,192.168.67.254,60m --dhcp-lease-max=50 --dhcp-leasefile=/var/lib/NetworkManager/dnsmasq-enp0s3.leases --pid-file=/run/nm-dnsmasq-enp0s3.pid --conf-dir=/etc/NetworkManager/dnsmasq-shared.d
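
The two command lines above can be told apart mechanically: the connection sharing instance is started with --conf-dir=/etc/NetworkManager/dnsmasq-shared.d, while the main one reads /etc/dnsmasq.d. A minimal sketch that labels dnsmasq command lines on that basis; the function name and the labels are made up for this example.

```shell
#!/bin/sh
# Read process command lines on stdin (for example from `pgrep -a dnsmasq`)
# and print one label per line, based on which config directory is used.
classify_dnsmasq() {
    while IFS= read -r line; do
        case $line in
            *dnsmasq-shared.d*) echo "network-manager-shared" ;;
            */etc/dnsmasq.d*)   echo "main" ;;
            *)                  echo "other" ;;
        esac
    done
}

# Typical use:
# pgrep -a dnsmasq | classify_dnsmasq
```

Any "network-manager-shared" line on an LTSP server suggests that connection sharing is still enabled on some connection.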