Concurrency issue when nodes are brought up in parallel

electrofelix commented 7 years ago

It appears that when using parallel up of machines that landrush may attempt to start the DNS server multiple times, and consequently the wrong PID is recorded since the last started server's pid will be written to the pidfile instead of the running one.

This was confirmed by adding a simple:

puts "pid = #{pid}"

before the write_pid() call in lib/landrush/server.rb

The resulting output was:

[landrush] Using eth0 (172.17.0.5)
==> srv-1.ncs.net: [landrush] starting dns server
    srv-1.ncs.net: [landrush] 'ruby /home/baileybd/.vagrant.d/gems/gems/landrush-1.2.0/lib/landrush/start_server.rb 10053 /home/baileybd/.vagrant.d/data/landrush /home/baileybd/.vagrant.d/gems/gems'
==> srv-4.ncs.net: [landrush] adding machine entry: srv-4.ncs.net => 172.17.0.5#<Process::Waiter:0x007f5c881dd2d8>
pid = 19564
==> srv-5.ncs.net: [landrush] starting dns server
[landrush] Using eth0 (172.17.0.6)
    srv-1.ncs.net: [landrush] 'ruby /home/baileybd/.vagrant.d/gems/gems/landrush-1.2.0/lib/landrush/start_server.rb 10053 /home/baileybd/.vagrant.d/data/landrush /home/baileybd/.vagrant.d/gems/gems'
pid = 19571
==> srv-2.ncs.net: [landrush] adding machine entry: srv-2.ncs.net => 172.17.0.6
[landrush] Using eth0 (172.17.0.4)

Checking the processes for start_server.rb

ps auxww | grep landrush/start_server.rb |'{print $2}'
19564

However the process pid file:

cat ~/.vagrant.d/data/landrush/run/landrush.pid
19571

Therefore with parallel machine up's the code can be executed more than once, and this can result in the wrong pid being recorded, since it will be the last attempted start of the DNS server (which may fail due to the port being in use), resulting in the second pid being written out to the pidfile.

The main area responsible for this is the post_boot_setup action: https://github.com/vagrant-landrush/landrush/blob/1077e140d6b8bf783eed850e2174e8b84bb22275/lib/landrush/action/setup.rb#L35-L43

       def post_boot_setup
        record_dependent_vm
        configure_server
        record_machine_dns_entry
        setup_static_dns
        start_server
        return unless machine.config.landrush.host_redirect_dns?
        env[:host].capability(:configure_visibility_on_host, host_ip_address, config.tld)
      end

The calls configure_server, setup_static_dns & start_server should all only be executed once, by which ever machine thread reaches this code first. This would also solve the problem with failing to find the DNS server process on destroy.

hferentschik commented 7 years ago

Thanks for the bug report and pull request. Looking into it.

hferentschik commented 7 years ago

It appears that when using parallel up of machines that landrush may attempt to start the DNS server multiple times,

So you mean

vagrant up --parallel

electrofelix commented 7 years ago

@hferentschik exactly:

Looks like something has changed that prevents multiple instances of the server from being started without this patch, I'm using vagrant 1.8.7.

However I am seeing entries failing to be added to ~/.vagrant.d/data/landrush/hosts.json and missing entries from vagrant landrush list in certain cases, so although the server is now starting once only there are other problems in the configuration being written out that would be solved by this change.

Vagrantfile

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  config.vm.box = "debian/jessie64"

  config.landrush.enabled = true
  config.landrush.tld = 'ncs.net'
  config.landrush.host_redirect_dns = false
  # use my host resolvconf dnsmasq instance, probably not be required
  # for systems that can reach the google dns servers instead
  config.landrush.upstream '127.0.0.1'

  config.vm.provider :libvirt do |libvirt, override|
    override.vm.synced_folder ".", "/vagrant", type: "9p"

    libvirt.memory = 4096
    libvirt.cpus = 2
    libvirt.disk_bus = "scsi"
    libvirt.volume_cache = "unsafe"
    libvirt.nested = true
    libvirt.cpu_mode = "host-passthrough"
    libvirt.driver = "kvm"
    libvirt.host = "localhost"
    libvirt.uri = "qemu:///system"
  end

  (1..10).each do |n|
    config.vm.define "srv-#{n}" do |node|
      node.vm.hostname = "srv-#{n}.ncs.net"
    end
  end
end

vagrant landrush stop
rm -rf ~/.vagrant.d/data/landrush/
vagrant up --parallel --provider libvirt
... <lots of output>
vagrant landrush list
# produced the following output
193.121.168.192.in-addr.arpa   srv-3.test.net
srv-2.test.net                 192.168.121.124
124.121.168.192.in-addr.arpa   srv-2.test.net
srv-9.test.net                 192.168.121.188
188.121.168.192.in-addr.arpa   srv-9.test.net
srv-6.test.net                 192.168.121.186
186.121.168.192.in-addr.arpa   srv-6.test.net
srv-7.test.net                 192.168.121.240
240.121.168.192.in-addr.arpa   srv-7.test.net
srv-5.test.net                 192.168.121.46
46.121.168.192.in-addr.arpa    srv-5.test.net

Bringing the set up and down without removing the landrush data dir results in more getting populated, but it shows that there is a race in the configuration getting written out.

And I also wouldn't be too confident that the multiple attempts to start the daemon won't return unless the reason for it starting to behave itself can be tracked down as it's likely just timing is allowing things to work better now.

hferentschik commented 5 years ago

Merged via pull request #328

vagrant-landrush / landrush

Concurrency issue when nodes are brought up in parallel #304