vagrant-landrush / landrush

A Vagrant plugin that provides a simple DNS server for Vagrant guests
MIT License

Improve concurrency behaviour for multi-machine envs #305

Closed electrofelix closed 5 years ago

electrofelix commented 7 years ago

Use class variables to support configuration of the DNS server once per run for multi-machine environments.

Process-level locking is done using a class variable containing a Mutex and a simple synchronization block, followed by a check for whether another thread has already performed the configuration.

This ensures that during a parallel bring-up of a multi-machine environment only a single attempt is made to start the DNS server, and therefore the correct pid is available during destroy for cleanup.
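
A minimal sketch of the pattern described above, assuming a Vagrant middleware-style action class (the names StartServer, @@server_checked, and start_dns_server are illustrative stand-ins, not the actual patch):

class StartServer
  @@lock = Mutex.new
  @@server_checked = false

  def initialize(app, env)
    @app = app
  end

  def call(env)
    @@lock.synchronize do
      # Re-check inside the critical section: another thread may have
      # already configured and started the DNS server while we waited.
      unless @@server_checked
        start_dns_server(env) # stands in for the existing start logic
        @@server_checked = true
      end
    end
    @app.call(env)
  end
end

Because the Mutex lives in a class variable, every machine's action thread within the same vagrant process synchronizes on the one lock.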

Fixes #304

electrofelix commented 7 years ago

Is anyone able to comment on whether this approach is considered acceptable? Are there alternatives that would work better as a locking mechanism across threads in Vagrant than a lock declared as a class variable?

hferentschik commented 7 years ago

@electrofelix, could you post a Vagrantfile with an example config?

electrofelix commented 7 years ago

Apologies, it seems I posted the example config on the bug rather than on the PR. Copying it here.

It looks like something has changed that now prevents multiple instances of the server from being started even without this patch; I'm using vagrant 1.8.7.

However, in certain cases I am seeing entries fail to be added to ~/.vagrant.d/data/landrush/hosts.json and entries missing from vagrant landrush list. So although the server now starts only once, there are other problems in how the configuration gets written out: the record_machine_dns_entry calls to Store also appear not to be thread safe.
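
Schematically, that race looks like an unsynchronized read-modify-write on the JSON file (this is illustrative, not Landrush's actual Store implementation):

require 'json'

# Two threads interleaving here can silently drop each other's entries:
data = JSON.parse(File.read(hosts_json))            # both threads read the same snapshot
data[hostname] = ip                                 # each thread adds a different record
File.write(hosts_json, JSON.pretty_generate(data))  # last writer wins; the other entry is lost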

Vagrantfile

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  config.vm.box = "debian/jessie64"

  config.landrush.enabled = true
  config.landrush.tld = 'ncs.net'
  config.landrush.host_redirect_dns = false
  # use my host resolvconf dnsmasq instance, probably not required
  # for systems that can reach the google dns servers instead
  config.landrush.upstream '127.0.0.1'

  config.vm.provider :libvirt do |libvirt, override|
    override.vm.synced_folder ".", "/vagrant", type: "9p"

    libvirt.memory = 4096
    libvirt.cpus = 2
    libvirt.disk_bus = "scsi"
    libvirt.volume_cache = "unsafe"
    libvirt.nested = true
    libvirt.cpu_mode = "host-passthrough"
    libvirt.driver = "kvm"
    libvirt.host = "localhost"
    libvirt.uri = "qemu:///system"
  end

  (1..10).each do |n|
    config.vm.define "srv-#{n}" do |node|
      node.vm.hostname = "srv-#{n}.ncs.net"
    end
  end
end
vagrant landrush stop
rm -rf ~/.vagrant.d/data/landrush/
vagrant up --parallel --provider libvirt
... <lots of output>
vagrant landrush list
# produced the following output
193.121.168.192.in-addr.arpa   srv-3.test.net
srv-2.test.net                 192.168.121.124
124.121.168.192.in-addr.arpa   srv-2.test.net
srv-9.test.net                 192.168.121.188
188.121.168.192.in-addr.arpa   srv-9.test.net
srv-6.test.net                 192.168.121.186
186.121.168.192.in-addr.arpa   srv-6.test.net
srv-7.test.net                 192.168.121.240
240.121.168.192.in-addr.arpa   srv-7.test.net
srv-5.test.net                 192.168.121.46
46.121.168.192.in-addr.arpa    srv-5.test.net

Bringing the set up and down without removing the landrush data dir results in more entries getting populated, but it shows there is a race in how the configuration gets written out.

This does appear to be somewhat host-dependent, as I've had better results on some machines than on others, with the above being the worst case I've seen.

I also wouldn't be too confident that the multiple attempts to start the daemon won't return unless the reason it has started behaving itself can be tracked down; it's likely just timing that is allowing things to work better now.

I've also seen this fail to remove entries after some destroys: when one of the entries is missing, the corresponding reverse-lookup entry is not correctly removed.
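
That would be explained if removal derives the reverse entry from the forward one. A schematic sketch (store.get/store.delete and reverse_dns_name are hypothetical stand-ins, not Landrush's actual API):

ip = store.get(hostname)                    # nil when the forward entry was lost in the race
store.delete(hostname)
store.delete(reverse_dns_name(ip)) if ip    # skipped when ip is nil, orphaning the PTR record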

Thinking about this further: someone may also be executing multiple vagrant processes at once (an edge case, to be sure). Using a lockfile around any Store modifications and around attempts to start the server would ensure that any parallel bring-up is both multi-thread and multi-process safe, as sketched below.
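
A rough sketch of what that could look like using Ruby's File#flock (the helper name and lock path are illustrative):

def with_store_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |f|
    f.flock(File::LOCK_EX)  # blocks until the exclusive lock is acquired
    yield
  end                       # lock is released when the file is closed
end

# Wrap both Store modifications and server start attempts, e.g.:
# with_store_lock(lock_path) { store.set(hostname, ip) }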

hferentschik commented 5 years ago

I looked a bit at this. This would really only help for the vagrant up --parallel case, right? Locking works across threads, but not across processes. Potentially one could imagine two independent parallel startups of Vagrant machines. For that reason, I am wondering whether one should not just synchronize on the config file directly, using something like Ruby's File.flock or, for a slightly higher-level abstraction, the filelock gem.
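
For reference, the filelock gem wraps that mechanism in a block API; a sketch, with an arbitrary lock path chosen here for illustration:

require 'filelock'

Filelock File.join(Dir.home, '.vagrant.d', 'data', 'landrush', 'lock') do
  # exclusive across threads and processes; safe to update hosts.json
  # or check/start the DNS server here
end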

That said, I have a hard time reproducing the problem. Even with --parallel I don't get into a situation where Landrush config/start breaks due to sync issues. I am using:

Vagrant.configure(2) do |config|
  config.vm.box = 'ubuntu/xenial64'

  config.landrush.enabled = true
  config.landrush.tld = 'example.com'

  config.landrush.host 'static1.example.com', '1.2.3.4'
  config.landrush.host 'static2.example.com', '2.3.4.5'

  (1..3).each do |n|
    config.vm.define "landrush-#{n}" do |node|
      node.vm.hostname = "landrush-#{n}.example.com"
    end
  end
end

That said, synchronizing access is probably a good idea regardless.

hferentschik commented 5 years ago

Closing this pull request, which is superseded by pull request #328.