msimonin / vagrant-g5k

Hacking around vagrant and g5k
MIT License

Using vagrant-g5k with a big number of machines #8

Closed Brandonage closed 7 years ago

Brandonage commented 7 years ago

I'm experiencing an issue when deploying a large number of virtual machines with vagrant-g5k. I configured a Vagrantfile to deploy 100 CentOS VMs with 4 GB of memory and 2 cores each (the image can be found in /home/abrandon/public/centos_7.2_dcos.qcow2). Some of the VMs deploy fine, but for others I get the error below. Out of the 100 machines, only 34 end up running:

usuariop87:vagrant-g5k alvarobrandon$ vagrant status | grep Running | wc -l
34

An unexpected error occurred when executing the action on the
'test-71' machine. Please report this as a bug:

closed stream

/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/ruby_compat.rb:25:in `select'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/ruby_compat.rb:25:in `io_select'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/transport/packet_stream.rb:75:in `available_for_read?'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/transport/packet_stream.rb:87:in `next_packet'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/transport/session.rb:185:in `block in poll_message'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/transport/session.rb:180:in `loop'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/transport/session.rb:180:in `poll_message'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/connection/session.rb:471:in `dispatch_incoming_packets'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/connection/session.rb:222:in `preprocess'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/connection/session.rb:206:in `process'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/connection/session.rb:170:in `block in loop'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/connection/session.rb:170:in `loop'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/connection/session.rb:170:in `loop'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/driver.rb:38:in `block in exec'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/driver.rb:17:in `synchronize'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/driver.rb:17:in `exec'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/disk/local.rb:40:in `check_storage'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/g5k_connection.rb:81:in `check_storage'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/action/read_state.rb:23:in `read_state'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/action/read_state.rb:15:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:34:in `call'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/action/connect_g5k.rb:36:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:34:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/builtin/config_validate.rb:25:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:34:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/builder.rb:116:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/runner.rb:66:in `block in run'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/util/busy.rb:19:in `busy'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/runner.rb:66:in `run'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:225:in `action_raw'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:200:in `block in action'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:182:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:182:in `block in action'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:186:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:186:in `action'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/provider.rb:32:in `state'
/Users/alvarobrandon/.vagrant.d/gems/2.2.5/gems/vagrant-g5k-0.9.4/lib/vagrant-g5k/action/run_instance.rb:28:in `recover'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:67:in `block in recover'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:64:in `each'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:64:in `recover'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/builtin/call.rb:61:in `recover'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:67:in `block in recover'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:64:in `each'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:64:in `recover'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:53:in `rescue in call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/warden.rb:28:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/builder.rb:116:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/runner.rb:66:in `block in run'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/util/busy.rb:19:in `busy'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/runner.rb:66:in `run'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:225:in `action_raw'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:200:in `block in action'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/environment.rb:567:in `lock'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:186:in `call'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/machine.rb:186:in `action'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/batch_action.rb:82:in `block (2 levels) in run'
Brandonage commented 7 years ago

I'm using ruby 2.0.0p648 (2015-12-16 revision 53162) [universal.x86_64-darwin15]. Maybe it is an issue between this version and the net-ssh library?

Brandonage commented 7 years ago

I tried again with a newer Ruby version, and apparently only two machines failed, with the errors below. The funny thing is that, although the output says that only two machines failed, a bunch of other machines are also in an error state, for example test-8:

==> test-89: ready @rennes
==> test-89: Waiting for machine to boot. This may take a few minutes...
==> test-39: ready @rennes
==> test-39: Waiting for machine to boot. This may take a few minutes...
==> test-54: Waiting for the job to be running
==> test-8: Machine booted and ready!
==> test-91: Machine booted and ready!
==> test-51: Waiting for the job to be running

test-1 Running (g5k)
test-2 Terminated (g5k)
test-3 Running (g5k)
test-4 Terminated (g5k)
test-5 Running (g5k)
test-6 Running (g5k)
test-7 Running (g5k)
test-8 Error (g5k)
test-9 Running (g5k)
test-10 Running (g5k)
test-11 Running (g5k)
test-12 Error (g5k)

An error occurred while executing multiple actions in parallel.
Any errors that occurred are shown below.

An unexpected error occurred when executing the action on the
'test-19' machine. Please report this as a bug:

Broken pipe - send(2)

/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:148:in `send'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:148:in `send_packet'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:165:in `send_and_wait'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:76:in `negotiate!'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:48:in `connect'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/key_manager.rb:179:in `agent'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/key_manager.rb:103:in `each_identity'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/methods/publickey.rb:19:in `authenticate'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/session.rb:79:in `block in authenticate'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/session.rb:66:in `each'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/session.rb:66:in `authenticate'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh.rb:229:in `start'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:397:in `block (2 levels) in connect'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `block in catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:103:in `timeout'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:371:in `block in connect'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/util/retryable.rb:17:in `retryable'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:370:in `connect'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:68:in `block in wait_for_ready'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `block in catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:103:in `timeout'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:46:in `wait_for_ready'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/builtin/wait_for_communicator.rb:16:in `block in call'

An error occurred while executing the action on the 'test-48'
machine. Please handle this error then try again:

Remote command error
msimonin commented 7 years ago

Hi @Brandonage,

Interesting :) In general, Vagrant isn't designed to start a large number of VMs (e.g. there are many parts of the framework that act on the instances sequentially). In your specific case it's difficult to know what is happening. Can you share your Vagrantfile?

Brandonage commented 7 years ago

Of course. Here's the Vagrantfile. I guessed it was not a good idea, but it really simplifies deploying experiments. Are there any alternatives?

# -*- mode: ruby -*-
# vi: set ft=ruby :
#
# Sample Vagrantfile
#
Vagrant.configure(2) do |config|

      config.vm.provider "g5k" do |g5k, override|
        override.nfs.functional = false
        g5k.project_id = "test-vagrant-g5k"
        g5k.site = "rennes"
        g5k.username = "abrandon"
        g5k.gateway = "access.grid5000.fr"
        g5k.walltime = "02:30:00"
        g5k.image = {
          :path    => "/home/abrandon/public/centos_7.2_dcos.qcow2",
          :backing => "snapshot"
        }
        g5k.net = {
          :type => "bridge",
#            :ports => ["#{2222+i}-:22"]
        }
        g5k.oar = "virtual != 'none'"
        g5k.resources = {
          :cpu => 2,
          :mem => 4096
        }
      end #g5k

    ## This defines a VM.
    ## A g5k provider section will override the top-level options.
    ## To define multiple VMs you can
    ## * either repeat the block
    ## * or loop using a (1..N).each block
    (1..100).each do |i|
      config.vm.define "test-#{i}" do |my|
        my.vm.box = "dummy"
        ## Configure the shared folders between your host and the VM
        my.vm.synced_folder ".", "/vagrant", type: "rsync", disabled: false
        ## This is mandatory until #6 is fixed
        ## In particular this is needed for the shared folders
        my.ssh.insert_key = false
      end #vm
    end
end
msimonin commented 7 years ago

Hello,

I managed to start 99 VMs. I just cleaned up the Vagrantfile a bit (this shouldn't be related to your problem though :)).

I have:

╰─$ vagrant --version
Vagrant 1.9.1
╰─$ ruby --version
ruby 2.0.0p481
# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure(2) do |config|

    (1..100).each do |i|
      config.vm.define "test-#{i}" do |my|
        my.vm.box = "dummy"
        my.vm.provider "g5k" do |g5k, override|
          override.nfs.functional = false
          override.vm.synced_folder ".", "/vagrant", type: "rsync", disabled: false
          override.ssh.insert_key = false
          g5k.project_id = "test-vagrant-g5k"
          g5k.site = "rennes"
          g5k.username = "msimonin"
          g5k.gateway = "access.grid5000.fr"
          g5k.walltime = "45:00:00"
          g5k.image = {
            :path    => "/home/abrandon/public/centos_7.2_dcos.qcow2",
            :backing => "snapshot"
          }
          g5k.net = {
            :type => "bridge",
          }
          g5k.oar = "virtual != 'none'"
          g5k.resources = {
            :cpu => 1,
            :mem => 2048
          }
        end #g5k
      end
    end
end
msimonin@frennes:~$ oarstat -u |grep "test-"|wc -l
100
msimonin commented 7 years ago

As a follow-up: when working with a large number of VMs, vagrant status takes ages. I don't know yet whether I can change Vagrant's default behaviour so that it reports the status faster than it does now (and avoids timeouts). In the previous message I checked the status 'manually' by issuing an oarstat -u command directly on the g5k frontend to check that all the jobs were running (which means the VMs are running).
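The manual check described here can be scripted. The following is a minimal Ruby sketch, not part of vagrant-g5k: `count_jobs` and `missing_jobs` are hypothetical helpers that mirror `grep "test-" | wc -l` by counting the lines of captured `oarstat -u` output that contain the job-name prefix. It assumes that each job occupies one line of output and that the VM name appears on that line.

```ruby
# Hypothetical helpers (not part of vagrant-g5k or OAR).
# Assumption: one line of `oarstat -u` output per job, with the VM name
# (e.g. "test-42") appearing somewhere on that line.

# Count the lines that mention the given job-name prefix,
# like `oarstat -u | grep "test-" | wc -l`.
def count_jobs(oarstat_output, prefix)
  oarstat_output.each_line.count { |line| line.include?(prefix) }
end

# Number of expected jobs that do not show up in the output (never negative).
def missing_jobs(expected, oarstat_output, prefix)
  [expected - count_jobs(oarstat_output, prefix), 0].max
end
```

On the frontend, the output could be captured with `IO.popen(%w[oarstat -u], &:read)` and passed to `missing_jobs(100, output, "test-")`; a result of 0 would match the count of 100 shown above.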

Brandonage commented 7 years ago

I'm not deploying from the frontend but from my laptop, which could explain why I'm getting the errors. And yes, vagrant status is painfully slow. As a workaround, I'm running a script that executes vagrant up for each test machine that is in a failed state.

Next, I'm going to try deploying from the frontend to see whether it helps reduce the errors.
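A retry workaround of this kind could look roughly like the sketch below. It is an illustration only, not the script actually used in this thread: the helper names (`parse_status`, `machines_to_retry`, `up_commands`) are hypothetical, and it assumes `vagrant status` prints one `NAME STATE (provider)` line per machine, as in the output pasted earlier.

```ruby
# Hypothetical helpers illustrating a "retry the failed machines" script.
# Assumption: each machine appears as "NAME STATE (provider)" on its own line.
STATUS_LINE = /\A(\S+)\s+(.+?)\s+\(([^)]+)\)\s*\z/

# Map machine name => state, ignoring lines that are not status lines.
def parse_status(text)
  text.each_line.each_with_object({}) do |line, states|
    match = STATUS_LINE.match(line.strip)
    states[match[1]] = match[2] if match
  end
end

# Names of all machines whose state is anything other than "Running".
def machines_to_retry(states)
  states.reject { |_name, state| state.casecmp("running").zero? }.keys
end

# Build one `vagrant up` argument list per batch of machine names.
# A batch size of 1 runs `vagrant up` once per machine.
def up_commands(names, batch_size = 1)
  names.each_slice(batch_size).map { |batch| ["vagrant", "up", *batch] }
end
```

Each returned command can be run with `system(*cmd)`. Keeping the batch size small limits how many machines are brought up at once, at the cost of a slower overall deployment.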

msimonin commented 7 years ago

@Brandonage you can't use vagrant-g5k from the frontend :)

Brandonage commented 7 years ago

Oh, OK. The oarstat line confused me :). I think this is more related to Vagrant itself than to vagrant-g5k. I suppose you can close this issue.

Thanks again for your help!