Closed Brandonage closed 7 years ago
I'm using ruby 2.0.0p648 (2015-12-16 revision 53162) [universal.x86_64-darwin15]. Maybe it is an issue between this version and the net-ssh library?
I tried again with a newer Ruby version, and apparently only two machines failed with the errors below. The odd thing is that, although the output says that only two machines failed, a bunch of other machines are also in an error state, for example test-8:
==> test-89: ready @rennes
==> test-89: Waiting for machine to boot. This may take a few minutes...
==> test-39: ready @rennes
==> test-39: Waiting for machine to boot. This may take a few minutes...
==> test-54: Waiting for the job to be running
==> test-8: Machine booted and ready!
==> test-91: Machine booted and ready!
==> test-51: Waiting for the job to be running
test-1 Running (g5k)
test-2 Terminated (g5k)
test-3 Running (g5k)
test-4 Terminated (g5k)
test-5 Running (g5k)
test-6 Running (g5k)
test-7 Running (g5k)
test-8 Error (g5k)
test-9 Running (g5k)
test-10 Running (g5k)
test-11 Running (g5k)
test-12 Error (g5k)
An error occurred while executing multiple actions in parallel.
Any errors that occurred are shown below.
An unexpected error occurred when executing the action on the
'test-19' machine. Please report this as a bug:
Broken pipe - send(2)
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:148:in `send'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:148:in `send_packet'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:165:in `send_and_wait'
Host g5k
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:76:in `negotiate!'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/agent/socket.rb:48:in `connect'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/key_manager.rb:179:in `agent'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/key_manager.rb:103:in `each_identity'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/methods/publickey.rb:19:in `authenticate'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/session.rb:79:in `block in authenticate'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/session.rb:66:in `each'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh/authentication/session.rb:66:in `authenticate'
/opt/vagrant/embedded/gems/gems/net-ssh-3.0.2/lib/net/ssh.rb:229:in `start'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:397:in `block (2 levels) in connect'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `block in catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:103:in `timeout'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:371:in `block in connect'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/util/retryable.rb:17:in `retryable'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:370:in `connect'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:68:in `block in wait_for_ready'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `block in catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/opt/vagrant/embedded/lib/ruby/2.2.0/timeout.rb:103:in `timeout'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/plugins/communicators/ssh/communicator.rb:46:in `wait_for_ready'
/opt/vagrant/embedded/gems/gems/vagrant-1.9.3/lib/vagrant/action/builtin/wait_for_communicator.rb:16:in `block in call'
An error occurred while executing the action on the 'test-48'
machine. Please handle this error then try again:
Remote command error
Hi @Brandonage,
Interesting :) In general, Vagrant isn't designed to start a large number of VMs (e.g. there are many parts of the framework that perform sequential actions on the instances). In your specific case it's difficult to know what is happening. Can you share your Vagrantfile?
Of course. Here's the Vagrantfile. I guessed it was not a good idea, but it really simplifies deploying experiments. Are there any alternatives?
# -*- mode: ruby -*-
# vi: set ft=ruby :
#
# Sample Vagrantfile
#
Vagrant.configure(2) do |config|
  config.vm.provider "g5k" do |g5k, override|
    override.nfs.functional = false
    g5k.project_id = "test-vagrant-g5k"
    g5k.site = "rennes"
    g5k.username = "abrandon"
    g5k.gateway = "access.grid5000.fr"
    g5k.walltime = "02:30:00"
    g5k.image = {
      :path => "/home/abrandon/public/centos_7.2_dcos.qcow2",
      :backing => "snapshot"
    }
    g5k.net = {
      :type => "bridge",
      # :ports => ["#{2222+i}-:22"]
    }
    g5k.oar = "virtual != 'none'"
    g5k.resources = {
      :cpu => 2,
      :mem => 4096
    }
  end #g5k
  ## This defines a VM.
  ## A g5k provider section will override top-level options.
  ## To define multiple VMs you can
  ## * either repeat the block
  ## * or loop using a (1..N).each block
  (1..100).each do |i|
    config.vm.define "test-#{i}" do |my|
      my.vm.box = "dummy"
      ## Configure the shared folders between your host and the VM
      my.vm.synced_folder ".", "/vagrant", type: "rsync", disabled: false
      ## This is mandatory until #6 is fixed
      ## In particular this is needed for the shared folders
      my.ssh.insert_key = false
    end #vm
  end
end
Hello,
I managed to start 99 VMs. I just cleaned up the Vagrantfile a bit (this shouldn't be related to your problem though :)).
I have:
╰─$ vagrant --version
Vagrant 1.9.1
╰─$ ruby --version
ruby 2.0.0p481
# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure(2) do |config|
  (1..100).each do |i|
    config.vm.define "test-#{i}" do |my|
      my.vm.box = "dummy"
      my.vm.provider "g5k" do |g5k, override|
        override.nfs.functional = false
        override.vm.synced_folder ".", "/vagrant", type: "rsync", disabled: false
        override.ssh.insert_key = false
        g5k.project_id = "test-vagrant-g5k"
        g5k.site = "rennes"
        g5k.username = "msimonin"
        g5k.gateway = "access.grid5000.fr"
        g5k.walltime = "45:00:00"
        g5k.image = {
          :path => "/home/abrandon/public/centos_7.2_dcos.qcow2",
          :backing => "snapshot"
        }
        g5k.net = {
          :type => "bridge",
        }
        g5k.oar = "virtual != 'none'"
        g5k.resources = {
          :cpu => 1,
          :mem => 2048
        }
      end #g5k
    end
  end
end
msimonin@frennes:~$ oarstat -u |grep "test-"|wc -l
100
As a follow-up: when working with a large number of VMs, vagrant status takes ages. I don't know yet whether I can change Vagrant's default behaviour so that it reports the status faster (and avoids timeouts). In the previous message I checked the status 'manually' by issuing an oarstat -u command directly on the g5k frontend to check whether all the jobs were running (which means the VMs are running).
I'm not deploying from the frontend but from my laptop, which could explain why I'm getting the errors. Additionally, yes, vagrant status is painfully slow. As a workaround, I'm launching a script that executes vagrant up for each test machine that is in a failed state.
Next I'm going to try deploying from the frontend to see whether it helps reduce the errors.
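The retry workaround described above could be sketched roughly as follows. This is not the actual script used in this thread, only a minimal illustration. It assumes the plain `vagrant status` row layout shown earlier (`test-8 Error (g5k)`) and that "failed" means the `Error` state. The names `STATUS_ROW`, `machines_in_state` and `retry_machines` are hypothetical helpers introduced for the example.

```ruby
# Matches one row of plain `vagrant status` output, e.g. "test-8   Error (g5k)".
# Assumption: rows look like "<name> <state> (<provider>)", as in the output above.
STATUS_ROW = /^(\S+)\s+(.+?)\s+\((\S+)\)\s*$/

# Returns the names of the machines whose state is listed in `states`.
# Lines that are not status rows (headers, help text) are ignored.
def machines_in_state(status_output, states = ["Error"])
  status_output.each_line.map do |line|
    match = STATUS_ROW.match(line.strip)
    match[1] if match && states.include?(match[2])
  end.compact
end

# Runs `vagrant up <name>` one machine at a time and records whether each
# call succeeded. The runner is injectable so the logic can be exercised
# without Vagrant installed.
def retry_machines(names, runner = ->(name) { system("vagrant", "up", name) })
  names.each_with_object({}) { |name, results| results[name] = runner.call(name) }
end

# Possible usage (not executed here):
#   failed = machines_in_state(`vagrant status`)
#   retry_machines(failed)
```

Bringing the machines up one at a time is slower than a parallel `vagrant up`, but it avoids opening many SSH connections at once, which may or may not be related to the errors seen here.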
@Brandonage you can't use vagrant-g5k from the frontend :)
Ohh, OK. The oarstat line confused me :). I think this is more related to Vagrant itself than to vagrant-g5k. I suppose you can close this issue.
Thanks again for your help!
I'm experiencing an issue when deploying a large number of virtual machines with vagrant-g5k. I configured a Vagrantfile to deploy 100 CentOS VMs with 4 GB of memory and 2 cores (the image can be found at /home/abrandon/public/centos_7.2_dcos.qcow2). Some of the VMs deploy fine, but for some of them I get the error attached below. Of the 100, I only get 34 machines:
usuariop87:vagrant-g5k alvarobrandon$ vagrant status | grep Running | wc -l
34