infochimps-labs / ironfan

Chef orchestration layer -- your system diagram come to life. Provision EC2, OpenStack or Vagrant without changes to cookbooks or configuration
http://infochimps.com

knife cluster <cluster name> facet name --bootstrap fails #97

Closed Cindia-blue closed 12 years ago

Cindia-blue commented 12 years ago

I tried to launch the mysql_demo cluster with the .rb file below in my clusters dir:

ClusterChef.cluster 'mysql_demo' do
  cloud(:ec2) do
    defaults
    availability_zones ['us-east-1d']
    flavor              't1.micro'
    backing             'ebs'
    image_name          'natty'
  end

  role                  :base_role
  role                  :chef_client

  facet :sqlsever do
    instances           1
    role                :mysql_server

  end

  facet :sqlclient do
    instances           1
    role                :mysql_client

  end
end

When I run "knife cluster mysql_demo sqlsever --bootstrap", I get the message below:

WARNING: Bad interval: mysql_demo-sqlsever-0
Nothing to report
WARNING: No nodes to bootstrap, exiting

When I show the cluster, I can see that an sqlsever instance has been created, but with "no" in the Chef? column... it seems no Chef client has been set up on the newly created node...

Is there anything I should do to resolve this, or more details I can provide for diagnosis?

mrflip commented 12 years ago

There's no chef node because the bootstrap didn't run...

Can you try running knife cluster bootstrap mysql_demo sqlsever 0 ?

The error message for 'knife cluster mysql_demo sqlsever --bootstrap' is a bit confusing: you have the command in the wrong place. You should say knife cluster bootstrap mysql_demo-sqlsever-0

Also note: you have spelled "sqlsever" without an 'r' -- is it possible you're having some conflict between 'sqlsever' and 'sqlserver'?

Cindia-blue commented 12 years ago

if I input "knife cluster bootstrap mysql_demo sqlclsever 0" got ERROR: TypeError: can't convert nil into String

if I input "knife cluster bootstrap mysql_demo sqlsever mysql_demo-sqlsever-0" mysql_demo-sqlsever-0 is name the mysql server node. still got: WARNING: Bad interval: mysql_demo-sqlclient-0 @knife_common: display Nothing to report WARNING: No nodes to bootstrap, exiting

Yes, I missed the 'r'. If I change the facet to sqlserver, run "knife cluster launch mysql_demo sqlserver", and then input "knife cluster bootstrap mysql_demo sqlserver mysql_demo-sqlserver-0", I get:

WARNING: Bad interval: mysql_demo-sqlserver-0 @knife_common: display
Nothing to report
WARNING: No nodes to bootstrap, exiting

mrflip commented 12 years ago

You're still typing in too many things. It's

knife cluster COMMAND CLUSTER [FACET] [SERVER_INDEXES]

# all in mysql_demo cluster sqlserver facet
knife cluster launch mysql_demo sqlserver  
knife cluster bootstrap mysql_demo sqlserver  

# the first node in mysql_demo cluster sqlserver facet
knife cluster launch mysql_demo sqlserver 0  
knife cluster bootstrap mysql_demo sqlserver 0

Here's what the robot thinks when you type the command you described:

knife cluster bootstrap mysql_demo sqlserver mysql_demo-sqlserver-0
knife cluster COMMAND   CLUSTER    FACET     OOPS_WANTED_A_NUMBER
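To see why that last argument breaks, here's a minimal plain-Ruby sketch -- not cluster_chef's actual argument handling, just an illustration:

# Positional args are read as COMMAND, CLUSTER, FACET, then numeric server indexes.
cmd, cluster, facet, *indexes = %w[bootstrap mysql_demo sqlserver mysql_demo-sqlserver-0]
Integer(indexes.first)
# => ArgumentError: invalid value for Integer(): "mysql_demo-sqlserver-0"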
Cindia-blue commented 12 years ago

Tried again; below is the message:

root@ubuntu:~# knife cluster bootstrap mysql_demo sqlserver
Inventorying servers in mysql_demo cluster, sqlserver facet, all servers
Hello World
+------------------------+----------+------------+-----------------+--------------+------------+-------+-----------+---------+--------------+----------+
| Name                   | Env      | AZ         | Created At      | Private IP   | InstanceID | Chef? | relevant? | State   | Public IP    | Flavor   |
+------------------------+----------+------------+-----------------+--------------+------------+-------+-----------+---------+--------------+----------+
| mysql_demo-sqlserver-0 | _default | us-east-1d | 20120118-093845 | 10.202.54.41 | i-8297ade0 | no    | true      | running | 23.20.20.207 | t1.micro |
+------------------------+----------+------------+-----------------+--------------+------------+-------+-----------+---------+--------------+----------+

Running bootstrap on mysql_demo-sqlserver-0...

Bootstrapping the node redoes its initial setup -- only do this on an aborted launch. Are you absolutely certain that you want to perform this action? (Type 'Yes' to confirm) Yes

ERROR: TypeError: can't convert nil into String
root@ubuntu:~#

pcn commented 12 years ago
+------------------------+----------+------------+-----------------+--------------+------------+-------+-----------+---------+--------------+----------+
| Name                   | Env      | AZ         | Created At      | Private IP   | InstanceID | Chef? | relevant? | State   | Public IP    | Flavor   |
+------------------------+----------+------------+-----------------+--------------+------------+-------+-----------+---------+--------------+----------+
| mysql_demo-sqlserver-0 | _default | us-east-1d | 20120118-093845 | 10.202.54.41 | i-8297ade0 | no    | true      | running | 23.20.20.207 | t1.micro |
+------------------------+----------+------------+-----------------+--------------+------------+-------+-----------+---------+--------------+----------+

I've noticed that "can't convert nil into String" and similar issues come up a lot in cluster_chef, and they're difficult because what they're really saying is "at some point I was expecting a value, but instead the thing that was supposed to produce it returned nil or an empty list; an exception is being raised, but the context to explain it is lacking at the moment."

I re-formatted your output to point out what I think is the most important part. The "Chef?" column says "no", which means there's no client registered. If you're using the code from the master branch, this happens when the bootstrap doesn't complete, IIRC. You should kill this EC2 instance and re-try the cluster launch command; it should get farther. In version 3 the client creation has been moved earlier in the process.
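In cluster_chef terms that would be something like the following (assuming the kill subcommand takes the same CLUSTER FACET INDEX arguments as launch and bootstrap):

knife cluster kill mysql_demo sqlserver 0
knife cluster launch mysql_demo sqlserver 0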

In chef, the client defines permissions and the node defines attributes. In this case, what's probably happening is that when chef tries to operate on the node, it finds no node that is both part of the correct security group and has a client registered with the chef server, so it returns nil, which then bombs out.
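As a minimal sketch of that failure mode (plain Ruby, not cluster_chef's actual code):

ssh_dir = nil                        # a property the discovery step never filled in
File.join(ssh_dir, 'knife.pem')      # => TypeError: can't convert nil into String (ruby 1.8)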

If you want more info with most of these, you can run

$ knife cluster -VV bootstrap mysql_demo sqlserver 

and that should provide a backtrace.

Hope that helps.

-Peter

Cindia-blue commented 12 years ago

Thanks, my cluster_chef is 3.0.10, installed from homebase. I tried hadoop_demo with the command shown below and got the following -- the problem is with ssh, right? Any suggestions?

root@ubuntu:~/chef-repo# knife cluster -VV bootstrap hadoop_demo master 0
DEBUG: Using configuration from /root/.chef/knife.rb
Inventorying servers in hadoop_demo cluster, master facet, servers 0
INFO: Loading cluster /root/chef-repo/homebase/clusters/hadoop_demo.rb
DEBUG: Signing the request as root
DEBUG: Sending HTTP Request via GET to 172.16.234.140:4000/search/client
DEBUG: Signing the request as root
DEBUG: Sending HTTP Request via GET to 172.16.234.140:4000/search/node
DEBUG: Using fog to catalog all servers
DEBUG: Using fog to catalog all volumes
DEBUG: Volume paired: root on hadoop_demo-master-0 (vol-71e8d31c @ /dev/sda1)
+------------+----------+----------------------+------------+-----------------+--------------+---------------+-------+------------+--------------+-----------+---------+-------------+----------+-------------+
| Elastic IP | Env      | Name                 | AZ         | Created At      | Volumes      | Private IP    | Chef? | InstanceID | Image        | relevant? | State   | SSH Key     | Flavor   | Public IP   |
+------------+----------+----------------------+------------+-----------------+--------------+---------------+-------+------------+--------------+-----------+---------+-------------+----------+-------------+
|            | _default | hadoop_demo-master-0 | us-east-1d | 20120119-142106 | vol-71e8d31c | 10.205.13.217 | no    | i-30424f52 | ami-fd589594 | true      | running | hadoop_demo | t1.micro | 50.17.32.29 |
+------------+----------+----------------------+------------+-----------------+--------------+---------------+-------+------------+--------------+-----------+---------+-------------+----------+-------------+

Running bootstrap on hadoop_demo-master-0...

Bootstrapping the node redoes its initial setup -- only do this on an aborted launch. Are you absolutely certain that you want to perform this action? (Type 'Yes' to confirm) Yes

/usr/lib/ruby/gems/1.8/gems/cluster_chef-3.0.10/lib/cluster_chef/cloud.rb:82:in `join': can't convert nil into String (TypeError)
    from /usr/lib/ruby/gems/1.8/gems/cluster_chef-3.0.10/lib/cluster_chef/cloud.rb:82:in `ssh_identity_file'
    from /usr/lib/ruby/gems/1.8/gems/cluster_chef-knife-3.0.10/lib/chef/knife/knife_common.rb:120:in `bootstrapper'
    from /usr/lib/ruby/gems/1.8/gems/cluster_chef-knife-3.0.10/lib/chef/knife/knife_common.rb:130:in `run_bootstrap'
    from /usr/lib/ruby/gems/1.8/gems/cluster_chef-knife-3.0.10/lib/chef/knife/cluster_bootstrap.rb:63:in `perform_execution'
    from /usr/lib/ruby/gems/1.8/gems/cluster_chef-3.0.10/lib/cluster_chef/server_slice.rb:23:in `each'
    from /usr/lib/ruby/gems/1.8/gems/cluster_chef-3.0.10/lib/cluster_chef/server_slice.rb:23:in `each'
    from /usr/lib/ruby/gems/1.8/gems/cluster_chef-knife-3.0.10/lib/chef/knife/cluster_bootstrap.rb:62:in `perform_execution'
    from /usr/lib/ruby/gems/1.8/gems/cluster_chef-knife-3.0.10/lib/chef/knife/generic_command.rb:56:in `run'
    from /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/lib/chef/knife.rb:391:in `run_with_pretty_exceptions'
    from /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/lib/chef/knife.rb:166:in `run'
    from /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/lib/chef/application/knife.rb:128:in `run'
    from /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/bin/knife:25
    from /usr/bin/knife:19:in `load'
    from /usr/bin/knife:19

pcn commented 12 years ago

Yes, the problem now is with ssh'ing into the server -- your client installation doesn't know which ssh key to use. I've actually never quite had this work as intended for me; in the end I got a lot of help in https://github.com/infochimps/cluster_chef/issues/95 and have a working patch that lets me specify the ssh key directory, the ssh keypair, and the cluster name as separate properties. You can try the patch listed there on your gem, and use this in your cluster definition:

  cloud do
    ssh_identity_dir    File.expand_path('~/.ssh/')
    backing             data['default_backing_store']
    image_name          data['default_release_flavor']
    flavor              data['default_instance_flavor']
    availability_zones  data['default_availability_zones']
    bootstrap_distro    data['default_bootstrap_template'] # 'ubuntu11.04-cluster_chef_knewton'
    keypair             data['keypair'] 
    data['default_security_group_list'].each do |g|
      security_group      "#{g}"
    end
  end

I'm getting my values from a data bag, but you can fill in those values by hand and get the correct result.
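For instance, filled in by hand it might look like this (a sketch: the flavor, image, backing, and AZ values come from the cluster definition at the top of this thread; the keypair and security group names are hypothetical):

  cloud do
    ssh_identity_dir    File.expand_path('~/.ssh/')
    backing             'ebs'
    image_name          'natty'
    flavor              't1.micro'
    availability_zones  ['us-east-1d']
    keypair             'my_keypair'     # must match a keypair registered in EC2
    security_group      'my_group'       # hypothetical group name
  end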

Until I made the above changes, I jumped through some hoops to get this to work. Some more documentation, or an example of how this works at infochimps, would make this clearer.

Also, I don't think this is related to your problem, but if you're using ruby 1.8 you may want to use rvm (see http://beginrescueend.com) to set up a 1.9 environment for yourself. See https://github.com/infochimps/cluster_chef/issues/80 for what I ran into -- that was mostly on the target node, but I think I hit the same kind of issue on the launching node at some point.
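(If rvm is already installed, the usual steps are just "rvm install 1.9.2" followed by "rvm use 1.9.2 --default".)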

Also, I found I get better outcomes with chef-0.10.6. There was some problem with 0.10.8 that I can't recall at the moment.

Cindia-blue commented 12 years ago

Thanks for these suggestions. I set the SSH attributes (keypair "knife" and the identity dir), then launched and bootstrapped again. "knife node list" now returns the master node, but the Chef? column is still "no". I found the client pem is indeed under client_keys.

Below is a log; bootstrap fails on authorization. If I preset a password for the "ubuntu" user, the connection gets built, but Chef is still "no" and the body of the log looks the same.

DEBUG: Using configuration from /root/.chef/knife.rb
Inventorying servers in hadoop_demo cluster, master facet, servers 0
INFO: Loading cluster /root/chef-repo/homebase/clusters/hadoop_demo.rb
DEBUG: Signing the request as root
DEBUG: Sending HTTP Request via GET to 172.16.234.140:4000/search/client
DEBUG: Signing the request as root
DEBUG: Sending HTTP Request via GET to 172.16.234.140:4000/search/node
DEBUG: Using fog to catalog all servers
DEBUG: Using fog to catalog all volumes
DEBUG: Volume paired: root on hadoop_demo-master-0 (vol-e57e7a88 @ /dev/sda1)
+------------+----------+----------------------+------------+-----------------+--------------+--------------+-------+------------+--------------+-----------+---------+---------+----------+---------------+
| Elastic IP | Env      | Name                 | AZ         | Created At      | Volumes      | Private IP   | Chef? | InstanceID | Image        | relevant? | State   | SSH Key | Flavor   | Public IP     |
+------------+----------+----------------------+------------+-----------------+--------------+--------------+-------+------------+--------------+-----------+---------+---------+----------+---------------+
|            | _default | hadoop_demo-master-0 | us-east-1d | 20120120-052840 | vol-e57e7a88 | 10.204.29.71 | no    | i-30222b52 | ami-fd589594 | true      | running | knife   | t1.micro | 107.21.173.99 |
+------------+----------+----------------------+------------+-----------------+--------------+--------------+-------+------------+--------------+-----------+---------+---------+----------+---------------+

Running bootstrap on hadoop_demo-master-0...

Bootstrapping the node redoes its initial setup -- only do this on an aborted launch. Are you absolutely certain that you want to perform this action? (Type 'Yes' to confirm)
Bootstrapping Chef on ec2-107-21-173-99.compute-1.amazonaws.com
DEBUG: Looking for bootstrap template in /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/lib/chef/knife/bootstrap
DEBUG: Found bootstrap template in /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/lib/chef/knife/bootstrap
DEBUG: Adding ec2-107-21-173-99.compute-1.amazonaws.com
DEBUG: establishing connection to ec2-107-21-173-99.compute-1.amazonaws.com:22
DEBUG: connection established
INFO: negotiating protocol version
DEBUG: remote is `SSH-2.0-OpenSSH_5.8p1 Debian-1ubuntu3'
DEBUG: local is `SSH-2.0-Ruby/Net::SSH_2.1.4 i686-linux'
DEBUG: read 840 bytes
DEBUG: received packet nr 0 type 20 len 836
INFO: got KEXINIT from server
INFO: sending KEXINIT
DEBUG: queueing packet nr 0 type 20 len 556
DEBUG: sent 560 bytes
INFO: negotiating algorithms
DEBUG: negotiated:

Finished! Current state:
+------------+----------------------+----------+------------+-----------------+--------------+--------------+--------------+------------+-------+---------+---------+---------------+----------+
| Elastic IP | Name                 | Env      | AZ         | Created At      | Volumes      | Private IP   | Image        | InstanceID | Chef? | State   | SSH Key | Public IP     | Flavor   |
+------------+----------------------+----------+------------+-----------------+--------------+--------------+--------------+------------+-------+---------+---------+---------------+----------+
|            | hadoop_demo-master-0 | _default | us-east-1d | 20120120-052840 | vol-e57e7a88 | 10.204.29.71 | ami-fd589594 | i-30222b52 | no    | running | knife   | 107.21.173.99 | t1.micro |
+------------+----------------------+----------+------------+-----------------+--------------+--------------+--------------+------------+-------+---------+---------+---------------+----------+

pcn commented 12 years ago

You have a keypair in EC2 that chef cannot discover. This is what my patch fixes -- it lets you have a keypair in AWS called, e.g., "my_keypair", set "my_keypair" in your cloud properties, and have ~/.ssh/id_my_keypair be the private key.
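Concretely, the lookup convention the patch assumes is roughly this (a sketch, not the patch's exact code):

keypair       = 'my_keypair'    # the name registered with EC2
identity_file = File.join(File.expand_path('~/.ssh'), "id_#{keypair}")
# => ~/.ssh/id_my_keypair -- must hold the private half of that EC2 keypair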

Given the name of the keypair -- "knife" -- I have to ask: have you run ssh-keygen to create the keypair "knife", such that ~/.ssh/knife and ~/.ssh/knife.pub exist, and have you uploaded ~/.ssh/id_knife.pub in the EC2 part of the AWS console?
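For reference, the flow I mean is roughly (assuming OpenSSH's ssh-keygen, and naming the file to match the id_<keypair> convention above):

ssh-keygen -f ~/.ssh/id_knife    # writes id_knife (private) and id_knife.pub (public)

and then importing ~/.ssh/id_knife.pub under Key Pairs in the EC2 console.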

Cindia-blue commented 12 years ago

Yes, the "knife" key I used is created by AWS so there is no public key under .ssh. For patch side, I update discovery.rb according to #95.

This time I ran ssh-keygen locally and generated a keypair named "cluster" under ~/.ssh (it produced two files, cluster and cluster.pub), then imported the pub key from my AWS console. I revised the SSH attributes: keypair "cluster" and dir "~/.ssh". Below is the latest bootstrap log: the cluster public key is actually examined this time, but it still fails... Is there anything wrong with my use of ssh-keygen? (I entered the keypair name and a passphrase during the process.)

DEBUG: Using configuration from /root/.chef/knife.rb
Inventorying servers in hadoop_demo cluster, master facet, servers 0
INFO: Loading cluster /root/chef-repo/homebase/clusters/hadoop_demo.rb
DEBUG: Signing the request as root
DEBUG: Sending HTTP Request via GET to 172.16.234.142:4000/search/client
DEBUG: Signing the request as root
DEBUG: Sending HTTP Request via GET to 172.16.234.142:4000/search/node
DEBUG: Using fog to catalog all servers
DEBUG: Using fog to catalog all volumes
DEBUG: Volume paired: root on hadoop_demo-master-0 (vol-afdfcac2 @ /dev/sda1)
+------------+----------+----------------------+------------+-----------------+--------------+---------------+-------+------------+--------------+-----------+---------+---------+----------+--------------+
| Elastic IP | Env      | Name                 | AZ         | Created At      | Volumes      | Private IP    | Chef? | InstanceID | Image        | relevant? | State   | SSH Key | Flavor   | Public IP    |
+------------+----------+----------------------+------------+-----------------+--------------+---------------+-------+------------+--------------+-----------+---------+---------+----------+--------------+
|            | _default | hadoop_demo-master-0 | us-east-1d | 20120125-013443 | vol-afdfcac2 | 10.245.74.151 | no    | i-2b6d994e | ami-fd589594 | true      | running | cluster | t1.micro | 50.17.139.25 |
+------------+----------+----------------------+------------+-----------------+--------------+---------------+-------+------------+--------------+-----------+---------+---------+----------+--------------+

Running bootstrap on hadoop_demo-master-0...

Bootstrapping the node redoes its initial setup -- only do this on an aborted launch. Are you absolutely certain that you want to perform this action? (Type 'Yes' to confirm)
Bootstrapping Chef on ec2-50-17-139-25.compute-1.amazonaws.com
DEBUG: Looking for bootstrap template in /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/lib/chef/knife/bootstrap
DEBUG: Found bootstrap template in /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/lib/chef/knife/bootstrap
DEBUG: Adding ec2-50-17-139-25.compute-1.amazonaws.com
DEBUG: establishing connection to ec2-50-17-139-25.compute-1.amazonaws.com:22
DEBUG: connection established
INFO: negotiating protocol version
DEBUG: remote is `SSH-2.0-OpenSSH_5.8p1 Debian-1ubuntu3'
DEBUG: local is `SSH-2.0-Ruby/Net::SSH_2.1.4 i686-linux'
DEBUG: read 840 bytes
DEBUG: received packet nr 0 type 20 len 836
INFO: got KEXINIT from server
INFO: sending KEXINIT
DEBUG: queueing packet nr 0 type 20 len 556
DEBUG: sent 560 bytes
INFO: negotiating algorithms
DEBUG: negotiated:

Finished! Current state:
+------------+----------------------+----------+------------+-----------------+--------------+---------------+--------------+------------+-------+---------+---------+--------------+----------+
| Elastic IP | Name                 | Env      | AZ         | Created At      | Volumes      | Private IP    | Image        | InstanceID | Chef? | State   | SSH Key | Public IP    | Flavor   |
+------------+----------------------+----------+------------+-----------------+--------------+---------------+--------------+------------+-------+---------+---------+--------------+----------+
|            | hadoop_demo-master-0 | _default | us-east-1d | 20120125-013443 | vol-afdfcac2 | 10.245.74.151 | ami-fd589594 | i-2b6d994e | no    | running | cluster | 50.17.139.25 | t1.micro |
+------------+----------------------+----------+------------+-----------------+--------------+---------------+--------------+------------+-------+---------+---------+--------------+----------+

pcn commented 12 years ago
ubuntu@ip-10-212-113-185:/etc/chef$ sudo chmod 777 *
ubuntu@ip-10-212-113-185:/etc/chef$ vi client.rb
ubuntu@ip-10-212-113-185:/etc/chef$ chef-client

Why did you do this? You should be running chef as root, so you should be doing two things:

1) Kill the currently running chef-client (ps -ef | grep chef-client to find the pid)
2) Run sudo chef-client --once

Otherwise, it looks like you're pretty much in the realm of having a working chef setup, and you just need to work on getting and tweaking the right cookbooks now.

pcn commented 12 years ago

If you are behind any kind of firewall or NAT, the EC2 instance won't be able to contact your chef server. You may want to start with the free tier of Opscode Hosted Chef for your testing.
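One concrete observation: the chef server address in your logs (172.16.234.140, later 172.16.234.142) is a private RFC 1918 address, so an EC2 instance out on the public internet has no route back to it. A quick check from the instance (assuming curl is available) is "curl -v http://172.16.234.140:4000/" -- if it hangs rather than returning any HTTP response at all, this is your problem.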

Cindia-blue commented 12 years ago

I just installed ruby 1.9.2 and am wondering how to downgrade chef to 0.10.6... Should I uninstall the old gems and restart from installing the chef gem and chef server? It would be very nice if there were a way to achieve this elegantly. Thanks.

Going another way, I created a new Ubuntu server and built it up from the very beginning: this time I installed ruby 1.9.3 and chef-0.10.6, then installed the chef server via chef-solo, which runs up normally and generates validation.pem. Please share a stable environment stack for cluster_chef install and dev. Thanks.

pcn commented 12 years ago

Installing/uninstalling gems can be done via "gem install" and "gem uninstall" (which you already know, but it leads to this): you can also tell rubygems to pin packages at a certain version:

gem install chef --version "= 0.10.6" --no-ri --no-rdoc
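So to answer the downgrade question directly, something like this should work (untested sketch):

gem uninstall chef --version "= 0.10.8"
gem install chef --version "= 0.10.6" --no-ri --no-rdoc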

Regarding the server, for now I'm also using hosted chef so that I don't have to deal with server issues. You may want to start with that, and then build your own server after validating that cluster_chef is working for you.

Cindia-blue commented 12 years ago

Thanks for these suggestions. I set up cluster_chef 3.0.12 on the server and the bootstrap now goes through. Thanks.