
Big Data for Chimps Example Code

The first step is to clone this repo:

git clone --recursive http://github.com/mrflip/big_data_for_chimps-code.git bd4c-code

TODO: change the git address when we move the repo

You will now see a directory called bd4c-code.

Everything below (apart from one quick step) should take place in the bd4c-code/cluster/ directory. DO NOT USE THE bd4c-code/docker/ DIRECTORY -- that is for generating the docker containers, and you will want to use the pre-validated ones to start off.

Dockering I: Preliminaries

Prerequisites

Running under boot2docker

Port Forwarding

By forwarding selected ports from the Boot2Docker VM to the OSX host, you'll be able to ssh directly to the machines from your regular terminal window, and will be able to directly browse the various web interfaces. It's highly recommended, but you need to pause the boot2docker VM for a moment to accomplish this. Let's do that now before we dive in:

boot2docker down
rake docker:open_ports

You're going to need a bigger boat.

While you have the VM down, you should also increase the amount of memory allocated to the VM. In VirtualBox Manager, select your boot2docker-vm and hit 'Settings'. Under the System tab you will see the base memory slider -- set it to at least 4GB, but to no more than 30-50% of the physical RAM on your machine.
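
If you'd rather not click through the GUI, the same change can be made from the terminal while the VM is down. This is just a sketch using the stock VirtualBox CLI; it assumes the VM has the default name boot2docker-vm and that 4GB is right for your machine:

VBoxManage modifyvm boot2docker-vm --memory 4096   # memory is in MB; adjust to taste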

The default 20GB virtual hard drive allocated for boot2docker will be a bit tight, but it's a pain in the butt to resize, so you might as well wait until it actually becomes a problem.

DOCKER_HOST and DOCKER_IP environment variables

Bring boot2docker back up with

boot2docker up

When you run boot2docker up, it will either tell you that you have the env variable set already (hooray) or else tell you the env variable to set. You should not only set its value in your current terminal session, you should add it and a friend to your .bashrc file. The address will vary depending on circumstances, but using the current value on my machine I have added these lines:

export DOCKER_IP=192.168.59.103
export DOCKER_HOST=tcp://$DOCKER_IP:2375

The DOCKER_IP variable isn't necessary for docker, but it will be useful for working at the commandline -- when we refer to $DOCKER_IP in the following we mean just that bare IP address of the docker<->host bridge.

Install the script dependencies

We'll need a couple of common dependencies for the scripts we'll use. Using a reasonably modern version of ruby (1.9.2 or later; 2.0+ preferred):

gem install bundler
bundle install
rake ps

If your ruby environment is good, the last command will give output similar to running docker ps -a.

Pull in the containers

The first step will be to pre-seed the containers we'll use. This is going to bring in more than 4 GB of data, so don't do this at a coffee shop, and do be patient.

rake images:pull

You can do the next step while that proceeds.

Minor setup needed on the docker host

The namenode insists on being able to resolve the hostnames of its clients -- something that is far more complicated in Dockerland than you'd expect. We have a pretty painless solution, but it requires a minor intervention.

On the docker host (boot2docker ssh, or whatever else it takes):

boot2docker ssh                          # or however you get onto the docker host
mkdir -p          /tmp/bulk/hadoop       # view all logs there
sudo touch        /var/lib/docker/hosts  # so that docker-hosts can make container hostnames resolvable
sudo chmod 0644   /var/lib/docker/hosts
sudo chown nobody /var/lib/docker/hosts

Leave a terminal window open on the docker host, as we'll do a couple more things over there.

Wait until the pull completes

Don't proceed past this point until the rake images:pull has succeeded. Time for some rolly-chair swordfighting!

Dockering II: Start it Up!

Preliminaries Complete!

You're ready to proceed when rake images:pull has finished successfully and the setup on the docker host above (the /var/lib/docker/hosts file and the /tmp/bulk/hadoop directory) is in place.

Alright! Now the fun starts.

Start the helpers cluster

The helpers cluster holds the gizmo that will socialize hostnames among all the containers, so we will bring it up first.

rake helpers:run

If everything works, rake ps will show the helper container up and running, and the /var/lib/docker/hosts file you prepared on the docker host will start filling with container hostnames.

Instantiate the data containers

First we will lay down a set of data-only containers. These wonderful little devices will make the cluster come to life fully populated with data on both the HDFS and local filesystem.

rake data:create show_output=true

A torrent of filenames will fly by on the screen as the containers copy data from their internal archive onto the shared volumes the cluster will use. data_gold, the filesystem-local version of the data, will have directories about sports, text, airlines and ufos. data_outd, for output data, will be empty (that's your job, to fill it). data_hdfs0 will be a long streak of things in current/ with large integers in their name. The contents of data_nn are tiny but so-very-precious: it's the directory that makes sense of all those meaningless filenames from the data node. Lastly, the home_chimpy volume will have a lot of git and pig and ruby and asciidoc files. It's what you paid the big bucks for right there: the code and the book.

At this point, running rake ps will show five containers, all in the stopped state. Wait, what? Yes, these are supposed to be in the stopped state -- all they do is anchor the data through docker magic. That also means they don't appear if you run docker ps -- you have to run docker ps -a to see them (that's why we tell you to run rake ps, which includes this flag by default).
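
To see the difference yourself, or to peek inside one of the shared volumes, something along these lines works. The image and container names here are assumptions -- substitute whatever rake ps actually reports:

docker ps        # running containers only: the data containers won't appear here
docker ps -a     # includes stopped containers: the five data containers show up
  # mount data_gold's volumes into a throwaway container and list what's there
docker run --rm --volumes-from data_gold bd4c/baseimage ls /data/gold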

Run the cluster

You've laid the groundwork. You've been introduced. Now you're ready to run the compute containers:

rake hadoop:run

Running rake ps will now show 12 containers: one helper, the five data containers just seen, plus the hadoop compute containers -- the namenode, secondary namenode, resource manager, worker and lounge among them.

Hue Web console.

The friendly Hue console will be available at http://$DOCKER_IP:9001/ in your browser (using the $DOCKER_IP value from earlier). The login and password are both 'chimpy'. (Ignore any whining it does about Oozie or Pig not working -- those are just front-end components we haven't installed.)
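
If the browser just spins, a quick check from your OSX terminal will tell you whether the port is answering at all (plain curl, nothing project-specific):

curl -sI http://$DOCKER_IP:9001/ | head -n 1    # any HTTP status line means Hue is listening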

SSH terminal access

You will also spend some time commanding a terminal directly on the machine. Even if you're not one of the many people who prefer the commandline way, in the later chapters you'll want to peek under the covers of what's going on within each of the machines. SSH across by running

ssh -i insecure_key.pem chimpy@$DOCKER_IP -p 9022

All of the nodes in the cluster are available for ssh. Using the normal SSH port of 22 as a mnemonic, we've set each container up in ascending centuries: the command above (port 9022) lands you on the lounge, 9122 maps to the first worker's port 22, and 9222 is reserved for a second worker, if you have the capacity.
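
If you'll be hopping on and off these machines a lot, an entry in your ~/.ssh/config saves typing. This is only a sketch: the host alias is made up, and it assumes the 9022-to-lounge mapping described above, plus your own paths and $DOCKER_IP value.

  # ~/.ssh/config
Host bd4c-lounge
  HostName 192.168.59.103              # your $DOCKER_IP
  Port 9022
  User chimpy
  IdentityFile /path/to/insecure_key.pem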

The dangerous thing we did that you need to know about

We've done something here that usually violates taste, reason and safety: the private key that controls access to the container is available to anyone with a browser. To bring that point home, the key is named insecure_key.pem. Our justification is that these machines are (a) designed to work within the private confines of a VM, without direct inbound access from the internet, and (b) are low-stakes playgrounds with only publicly redistributable code and data sets. If either of those assumptions becomes untrue -- you are pushing to the docker cloud, or using these machines to work with data of your own, or whatever -- then we urge you to construct new private/public keypairs specific only to each machine, replacing the /root/.ssh/authorized_keys and /home/chimpy/.ssh/authorized_keys files. (It's those latter files that grant access; the existing key must be removed and a new one added to retain access.) It's essential that any private keys you generate be unique to these machines: it's too easy to ship a container to the wrong place or with the wrong visibility at the current maturity of these tools. So don't push in the same key you use for accessing work servers or github or docker or the control network for your secret offshore commando HQ.
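
For what it's worth, rolling a fresh key for (say) the lounge looks roughly like this. The key filename is made up, the port assumes the lounge mapping above, and the one-liner assumes you still have access with the insecure key:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/bd4c_lounge -N ''
  # overwrite authorized_keys with the new public key (this revokes the insecure key)
ssh -i insecure_key.pem -p 9022 chimpy@$DOCKER_IP 'cat > ~/.ssh/authorized_keys' < ~/.ssh/bd4c_lounge.pub
  # from now on:
ssh -i ~/.ssh/bd4c_lounge -p 9022 chimpy@$DOCKER_IP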

I WANT TO SEE DATA GO, YOU PROMISED

Right you are. There are tons of examples in the book, of course, but let's make some data fly right now and worry about the details later.

See pig in action

On hadoop:

cd book/code/
  # This file isn't on the HDFS right now, so put it there:
hadoop fs -mkdir -p /data/gold/geo/ufo_sightings
hadoop fs -put      /data/gold/geo/ufo_sightings/ufo_sightings.tsv.bz2 /data/gold/geo/ufo_sightings/ufo_sightings.tsv.bz2
  # Run, pig, run!
pig -x mapred 04-intro_to_pig/a-ufo_visits_by_month.pig
  # See the output:
hadoop fs -cat /data/outd/ufos/sightings_hist/\* > /tmp/sightings_hist.tsv
  # Whaddya know, they're the same!
colordiff -uw /data/outd/ufos/sightings_hist-reference.tsv /tmp/sightings_hist.tsv && echo 'No difference'

Locally!

  # execute all examples from the code directory (i.e. not the directory holding the script)
  # also note that at this moment you are running something in ~/book/code (the book repo) and not ~/code
cd book/code
  # Need to remove the output directory -- check that there's nothing in it, then remove it
ls /data/outd/ufos/sightings_hist
rm -rf /data/outd/ufos/sightings_hist
  # Run, pig, run
pig -x local 04-intro_to_pig/a-ufo_visits_by_month.pig
   # Look ma, just what we predicted!
colordiff -uw /data/outd/ufos/sightings_hist{-reference.tsv,/part*} && echo 'No difference'

Troubleshooting

The rake tasks are just thin scripts around the docker command: they print each command before running it, and print it again afterwards if it failed.

Checklist

Is your commandline environment complete and correct?

Check that you know where you are:

Check that docker is happy:

Check that ruby is happy:

Check that your gems are installed correctly:

Check that rake and the rakefile are basically sane:
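
None of these checks are done for you by the rakefile; here is a plausible minimal pass over the list above, using only generic commands:

pwd                   # should end in bd4c-code/cluster
echo $DOCKER_HOST     # should read tcp://$DOCKER_IP:2375
docker ps             # the docker client can reach the daemon
ruby -v               # 1.9.2 or later, 2.x preferred
bundle check          # every gem from the Gemfile is installed
rake -T               # rake loads the Rakefile and lists its tasks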

Do you have the right images?

Is the helpers cluster running?

If not, the docker-hosts project is at https://github.com/blalor/docker-hosts

Citizens of the future: it's quite likely that docker has evolved a superior solution to the hostnames problem by now, in which case this helper may be the cause of a conflict rather than the solution.

If you can't get the helpers cluster running correctly, you can instead update the /etc/hosts file on each container.

Here is what mine looks like right now, with a single worker running:

127.0.0.1   localhost   localhost4
172.17.0.107    host-filer
172.17.0.119    nn
172.17.0.120    snn
172.17.0.121    rm
172.17.0.122    worker00
172.17.0.123    lounge
::1 localhost   localhost6  ip6-localhost   ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Those will not be the actual IP addresses -- there are instructions for finding them, below.
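
One way to look up a container's address and, if you must, push an entry into another container's hosts file. The container names here are assumptions -- use whatever rake ps shows -- and editing /etc/hosts this way needs root inside the container and won't survive a restart:

docker inspect --format '{{ .NetworkSettings.IPAddress }}' hadoop_worker
docker exec hadoop_nn sh -c 'echo "172.17.0.122  worker00" >> /etc/hosts'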

What matters most is that on the namenode, all worker IPs resolve to hostnames and vice-versa:

chimpy@nn:~$ getent hosts 172.17.0.122
172.17.0.122    worker00
chimpy@nn:~$ getent hosts worker00
172.17.0.122    worker00
chimpy@nn:~$ getent hosts google.com
173.194.115.73  google.com
173.194.115.69  google.com

Are the data volumes in place?

Is the correct data present?

These totals will probably have changed somewhat since the last edit of the readme, but the relative sizes should still resemble the above.

Access the lounge

Is the HDFS working?

The direct namenode console at http://$DOCKER_IP:50070/dfshealth.html#tab-overview should open and return content. If so, the namenode is working and you can reach it.

Safemode is off.
92 files and directories, 69 blocks = 161 total filesystem object(s).
Heap Memory used 96.32 MB of 160.5 MB Heap Memory. Max Heap Memory is 889 MB.
Non Heap Memory used 34.32 MB of 35.44 MB Committed Non Heap Memory. Max Non Heap Memory is 130 MB.
DFS Used:   209.48 MB
Non DFS Used:   31.35 GB
DFS Remaining:  25.07 GB
DFS Remaining%: 44.27%
Live Nodes  1 (Decommissioned: 0)
Dead Nodes  0 (Decommissioned: 0)

If SSH or web access works from the docker machine but not from its host machine, port forwarding is probably not set up correctly.

Is the Resource Manager working?
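
A couple of hedged checks: the commands are stock Hadoop, and 8088 is YARN's default resource manager web port, which may or may not be what this cluster exposes:

yarn node -list                          # from the lounge: worker00 should be listed as RUNNING
curl -s http://rm:8088/cluster | head    # from inside the cluster: the RM web UI should answer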

Is the datanode working?

On the worker machine:
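
A couple of generic checks; the log directory is the /bulk/hadoop/log volume mounted into the workers, and the exact file names will vary:

ps aux | grep -i '[d]atanode'                # is the datanode daemon actually running?
ls /bulk/hadoop/log/                         # its logs land on the shared log volume
tail -n 40 /bulk/hadoop/log/*datanode*.log   # any complaints about reaching the namenode?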

Troubleshooting

Example Straight-Hadoop job

If the machines seem to be working and the daemons seem to be running, this is a test of whether Hadoop itself works:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 1 100000

Docker stuff

rake -P will list all the things rake knows how to do.

    docker run                                    \
      -p 9122:22 -p 8042:8042 -p 50075:50075      \
      -v /tmp/bulk/hadoop/log:/bulk/hadoop/log:rw \
      --volumes-from data_hdfs0                   \
      --link hadoop_rm:rm --link hadoop_nn:nn     \
      --name hadoop_worker.tmp                    \
      --rm -it bd4c/hadoop_worker

Halp my docker disk is full

The combined size of all the compute images (baseimage, hadoop_base, hadoop_nn, hadoop_snn, hadoop_rm, hadoop_worker, hadoop_lounge) is a bit under 3GB -- the hadoop_* images are all built on hadoop_base, and so share its common footprint of data.

The data volumes take up about 1-2GB more. These are representative sizes:

Filesystem                Size      Used Available Use% Mounted on
rootfs                    5.2G    204.6M      5.0G   4% /
...
/dev/sda1                26.6G      4.0G     19.7G  15% /mnt/sda1/var/lib/docker/aufs

The rake docker:rmi_blank command will remove all images that are not part of any tagged image. If you are building and rebuilding containers, the number of intermediate layers from discarded early versions can start to grow; rake docker:rmi_blank removes those, leaving all the named layers you actually use.
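
Under the hood that amounts to something like the following -- a sketch, not necessarily the rakefile's exact incantation:

docker rmi $(docker images -q --filter dangling=true)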

If you have cleared out all the untagged images, and checked that logs and other foolishness isn't the problem, you might be falling afoul of a bug in current versions of docker (1.3.0). It leads to large numbers of dangling volumes -- github/docker issue #6534 has workarounds.