A portable Python workbench for data science, built with Vagrant, VirtualBox and Salt.
Wouldn't it be great if you could magic up a local IPython Notebook server, complete with SciPy, Pandas, Matplotlib, PyMC, scikit-learn, R and Octave integration, and all the usual goodness, and running the latest version of Python, just by typing one line?
vagrant up charmed34
And wouldn't it be great if you could do that from pretty much any machine, and know that you'd get the exact same environment every time?
Well, read on.
Snake Charmer provides an out-of-the-box workbench for data analysis, statistical modelling, machine learning, mathematical programming and visualization.
It is designed to be used primarily via IPython Notebook.
The environment is based on Ubuntu 12.04 and Python 3.4, with the following data science tools included. You are of course free to install any other Python or Ubuntu packages -- or anything else that fits your needs.
Packages marked 'alpha' or 'dev' should be considered experimental, although in many cases they are largely problem-free. We will endeavour to discover and document any known issues here.
† Non-Python tools
‡ Non-Python tools usable via Python wrapper packages
You are, of course, free to remove or upgrade these packages via pip or apt-get as usual, or experiment with additional ones. Please feel free to send pull requests when you get another package working.
Coming soon: Other Python versions. Ubuntu 14.04 LTS.
Potential future additions include: Parakeet, pattern, CrossCat, BayesDB, ggplot, Bokeh, Blaze, numdifftools, PuLP, CVXPY, SysCorr, bayesian, PEBL, libpgm, BayesPy, BayesOpt, mpld3, Pylearn2, nimfa, py-earth, Orange, NeuroLab, PyBrain, annoy, Zipline, Quandl, BNFinder, Alchemy API, xlrd/xlwt, NetworkX, PyMVPA, OpenCV, boto, gbq, SQLite, PyMongo, mpi4py, Jubatus, and one or more Hadoop clients.
If you have suggestions for any other packages to add, please submit them by raising an issue.
Snake Charmer runs IPython and all the associated tools in a sandboxed virtual machine. It relies on Vagrant for creating and managing these, and VirtualBox for running them -- so please go and install those now.
Experienced users of other virtualization platforms can edit the Vagrantfile to use one of these instead, if they prefer.
Windows users: you may also need to install an ssh client if you don't already have one on your system. Also, one Windows user reports that you might need to move git out of your path if Vagrant does not run correctly.
Everything else is installed automatically.
By default, the VM allocates 2048MB of RAM, and a single CPU. If your computer only has 2048MB, you will need to reduce this as described in the customization guide. (If you reduce it too much, however, the VM may not build correctly. Some experimentation may be required.)
It is very much recommended that you run Snake Charmer on a machine with hardware virtualization (i.e. Intel VT-x or AMD-V). It will run on systems without this, but slowly. You may need to enable it in your BIOS, but most modern systems come with it enabled by default.
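On a Linux host, one rough way to check for hardware virtualization support is to look for the vmx (Intel VT-x) or svm (AMD-V) CPU flags in /proc/cpuinfo. The sketch below is illustrative only and is not part of Snake Charmer; the function name is made up for this example. The flag does not by itself prove that the feature is enabled in your BIOS.

```python
def has_hardware_virtualization(cpuinfo_text):
    """Return True if any 'flags' line lists vmx (Intel VT-x) or svm (AMD-V)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the first colon is a space-separated flag list.
            flags = line.split(":", 1)[-1].split()
            if "vmx" in flags or "svm" in flags:
                return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(has_hardware_virtualization(f.read()))
    except OSError:
        # /proc/cpuinfo only exists on Linux hosts.
        print("/proc/cpuinfo is not available on this platform")
```

On OS X or Windows hosts this file does not exist, so you would need to check your system information or BIOS settings instead.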
Check out this git repository:
# Either via ssh...
git clone git@github.com:andrewclegg/snake-charmer.git
# Or via https...
git clone https://github.com/andrewclegg/snake-charmer.git
# Then change into your new Snake Charmer directory...
cd snake-charmer
Start the VM:
vagrant up charmed34
If you're already a Vagrant user, be aware that Snake Charmer's Vagrantfile will attempt to install the vagrant-vbguest plugin automatically.
This command will generally take at least an hour to download and install all the necessary software. When this completes, it will run some tests and then display a message like this:
Your VM is up and running: http://localhost:8834/tree
Later rebuilds will go slightly more quickly, as downloaded package files are cached where possible.
Note: you may get the following warnings after the "up and running" message:
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='/dev/null' mode='w' encoding='ISO-8859-1'>
/usr/local/lib/python3.4/dist-packages/numpy/lib/utils.py:134: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
warnings.warn(depdoc, DeprecationWarning)
These can safely be ignored.
The link shown at the end of the installation procedure will take you to a fully-kitted-out IPython Notebook server. This can be used in exactly the same way as if it was running on your 'real' hardware (i.e. not in a VM). It runs as a Linux service which starts automatically when the VM is booted up, and will be restarted automatically if it ever crashes.
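If you want to check from a script on the host whether the notebook server is answering, something like the following sketch may help. It assumes the default URL shown in the "up and running" message; the function name is hypothetical and this is not a supplied Snake Charmer tool.

```python
import urllib.error
import urllib.request


def notebook_is_up(url="http://localhost:8834/tree", timeout=5):
    """Return True if something answers HTTP requests at the given URL."""
    # Bypass any configured proxy, since the server is on the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        response = opener.open(url, timeout=timeout)
        response.close()
        return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling notebook_is_up() with no arguments probes the default address. A False result while the VM is still building is expected.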
The notebook server acts as if it is running in the 'notebooks' subdirectory (see Folder structure below).
Open the "Hello World" notebook to see a full list of installed packages and other system information. N.B. The notebook server is started with inline graphics enabled for matplotlib, but not the --pylab option, as this is considered harmful.
There is also a "Snake Charmer QA" notebook supplied. This allows you to run the test suites of the major components, but don't run this now! It's a slow process and only needs to be performed if you've customized your VM. See the customization guide for more information.
On a VM that's already been fully configured, vagrant up will just restart it, without going through the full install process.
You can log into the server via
vagrant ssh charmed34
from the same directory, for full command-line control. It's an Ubuntu 12.04 box, under the covers. But you can do most things through the IPython Notebook anyway, so this is rarely essential.
Some more useful commands:
vagrant reload charmed34 # reboot the VM (same as "vagrant up" if it's not running)
vagrant halt charmed34 # shut down the VM, reclaim the memory it used
vagrant destroy charmed34 # wipe it completely, reclaiming disk space too
vagrant suspend charmed34 # 'hibernate' the machine, saving current state
vagrant resume charmed34 # 'unhibernate' the machine
See the Vagrant docs for more details.
The notebook server runs from within the notebooks subdirectory of the current snake-charmer directory, and initially contains a single "Hello World" notebook.
Snake Charmer uses IPython 2, so any subdirectories of notebooks will be visible and navigable as folders in the IPython web interface. However, you can't actually create directories from the web interface yet, so you'd need to log in via ssh, or just enter a shell command into IPython with !.
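As an alternative to a shell escape, a notebook cell can create the folder in plain Python. This is a minimal sketch that assumes the default notebooks location inside the VM (~/notebooks for the vagrant user); the helper name and the folder name in the usage note are invented for illustration.

```python
import os


def make_notebook_subdir(name, base=os.path.expanduser("~/notebooks")):
    """Create a subdirectory under the notebooks folder and return its path."""
    path = os.path.join(base, name)
    # exist_ok=True makes the call safe to repeat if the folder already exists.
    os.makedirs(path, exist_ok=True)
    return path
```

For example, make_notebook_subdir("experiments") should make an "experiments" folder appear in the IPython web interface after a page refresh.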
Vagrant sets up a number of synced folders, which are directories visible to both the VM and the host (your computer). Files placed in these will be visible to both the VM and the host, so this is a good way to make data available to the VMs. If you create more than one VM (feature coming soon!), files in synced folders will be visible to all of them -- apart from /srv/log which is specific to one VM only.
The paths in the left-hand column are relative to the snake-charmer install directory -- your local copy of this repo.
Folder on your computer   Folder within VM         Contents
------------------------  -----------------------  --------
notebooks                 /home/vagrant/notebooks  Any notebooks
data                      /home/vagrant/data       Data you wish to share (initially empty)
.cache                    /srv/cache               Cache for downloaded files
log/charmed34             /srv/log                 Certain setup logs, useful for debugging only
salt/roots/salt           /srv/salt                Config management information (ignore this)
salt/roots/pillar         /srv/pillar              Config management information (ignore this)
These are all configurable via environment variables -- see the customization guide.
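To make the mapping concrete, here is a small sketch that translates a path relative to the snake-charmer directory on the host into the path a notebook would see inside the VM. It hard-codes the default locations from the table above, so it will be wrong if you have changed them; the function and constant names are made up for this example.

```python
import posixpath

# Default synced folders: host path (relative to snake-charmer) -> VM path.
SYNCED_FOLDERS = {
    "notebooks": "/home/vagrant/notebooks",
    "data": "/home/vagrant/data",
    ".cache": "/srv/cache",
    "log/charmed34": "/srv/log",
    "salt/roots/salt": "/srv/salt",
    "salt/roots/pillar": "/srv/pillar",
}


def vm_path(host_relative_path):
    """Return the VM-side path for a host path, or None if it is not synced."""
    # Accept Windows-style separators and collapse things like "a/./b".
    rel = posixpath.normpath(host_relative_path.replace("\\", "/"))
    for host_dir, vm_dir in SYNCED_FOLDERS.items():
        if rel == host_dir:
            return vm_dir
        if rel.startswith(host_dir + "/"):
            return posixpath.join(vm_dir, rel[len(host_dir) + 1:])
    return None
```

So a file saved on the host as data/iris.csv would be opened in a notebook as /home/vagrant/data/iris.csv (the file name is only an example).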
If you get your VM into a mess somehow, you can just type
vagrant destroy charmed34
vagrant up charmed34
to build a new one. Files in synced folders will not be affected if you do this, so you won't lose any data or notebooks. However, any data stored on the VM but outside these synced folders will be lost.
The virtual disk on each VM is configured with an 80GB limit -- it grows to take up real disk space on the host up to this limit, and then stops. But data stored in synced folders does not count towards this. So you will likely never reach the 80GB limit.
If you want to make another folder available to the VM, for example if your datasets are stored on another disk, see the customization guide.
The issues list is generally up to date with any current unresolved problems.
If you get an error during installation along the lines of:
Gem::Installer::ExtensionBuildError: ERROR: Failed to build gem native extension.
then you may be missing a C/C++ compiler, and/or libxml2 and libxslt.
On .deb-based Linuxes like Debian, Ubuntu and Mint, try this:
sudo apt-get install libxml2-dev libxslt-dev build-essential
If you are on Linux and see lots of temporary files and directories with random names like d20140819-2623-folctt, vagrant20140813-28997-ya0isv2 or vagrant20140813-28997-ya0isv2.lock appearing, this may be because Ruby doesn't like the permissions on your /tmp directory.
Try this:
vagrant halt
sudo chmod 1777 /tmp
Then delete all the offending files, and restart the VM.
This is a known Vagrant issue.
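Before changing anything, you can check whether /tmp already has the mode that the chmod command above sets (1777, i.e. world-writable with the sticky bit). A minimal sketch, with a function name invented for this example:

```python
import os
import stat


def has_sticky_world_writable_mode(path="/tmp"):
    """Return True if the path's permission bits are exactly 1777."""
    # S_IMODE keeps the permission bits, including the sticky bit.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o1777


if __name__ == "__main__":
    print(has_sticky_world_writable_mode())
```

If this prints True, the stray files probably have a different cause.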
If a VM starts behaving strangely, the golden rule is: Don't waste time fixing it.
This may sound strange, but the advantage of Snake Charmer is that you can create a factory-fresh VM with almost no effort at all.
The first thing to try is to reboot the VM:
vagrant reload
Option two is reprovisioning the machine. This runs through the install process and ensures all required packages are installed. First, delete the package cache in case anything in there is messed up:
# N.B. Make sure you're in the snake-charmer directory first!
# On OS X or Linux:
rm -rf .cache
# Or on Windows:
rd /s /q .cache
Then reboot and reprovision:
vagrant reload --provision charmed34
If this doesn't fix the problem, then delete the VM completely, and recreate it:
vagrant destroy charmed34
vagrant up charmed34
Finally, you could try deleting your VirtualBox machines and Vagrant configuration files:
VBoxManage controlvm charmed34 poweroff
VBoxManage unregistervm charmed34 --delete
vagrant box remove charmed34
vagrant up charmed34
If this still doesn't fix it, you may have found a bug. Please open a GitHub issue describing it in as much detail as possible, preferably with instructions on how to reproduce it.
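The first three escalation steps above (reboot, reprovision, rebuild) could be scripted along the following lines. This is only a sketch: the function names are hypothetical, it leaves out the .cache deletion and the VBoxManage cleanup, and it defaults to a dry run that just prints the commands.

```python
import subprocess


def recovery_steps(vm="charmed34"):
    """Return the escalating recovery steps as (name, list of commands)."""
    return [
        ("reboot", [["vagrant", "reload", vm]]),
        ("reprovision", [["vagrant", "reload", "--provision", vm]]),
        ("rebuild", [["vagrant", "destroy", vm], ["vagrant", "up", vm]]),
    ]


def run_step(commands, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # Raises CalledProcessError if vagrant exits with an error.
            subprocess.check_call(command)
```

Run it from the snake-charmer directory, trying one step at a time and stopping as soon as the VM behaves again.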
The VirtualBox admin GUI can of course be used to check on the status of VMs, inspect their hardware and network configuration, manually start or stop them, attach via the console, etc.
Only use the host filesystem to store data, notebooks etc. -- that is, the data and notebooks folders which are synced to the VM. If you store files in other places on a VM, they will be lost forever when you destroy it.
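Inside a notebook, a quick guard like the one below can warn you before you write output somewhere that will not survive a destroy. It is a sketch that assumes the default VM-side locations of the data and notebooks folders, expects absolute paths, and does not resolve symlinks; the names are invented for this example.

```python
import posixpath

# Default VM-side locations of the synced data and notebooks folders.
PERSISTENT_DIRS = ("/home/vagrant/notebooks", "/home/vagrant/data")


def survives_destroy(vm_path):
    """Return True if an absolute VM path lies inside a synced folder."""
    # normpath collapses ".." so a path cannot sneak out of the folder.
    path = posixpath.normpath(vm_path)
    return any(path == d or path.startswith(d + "/") for d in PERSISTENT_DIRS)
```

For instance, survives_destroy("/tmp/results.csv") returns False, which is a hint to save under /home/vagrant/data instead.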
The exception is if you want to package up a new Vagrant box with the data and notebooks from your existing environment, e.g. to redistribute. In this case, all the files must be stored within the VM. This is an option for Vagrant power users, primarily.
Snake Charmer VMs are Vagrant VMs, and Vagrant VMs can be published, shared and remotely accessed via various mechanisms. This is discussed in the Snake Charmer FAQ.
Even if you don't know much about VirtualBox, Vagrant or Salt, you can customize your VMs in several ways -- and if you want to tinker with the configuration for these programs directly, the sky's the limit. See the separate customization guide.
See the separate Snake Charmer FAQ.
Developed by Andrew Clegg (Twitter: @andrew_clegg), tested at Pearson.
Thanks to the authors and contributors of all the world-class open source components included, whose hard work has made this possible.
Snake Charmer does not include bundled distributions of its components (Python, Ubuntu, Python libraries, other libraries and packages etc.). Rather, it provides a set of machine-readable instructions for obtaining these components from third-party open-source repositories. Please refer to each individual component's documentation for license details.
Snake Charmer itself is distributed under the Apache License:
Copyright 2014 Andrew Clegg
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.