Tabula

Tabula helps you liberate data tables trapped inside PDF files.

Why Tabula?

If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful this is — you can’t easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple web interface:

{TODO: screenshot / screencast here}

Caveat: Tabula only works on text-based PDFs, not scanned documents.

Amazon EC2 AMI

An Amazon EC2 AMI is provided so you can quickly boot a test server: ami-e895f081

You can find a simple how-to in docs/ami-install.md.

Caveats

Note the EC2 instance types and EC2 pricing. We’re not responsible for any costs this may incur.

Also, please note that this image is a development demo image and may not be secure. Using this AMI for mission-critical or sensitive documents is currently not recommended.

Manual Installation (OS X or Linux)

(Note: A comprehensive, mostly copy-and-paste set of instructions is available for OS X users who don't normally do Ruby development but are interested in bootstrapping Tabula on their own computer: docs/osx-simple-bootstrap.md)

Install Ruby and JRuby. Tabula has been tested with Ruby 1.9.3 and JRuby 1.7.3. We highly recommend using rbenv to manage your Ruby versions, as rvm is a bit finicky. (JRuby is required to interface with pdfbox, but native Ruby must also be used since ruby-opencv is a natively compiled extension.)

If using rbenv:
```
rbenv install 1.9.3-p392
rbenv install jruby-1.7.3
```
(Mac OS X only) Download and install XQuartz: https://xquartz.macosforge.org/landing/

Install the rest of the dependencies: (TODO: instructions for non-OSX platforms.)

```
# Install Python, setuptools, and pip.  You can skip this
# if you already have them.
brew install python
curl http://python-distribute.org/distribute_setup.py | python
curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python

# Install numpy (feel free to put it in a virtualenv); opencv dependency
pip install numpy

# Add the "science" tap to Homebrew so it can find OpenCV (if you haven't already)
brew tap homebrew/science

brew install opencv --with-tbb --with-opencl --with-qt
brew install mupdf redis
```
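
Before moving on, it can help to confirm that the tools installed so far are actually on your PATH. The following is a minimal sketch, not part of Tabula: `missing_commands` is a hypothetical helper, and the list of command names passed to it is an assumption you should adjust to your own setup.

```shell
# Print each command from the argument list that is not found on PATH.
# Returns 0 when everything is present, 1 otherwise.
# (missing_commands is a hypothetical helper, not something Tabula ships.)
missing_commands() {
    mc_status=0
    for mc_cmd in "$@"; do
        if ! command -v "$mc_cmd" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$mc_cmd"
            mc_status=1
        fi
    done
    return "$mc_status"
}

# Example: check some of the tools named in the steps above.
# (This list is an assumption; extend or trim it as needed.)
missing_commands brew python pip redis-server || echo "Install the missing tools before continuing."
```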

Download Tabula and install the Ruby dependencies. (Note: ensure that rbenv is configured for the standard Ruby interpreter, not JRuby)
```
git clone https://github.com/jazzido/tabula.git
cd tabula

gem install bundler
bundle install
```
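
Since `bundle install` must run under the standard interpreter, a quick check of the active Ruby can save a failed install. This is a sketch only: `is_jruby` is a hypothetical helper that relies on `ruby -v` output starting with `jruby` when JRuby is active. If it reports JRuby, rbenv users would typically switch with something like `rbenv local 1.9.3-p392`.

```shell
# Succeeds when the given `ruby -v` output line comes from JRuby.
# (is_jruby is a hypothetical helper, not part of Tabula or rbenv.)
is_jruby() {
    case "$1" in
        jruby*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example: warn before running `bundle install` under the wrong interpreter.
ruby_version=$(ruby -v 2>/dev/null || true)
if is_jruby "$ruby_version"; then
    echo "Active Ruby is JRuby; switch to the standard interpreter first."
fi
```
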
Configure Tabula: Copy local_settings-example.rb to local_settings.rb. Edit local_settings.rb and set JRUBY_PATH to the path to the jruby executable.

If you are using rbenv, you can find the path to jruby by running:
```
RBENV_VERSION='jruby-1.7.3' rbenv which jruby
```
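
The copy step can be sketched as a small shell function that refuses to overwrite an existing settings file. `init_local_settings` is a hypothetical name used for illustration. Setting JRUBY_PATH is still done by hand in an editor, because the exact layout of local_settings.rb is not assumed here.

```shell
# Copy the example settings file into place unless local_settings.rb exists.
# $1 is the Tabula checkout directory (defaults to the current directory).
# (init_local_settings is a hypothetical helper, not part of Tabula.)
init_local_settings() {
    ils_dir=${1:-.}
    if [ -e "$ils_dir/local_settings.rb" ]; then
        echo "local_settings.rb already exists; leaving it untouched."
    else
        cp "$ils_dir/local_settings-example.rb" "$ils_dir/local_settings.rb"
    fi
}

# Example, run from inside the checkout:
#   init_local_settings .
#   RBENV_VERSION='jruby-1.7.3' rbenv which jruby   # paste this path into JRUBY_PATH
```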

Starting the Server (Dev)

Start redis-server in a separate terminal tab:

```
redis-server /usr/local/etc/redis.conf
```

Next, you need to start resque and the actual web server. You can run both of those using Foreman by running the following:

```
bundle exec foreman start
```

The site instance should now be viewable at http://127.0.0.1:9292/
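
The server may take a few seconds to come up after Foreman starts. One way to wait for it is a small retry loop around any probe command. This is a generic sketch: `retry` is a hypothetical helper, and the `curl` probe in the usage comment assumes `curl` is installed.

```shell
# Run a command until it succeeds, up to $1 attempts, sleeping $2 seconds
# between attempts. Returns 0 on success, 1 if every attempt failed.
# (retry is a hypothetical helper, not part of Tabula or Foreman.)
retry() {
    r_attempts=$1
    r_delay=$2
    shift 2
    r_i=1
    while [ "$r_i" -le "$r_attempts" ]; do
        if "$@"; then
            return 0
        fi
        r_i=$((r_i + 1))
        if [ "$r_i" -le "$r_attempts" ]; then
            sleep "$r_delay"
        fi
    done
    return 1
}

# Example: poll the dev server for up to about 10 seconds.
#   retry 10 1 curl -fsS -o /dev/null http://127.0.0.1:9292/
```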

Contributing

Interested in helping out? See TODO.md for ideas.
