A Jupyter notebook extension to centralize and manage data
https://juneau.readthedocs.io/
Apache License 2.0
The Juneau Project

The past decade has brought a sea change in the availability of data. Instead of a small number of carefully curated data sources in a centralized database, we now have a plethora of datasets, data versions, and data representations that span users, groups, and organizations. Devices and data acquisition tools make it easy to acquire new data, cloud hosting makes it easy to centralize and share files, and cloud data analytics and machine learning tools have driven a desire to integrate and extract value from that data.

We have been missing management tools to centralize and capture such data resources. Data scientists often end up doing redundant work because they have no effective way of finding appropriate resources to reuse and retarget to new applications.

The Juneau Project develops holistic data management tools to find, standardize, and benefit from the existing resources in the data lake. This extension to Jupyter Notebook is a point of access for our dataset management tools.

For more on the project, please see the project home, as well as our research papers:

Setup

Prerequisites: relational and graph databases

Simple Version

Clone the repository with Git and build the Juneau Docker image:

docker build -t juneau -f docker/Dockerfile .

Now that we have built Juneau's image, run the three services (Postgres, Neo4j, and Juneau) via docker-compose:

docker-compose -f docker/docker-compose.yaml up

That's it! As usual, open the link that Jupyter prints in the terminal.

Simple Version Using PennProv

Install Docker, including docker-compose, for your preferred operating system.
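
Before building, it can help to confirm that both tools are actually on your PATH. The following is a minimal sketch; the helper name missing_tools is made up for illustration and is not part of Juneau.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not something Juneau ships.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report (without aborting) if Docker or docker-compose is unavailable.
missing=$(missing_tools docker docker-compose)
if [ -n "$missing" ]; then
  echo "Install these before continuing: $missing" >&2
fi
```

If nothing is printed, both commands were found and the build step should be able to start.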

These will use the default user IDs and passwords from config.yaml. You should change the passwords.

Custom Version

First, be sure you have installed:

Then set up a default user ID and password for each:

Now either edit the YAML file at juneau/config/config.yaml to match your password and account info, or set the corresponding environment variables in your terminal.

Sample data lake corpus

Next, download juneau_start.zip and unzip it.

For the Docker container, you can import as follows:

Otherwise, you can use:

Finally, edit the neo4j.conf file to set the database to data.db.
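
As a rough sketch of that edit, the function below sets the active database in a given neo4j.conf. It assumes a Neo4j 3.x installation, where the setting is named dbms.active_database; later Neo4j releases renamed this setting, so check the documentation for your version. The function name set_active_database and the example path in the comment are illustrative only.

```shell
# Set dbms.active_database in a neo4j.conf file (Neo4j 3.x setting name;
# this is an assumption about your Neo4j version).
# Usage: set_active_database /path/to/neo4j.conf data.db
set_active_database() {
  conf=$1
  db=$2
  if grep -q '^#\{0,1\}dbms\.active_database=' "$conf"; then
    # Replace the existing line, whether or not it is commented out.
    sed -i.bak "s|^#\{0,1\}dbms\.active_database=.*|dbms.active_database=$db|" "$conf"
  else
    # No such line yet: append one.
    printf 'dbms.active_database=%s\n' "$db" >> "$conf"
  fi
}
```

When a line is replaced, sed leaves a backup copy next to the file with a .bak suffix. Restart Neo4j afterwards so the change takes effect.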

Install Jupyter Notebook extensions

See the Developer's Guide for details.

Install SQL UDFs

Copy the postgres directory into your hab-postgres docker container:

Log into your hab-postgres container with the interactive terminal:

apt update
apt install -y postgresql-server-dev-15
apt install -y gcc g++

mkdir /juneau_funcs/
cd /juneau_funcs/
cd join_size/c
cc -fPIC -c -I /usr/include/postgresql/15/server/ join_score.cpp score.cpp
cc -shared -o join_score.so join_score.o score.o
cd ../../sketch/c/ks
cc -fPIC -c -I /usr/include/postgresql/15/server/ ks.cpp hist.cpp evaluate.cpp
cc -shared -o ks.so ks.o hist.o evaluate.o
cd ../lshe
cc -fPIC -c -I /usr/include/postgresql/15/server/ -Ifnv/ fnv/hash_64a.c evaluate.cpp hash.cpp lshe.cpp probability.cpp sig.cpp
cc -shared -o lshe.so hash_64a.o evaluate.o hash.o lshe.o probability.o sig.o
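
Before loading the SQL files, it may be worth confirming that the three shared objects were actually produced. The sketch below lists any that are missing; the relative paths follow the build commands above, and the helper name missing_udf_libs is hypothetical.

```shell
# Print the path of each expected shared object that does not exist
# under the given root (defaults to /juneau_funcs).
# missing_udf_libs is an illustrative helper, not part of Juneau.
missing_udf_libs() {
  root=${1:-/juneau_funcs}
  for lib in join_size/c/join_score.so sketch/c/ks/ks.so sketch/c/lshe/lshe.so; do
    [ -f "$root/$lib" ] || printf '%s\n' "$root/$lib"
  done
}
```

Running missing_udf_libs inside the container with no arguments checks /juneau_funcs; empty output means all three libraries were built.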

Then run psql -U postgres and:

\i /juneau_funcs/join-size/sql/initialize_join_score.sql
\i /juneau_funcs/sketch/sql/initialize_sketch.sql