UNMAINTAINED. Prototype CLI app for uploading Snowplow enriched events to BigQuery.
http://snowplowanalytics.com

BigQuery Loader CLI


Overview

A prototype command-line app to upload Snowplow enriched events from local storage to Google BigQuery.

Getting Started

1. Dependencies

You will need:

  1. A Java runtime, to run the loader jar (see step 5)
  2. A Google account, to set up BigQuery (see step 3)
  3. Some Snowplow enriched events in local storage (see step 4)

2. Installing

The app is hosted on Bintray:

> wget http://dl.bintray.com/snowplow/snowplow-generic/bigquery_loader_cli_0.1.0.zip
> unzip bigquery_loader_cli_0.1.0.zip

3. BigQuery setup

First, sign up to BigQuery if you have not already done so, and enable billing.

Second, create a project, and make a note of the Project Number by clicking on the name of the project on the Google Developers Console.

Third, our command-line app will need credentials to access the BigQuery project:

  1. Click on the Consent screen link in the APIs and auth section of the Developer Console, add an Email address and hit Save
  2. Click on the Credentials link in the APIs and auth section
  3. Click on the Create new Client ID button, selecting Installed application as the application type and Other as the installed application type
  4. Click Create Client ID and then Download JSON to save the file
  5. Save the client_secrets file to the same directory that you unzipped the command-line app into
  6. Rename the client_secrets file to client_secrets_<projectId>.json, where <projectId> is the Project Number obtained earlier
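
For example, assuming the downloaded file is named client_secrets.json and your Project Number is 123456789 (a made-up value for illustration), the rename would be:

> mv client_secrets.json client_secrets_123456789.json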

4. Downloading some Snowplow enriched events

Assuming that you are running the Snowplow Hadoop-based data pipeline with EmrEtlRunner, you can quickly retrieve January 2015's enriched events from your S3 archive using the following:

> aws --profile="xxx" s3 cp "s3://xxx-archive/enriched/good/" . --recursive \
    --exclude "*" --include "run=2015-01-*"
> find . -type f -execdir bash -c 'd="${PWD##*/}"; [[ "$1" != "$d-"* ]] && mv "$1" "../$d-$1"' - '{}' \;
> find . -type d -exec rm -d {} \;

The first find flattens the download: it moves each event file up one level, prefixing its filename with the name of the run= directory it came from. The second find then deletes the now-empty run= directories.
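
To illustrate (with a made-up run directory and file name), the flattening turns a path like:

  ./run=2015-01-16/part-00000

into:

  ./run=2015-01-16-part-00000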

5. Uploading a first batch of events

To upload your data, run:

> java -jar bigquery-loader-cli-0.1.0.jar --create-table \
    <projectId> <datasetId> <tableId> <dataLocation>

where:

  1. <projectId> is the Project Number obtained in step 3
  2. <datasetId> identifies the BigQuery dataset which will hold the table
  3. <tableId> is the name of the table to load the events into (created here by the --create-table flag)
  4. <dataLocation> is the local path to the enriched events downloaded in step 4

The first time you run this command, you will be prompted to go through Google's browser-based authentication process.
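
For example, with a made-up Project Number of 123456789, a dataset named snowplow, a table named events, and the events downloaded to ./enriched-events, a first load would look like:

> java -jar bigquery-loader-cli-0.1.0.jar --create-table \
    123456789 snowplow events ./enriched-events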

6. Uploading further batches of events

To append further data to the table simply run the command again, omitting the --create-table flag and changing <dataLocation> as appropriate.
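
Continuing the illustrative example above, a second batch downloaded to ./enriched-events-february would be appended with:

> java -jar bigquery-loader-cli-0.1.0.jar \
    123456789 snowplow events ./enriched-events-february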

Warning: loads are not idempotent. Running the command twice against the same files will result in two copies of the events being added to the table.

Developer Quickstart

Assuming git, Vagrant and VirtualBox are installed:

 host> git clone https://github.com/snowplow/bigquery-loader-cli
 host> cd bigquery-loader-cli
 host> vagrant up && vagrant ssh
guest> cd /vagrant
guest> sbt test
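
Assuming the project uses the sbt-assembly plugin (common across Snowplow's Scala projects, though not confirmed here), you can also build the runnable fat jar from inside the guest:

guest> sbt assembly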

Copyright and license

Copyright 2015 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.