A prototype command-line app to upload Snowplow enriched events from local storage to Google BigQuery.
You will need:

- Java installed, to run the command-line app
- A Google account, to use BigQuery
The app is hosted on Bintray:
> wget http://dl.bintray.com/snowplow/snowplow-generic/bigquery_loader_cli_0.1.0.zip
> unzip bigquery_loader_cli_0.1.0.zip
First, sign up to BigQuery if you have not already done so, and enable billing.
Second, create a project, and make a note of the Project Number by clicking on the name of the project on the Google Developers Console.
Third, our command-line app will need credentials to access the BigQuery project:
- Download the `client_secrets` file to the same directory that you unzipped the command-line app to
- Rename the `client_secrets` file to `client_secrets_<projectId>.json`, where `<projectId>` is the Project Number obtained earlier
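For example, if your Project Number were 123456789012 (a placeholder value) and the downloaded file were named `client_secrets.json`, the rename would be:

> mv client_secrets.json client_secrets_123456789012.json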
Assuming that you are running the Snowplow Hadoop-based data pipeline with EmrEtlRunner, you can quickly retrieve January's enriched events using the following:

> aws --profile="xxx" s3 cp "s3://xxx-archive/enriched/good/" . --recursive \
    --exclude "*" --include "run=2015-01-*"
> # Prefix each file with the name of its run= directory and move it up one level.
> # Note: with -execdir, {} expands to ./filename, so strip the leading ./ first
> find . -mindepth 2 -type f -execdir bash -c 'd="${PWD##*/}"; f="${1#./}"; mv "$f" "../$d-$f"' - '{}' \;
> # Remove the now-empty run= directories
> find . -mindepth 1 -depth -type d -exec rmdir {} \;
To upload your data, simply run:
> java -jar bigquery-loader-cli-0.1.0 --create-table \
    <projectId> <datasetId> <tableId> <dataLocation>
where:

- `<projectId>` is the Project Number obtained from the Google Developers Console
- `<datasetId>` is the name of the dataset, which will be created if it doesn't already exist
- `<tableId>` is the name of the table, which will be created if it doesn't already exist
- `<dataLocation>` is the location of either a single file of Snowplow enriched events, or an un-nested folder of Snowplow enriched events

The first time you run this command, you will be prompted to go through Google's browser-based authentication process.
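For example, a first load against the files retrieved above might look like this (the Project Number, dataset and table names here are placeholders; substitute your own):

> java -jar bigquery-loader-cli-0.1.0 --create-table \
    123456789012 snowplow enriched_events .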
To append further data to the table, simply run the command again, omitting the `--create-table` flag and changing `<dataLocation>` as appropriate.
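Continuing the hypothetical example above, appending a later batch of enriched events would look like:

> java -jar bigquery-loader-cli-0.1.0 \
    123456789012 snowplow enriched_events ./february-events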
Warning: loads are not idempotent. Running the command twice against the same files will result in two copies of the events being added to the table.
Assuming you have git, Vagrant and VirtualBox installed:
host> git clone https://github.com/snowplow/bigquery-loader-cli
host> cd bigquery-loader-cli
host> vagrant up && vagrant ssh
guest> cd /vagrant
guest> sbt test
Copyright 2015 Snowplow Analytics Ltd.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.