
Google Analytics -> BigQuery streaming

Stream raw hit-level Google Analytics data into BigQuery

Installation

  1. Create a new project at https://console.developers.google.com/project
  2. Create a new dataset in Google BigQuery: https://bigquery.cloud.google.com
  3. Download and install the Google App Engine Python SDK: https://cloud.google.com/appengine/downloads
  4. Clone the repository: git clone https://github.com/lnklnklnk/ga-bq.git
  5. Create a new app from source in the Google SDK
  6. Set the gcloud project: gcloud config set project your-project
  7. Change gifPath in js/gabq.js to ga-tracker-dot-[your-project].appspot.com/collect
  8. Set project_id (your-project), dataset_id (from step 2), and table_id in bqloader.py
  9. Deploy the application: gcloud app deploy app.yaml
  10. Visit ga-tracker-dot-[your-project].appspot.com/tasks/create_bq_table to create the BigQuery table. (If everything goes well, the response is simply "ok".)
  11. Include the plugin on your website: add <script async src="http://ga-tracker-dot-[your-project].appspot.com/js/gabq.js"></script> after the GA snippet, and ga('require', 'gabqplugin'); after ga('create', ...)
  12. Your raw GA data is now collected in the BigQuery table
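Putting steps 7 and 11 together, the on-page integration might look like the sketch below. UA-XXXXX-Y and [your-project] are placeholders you must replace with your own GA tracking ID and App Engine project ID:

```html
<!-- Standard Google Analytics snippet (analytics.js) -->
<script>
  window.ga = window.ga || function () { (ga.q = ga.q || []).push(arguments); };
  ga.l = +new Date;
  ga('create', 'UA-XXXXX-Y', 'auto');   // your GA tracking ID
  ga('require', 'gabqplugin');          // enable the BigQuery plugin (step 11)
  ga('send', 'pageview');
</script>
<script async src="https://www.google-analytics.com/analytics.js"></script>
<!-- Plugin that duplicates each hit to your /collect endpoint (step 11) -->
<script async src="http://ga-tracker-dot-[your-project].appspot.com/js/gabq.js"></script>
```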

Note: Ecommerce data is currently not supported; support will be added soon.

Tuning

If you have more than 1,000 events per minute, you can multiply the number of cron workers by copy-pasting the entry in cron.yaml, e.g.:

cron:
- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins

- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins

- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins  

- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins  

- description: process queue
  url: /tasks/process_queue
  schedule: every 1 mins  

Keep in mind that there is a limit: you cannot lease more than 1,000 rows from the queue at once, so throughput is scaled by the number of cron jobs. With the five entries above, up to 5,000 events can be processed each minute. While experimenting we noticed a limit of about 60 cron jobs, so with this in mind you can scale up to roughly 60,000 events per minute.
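The arithmetic behind this scaling, and the batching a worker has to do to stay under the 1,000-row lease limit, can be sketched in plain Python (the chunk helper below is illustrative only, not part of the project's code):

```python
def chunk(rows, size=1000):
    """Split a list of queued hits into batches of at most `size` rows,
    since a single lease cannot return more than 1,000 rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

# Five cron workers, each leasing up to 1,000 rows per minute:
per_minute = 5 * 1000        # 5,000 events/minute
# At the observed ceiling of ~60 cron jobs:
max_per_minute = 60 * 1000   # 60,000 events/minute

batches = chunk(list(range(2500)))  # 2,500 queued hits -> 3 batches
```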

Troubleshooting

Internal Server error (UnknownQueueError) when sending data to /collect

If you don't see your pull queue under the Pull queues tab of Cloud Tasks in the developer console, try deploying the queue configuration explicitly:

gcloud app deploy queue.yaml
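For reference, a minimal queue.yaml declaring a pull queue might look like the fragment below. The queue name here is only an assumption; use whatever name the app's code actually expects:

```yaml
queue:
- name: gahits   # must match the queue name used by the app
  mode: pull
```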