greenpeace / gpi-tl-hermes

News Aggregations and Sentiment Analysis app

Feed Firebase content into BigQuery #4

Closed krauthex closed 5 years ago

krauthex commented 5 years ago

Description

The Firebase Realtime Database serves as a temporary store for the newest content (e.g. the last week or month), because it is easier to detect and prevent duplicates in Firebase. BigQuery is the archive, so every newly created entry in Firebase gets archived into BQ for later use.

Proposal

A Python script running on GCP as a Cloud Function, triggered by a write, update, and/or create event from the Realtime Database, that writes the new data into BQ.
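
A minimal sketch of what such a function could look like, assuming a 1st-gen background Cloud Function whose `event` carries the new data under `delta` and whose `context.resource` ends in the entry key. The table ID, the function names and the article field names (`title`, `author`, `date`, `body`) are placeholders, not agreed values; the BigQuery client is injectable so the row-building logic can be tried without GCP access.

```python
from datetime import datetime, timezone

# Hypothetical table ID; replace with the real project.dataset.table.
TABLE_ID = "my-project.hermes.articles"


def build_row(resource, delta):
    """Map a Realtime Database change onto a flat-plus-nested BigQuery row."""
    # The last path segment of the resource is assumed to be the entry key.
    entry_id = resource.rstrip("/").split("/")[-1]
    return {
        "ID": entry_id,
        # Archive time, not publication time.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "articleContent": {
            # Field names are assumptions until the article structure is fixed.
            "title": delta.get("title"),
            "author": delta.get("author"),
            "date": delta.get("date"),
            "body": delta.get("body"),
        },
    }


def archive_entry(event, context, client=None):
    """Entry point: stream the changed entry into BigQuery."""
    delta = event.get("delta")
    if not delta:
        # A delete arrives with an empty delta; nothing to archive.
        return None
    if client is None:
        # Imported lazily so the module loads without the GCP library.
        from google.cloud import bigquery
        client = bigquery.Client()
    row = build_row(context.resource, delta)
    # insert_rows_json returns a list of per-row errors (empty on success).
    errors = client.insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    return row
```

Note that an update event only carries the changed fields in `delta`, so a real implementation would probably need to merge it with `event["data"]` or restrict the trigger to create events.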

Updated Proposal with subtasks:

How to test the implementation?

Add something in Firebase and see if the new entries show up in BQ.
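
This manual check could be automated roughly as follows. It is only a sketch: `make_bq_lookup` and `wait_for_entry` are hypothetical helper names, and it assumes the archive table has an `ID` column as in the schema example further down. The polling loop takes the lookup as a plain callable, so it can be exercised without BigQuery.

```python
import time


def make_bq_lookup(client, table_id):
    """Return a callable that reports whether an ID exists in the BQ table."""
    # Imported lazily so the polling helper works without the GCP library.
    from google.cloud import bigquery

    # Table names cannot be query parameters, so table_id is interpolated.
    sql = f"SELECT COUNT(*) AS n FROM `{table_id}` WHERE ID = @id"

    def lookup(entry_id):
        config = bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("id", "STRING", entry_id)
            ]
        )
        rows = list(client.query(sql, job_config=config).result())
        return rows[0]["n"] > 0

    return lookup


def wait_for_entry(lookup, entry_id, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll until lookup(entry_id) is true or the timeout has passed."""
    deadline = time.monotonic() + timeout
    while True:
        if lookup(entry_id):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A test run would write an entry with a known key to Firebase and then call `wait_for_entry(make_bq_lookup(client, table_id), key)`; polling is used because the Cloud Function runs asynchronously after the write.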

Related issue

#2

krauthex commented 5 years ago

So for the table schema we still need to define the actual structure of the nested fields :neutral_face: For now I can create something provisional, but as soon as we know the actual output, we will need to change it.

krauthex commented 5 years ago

To give a short example of what this should look like:

from google.cloud.bigquery import SchemaField

# Provisional schema: top-level ID and timestamp, plus a nested RECORD for the article
hermesSchema = [
    SchemaField('ID', 'STRING', mode='REQUIRED'),
    SchemaField('timestamp', 'TIMESTAMP', mode='REQUIRED'),
    SchemaField('articleContent', 'RECORD', mode='NULLABLE',
                fields=(SchemaField('title', 'STRING'),
                        SchemaField('author', 'STRING'),
                        SchemaField('date', 'DATETIME'),
                        SchemaField('body', 'STRING'))),
]

So we do need a fairly precise understanding of

  1. what the article (meta)data structure looks like
  2. what the output of openCalais/autoML is
  3. anything else we have overlooked so far
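
Until those points are settled, a small check like the one below could flag rows that drift from the provisional schema. It is a sketch only: `HERMES_FIELDS` is a hand-written plain-Python mirror of `hermesSchema` above (so it runs without the BigQuery library), and `check_row` is a hypothetical helper name.

```python
# name -> (type, mode, nested fields or None); mirrors hermesSchema by hand.
HERMES_FIELDS = {
    "ID": ("STRING", "REQUIRED", None),
    "timestamp": ("TIMESTAMP", "REQUIRED", None),
    "articleContent": ("RECORD", "NULLABLE", {
        "title": ("STRING", "NULLABLE", None),
        "author": ("STRING", "NULLABLE", None),
        "date": ("DATETIME", "NULLABLE", None),
        "body": ("STRING", "NULLABLE", None),
    }),
}


def check_row(row, fields=HERMES_FIELDS, path=""):
    """Return a list of problems: missing required fields, unknown keys."""
    problems = []
    for name, (ftype, mode, nested) in fields.items():
        value = row.get(name)
        if value is None:
            if mode == "REQUIRED":
                problems.append(f"{path}{name}: missing required field")
            continue
        if ftype == "RECORD":
            if not isinstance(value, dict):
                problems.append(f"{path}{name}: expected a nested object")
            else:
                # Recurse into the nested record with a dotted path prefix.
                problems.extend(check_row(value, nested, f"{path}{name}."))
    for name in row:
        if name not in fields:
            problems.append(f"{path}{name}: not in schema")
    return problems
```

For example, a row that lacks `ID` and carries an extra `sentiment` key inside `articleContent` yields one problem for each, which is the kind of mismatch to expect once the openCalais/autoML output is added. Value types are not checked here.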