The goal of this issue is a minimal version of the ETL process in Node.js, Python, and Postgres as recently agreed upon in an RCR dev meeting.
Draft proposal for ETL refactor:
Create an "/admin" secure login page
a. Checks user login against the User table
b. Has a button to trigger the ETL process (eventually SR can run this)
c. Has a mandatory email field to send a report to (will send to RCR team automatically)
d. Page shows user-friendly information by default
e. Page has an option to show logging
Once the job is started:
a. Clear the RunMeta table
b. Clear the RunData table
c. Clear the Import1 and Import2 tables
d. Clear the Staging table
e. Update the RunMeta table with the time and email address of the initiator
f. Invoke the ETL Python script, which does the following:
i. Update the RunData table during each step in the process (verbose logging to help debug)
ii. Phase 1: import from airtables
1) Use the airtables module to query the API for the two tables we need (main and phone)
2) Store the results in tables Import1 and Import2
iii. Phase 2: merge multiple import tables into a single Staging table
1) Perform a query with a result set that is the sum of the two tables
2) Store results in the Staging table
iv. Phase 3: validate data
1) Perform checks, normalize data, and sanitize where needed and updates the Staging table
2) Determine if the data is acceptable or if the process needs to stop
v. Phase 4: geocode addresses
1) Use the google API to pull latitude and longitude for physical addresses
2) Update the Staging table
vi. Phase 5: additional processing against the Staging table
1) Perform additional pre-processing of the data so that client code doesn't need to
g. Compile results into HTML and send a notification email
Update the RCR codebase to pull data from a single table
The goal of this issue is a minimal version of the ETL process in Node.js, Python, and Postgres as recently agreed upon in an RCR dev meeting.
Draft proposal for ETL refactor: