This is a scraper for the matrixbooking api, to scrape bookings data and associated metadata into an Athena database, to allow for easier analysis with automatically updating data. The approach is modelled on the scraper for the Occupeye api. While it's geared towards the automated scrape use case, the functions are intended to be transparent enough to allow for interactive analysis and modification. In addition, there is a notebook with some useful functions to act as a scratchpad.
This script is used to define and create the Athena database, matrix_db
. It creates the tables; defines the columns in the tables; exports the schema into JSON (more on that later) and finally creates the database. It remains in this repo both to demostrate how it was defined, as well as to allow for an easy interface if the schema needs to be changed - just edit the script and run again to rebuild (though you'll likely need to rescrape the data if you do so).
The database contains three tables: bookings, locations and joined_rooms. Bookings and locations are updated via the scraper.
Joined_rooms is (currently) a manually-uploaded CSV, which maps the location_id that represents a booking across multiple rooms to the locations of the constituent rooms. For example, in 102PF, conference rooms 1A, 1B and 1C can be combined.
This is the main function for use in Airflow. This defines a command line function to scrape a specified day's worth of bookings. In the context of the Airflow task, it's designed to scrape bookings that occurred (or should have occurred) yesterday. These daily snapshots are combined into an Athena database, which is then queried and combined with the Occupeye db via a CTAS query to create an app db that the matrixbooking app (https://github.com/moj-analytical-services/matrixbooking) will query in turn.
This contains the main functions for scraping data from the API. The API documentation is here: https://developers.matrixbooking. Note that at the time of writing, the method of authentication is different from what is described in the documentation. The api url is https://app.matrixbooking.com/api/v1 rather than https://api.matrixbooking.com, and authentication is controlled by POSTing to api/v1/users/login, and receiving a cookie in return. That occurs in the matrix_authenticate(session)
function.
The main function, scrape_days_from_api
, works as follows:
get_bookings_df
, which:
pd.io.json.json_normalize
metadata/bookings_renames.json
, which is a dictionary of key/value pairs corresponding respectively to the source column names and the desired Athena database column namesbookings_renames.json
, and renames them to the values.{start_date}.parquet
Confirmed / cancelled / tentative - if you specify these statuses in the booking API call, then it only returns a subset of actual bookings - if you don't it only returns non-cancelled ones. We don't include status because it doesn't return everything.
Bookings table - https://app.matrixbooking.com/api/v1/booking
Locations table - https://app.matrixbooking.com/api/v1/org/43/locations
Note: The location API data is only a snapshot of current locations, there is no historic location data available from the API. This means that the historical location data should not be overwritten. (See Issue #56 for more details)
This script constructs the column rename files, so if they need to change (if the API and/or desired Athena schema changes), edit this script and rerun.
This script contains a single function, composed of several CTAS queries, which will delete and rebuild a synthesised database that matrixbooking can query. This database, matrixbooking_app_db
contains the following tables
This involves a slightly complex transformation, due to the need to deal with joined rooms. For bookings on multiple rooms, we need the location_id of the individual room in order to link to Occupeye, but still need to know that they're from joined rooms. It therefore starts with a subquery that left-joins the joined_rooms table, adds columns for the joined name and split name, and id, and uses coalesce() to use the split_id from joined_rooms if available, or just the location_id otherwise. The query then selects everything from this table, inner joins on the locations table to add the capacity and longQualifier (the zone and building), then inner joins on the Occupeye sensors metadata.
More simple. This just inner joins matrix_db.locations
to occupeye_db_live.sensors
, to get metadata on rooms from both matrix and occupeye.
Gets the sensor observations from Occupeye, attaches the metadata from the sensors table, and inner joins on matrix_db.locations
so it only get rooms on matrix.
This gets the survey name, start date and end date from the Occupeye surveys joined to the matrixbooking_app_db.locations
, to get just the occupeye surveys that have matrix data (and hence can be selected in the app).