statbotData

The project

This repo is part of the Statbot Swiss Project. The project has the following parts:

This repo takes care of the following two tasks:

flowchart TD
    Fed[Federal Datasets] -->|Preprocess| Scripts[Statbot scripts]
    Scripts -->|Upload| DB(Statbot postgres DB)
    Can[Cantonal Datasets] -->|Preprocess| Scripts
    User -->|Ask Questions| Chatbot(Statbot Swiss)
    Chatbot -->|query| DB

Content of the repo

The repo consists of the following parts:

Statbot input data

The data/const directory consists of two files:

Dataset Input Data

The statbot input data file statbot_input_data.csv describes datasets that are available as open data and have the following properties:

classDiagram
    Dataset <|-- PxCube
    Dataset <|-- CsvFile
    Dataset <|-- ExcelSheet
    Dataset <|-- LINDAS
    Dataset : data_indicator
    Dataset : status
    Dataset : name
    Dataset : format
    Dataset : lang
    Dataset : publisher
    Dataset : organization
    Dataset : +create_dataset()
    Dataset : +download_data()
    class PxCube{
      px_id
      access_url
      size
      encoding
    }
    class CsvFile{
      download_url
    }
    class ExcelSheet{
      download_url
    }
    class LINDAS{
      query_url
    }
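
To illustrate the structure, here is a purely hypothetical row of statbot_input_data.csv, built from the Dataset attributes in the class diagram above (all values are invented for illustration and are not an actual entry):

library(tibble)

# hypothetical example row; the column names follow the Dataset class above
tribble(
  ~data_indicator, ~status, ~name, ~format, ~lang, ~publisher, ~organization,
  "A1", "remote", "Abstimmungsvorlagen seit 1971", "px", "de", "BFS", "bund"
)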

Mandatory properties:

Optional properties for the px file format: the following parameter needs to be specified:

Additional optional parameters for this file format are:

By default the R package BFS is used to retrieve the data. It gets the data via API calls in batches. In the case of a large dataset the API limit might be an obstacle. Therefore, if the dataset is classified as large, the package pxRRead is used for the download instead.
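
A minimal sketch of this default download path for a px cube, assuming the BFS package is installed (the px id below is a placeholder, not a real entry of the input data):

library(BFS)

# fetch a px cube in batches via the BFS API (the default, non-large case)
px_id <- "px-x-0000000000"   # placeholder for the px_id from statbot_input_data.csv
cube <- bfs_get_data(number_bfs = px_id, language = "de")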

Properties for the data access:

Statbot Postgres DB

One goal of this repo is to fill the Statbot Postgres DB. This database has the following tables:

erDiagram
    metadata_tables
    metadata_table_columns
    dataset_observation }o--|| spatial_unit : spatial_scope

dataset tables: each dataset is uploaded into exactly one table in the Statbot Postgres DB. All dataset tables contain these two mandatory columns:

Besides these tables, there are extra tables.

spatial unit table: contains spatial units for all administrative units in Switzerland. spatialunit_uid is the primary key in that table; the spatialunit_uid column in the dataset tables serves as a foreign key to it.

extra metadata tables: there are two extra metadata tables; their task is to provide additional metadata for the dataset tables.

metadata_tables

This metadata table has the following fields:

metadata_table_columns

This metadata table has the following fields:

Statbot Training Data

The statbot training data consists of questions in natural language and the corresponding SQL queries. The queries relate to a single input dataset that has been uploaded to the Statbot Postgres DB. The queries don't have joins, except for one possible join on the spatial_unit table.

Below are two examples of the Statbot Training Data:

-- Wieviele Abstimmungsvorlagen gab es im Jahr 2000? (How many ballot proposals were there in the year 2000?)
SELECT COUNT(*) as anzahl_abstimmungsvorlagen_im_jahr_2000
FROM abstimmungsvorlagen_seit_1971 as T
JOIN spatial_unit as S on T.spatialunit_uid = S.spatialunit_uid
WHERE S.name_de='Schweiz' AND S.country=TRUE AND T.jahr=2000;

-- Wieviele Abstimmungsvorlagen gab es seit dem Jahr 1971? (How many ballot proposals have there been since 1971?)
SELECT COUNT(*) as anzahl_abstimmungsvorlagen
FROM abstimmungsvorlagen_seit_1971 as T
JOIN spatial_unit as S on T.spatialunit_uid = S.spatialunit_uid
WHERE S.name_de='Schweiz' AND S.country=TRUE;

Statbot Chatbot

The Chatbot is not yet publicly available and is currently under active development. More information on this part of the project will be added later.

The Pipelines

The pipelines are intended to process datasets that have been selected in the input dataset list statbot_input_data.csv and prepare them for the Statbot Postgres DB. Besides the dataset preprocessing script, the pipeline directory also contains the training data (see Statbot Training Data).

Content of the pipelines

Each pipeline has a directory in pipelines named after the data_indicator (from the input data). It consists of the following files:

Working on the pipelines

To work on a pipeline, open the file <data_indicator>.R and run the pipeline, adjusting where needed. Once you are happy with ds$postgres_export, you can open a script in the scripts directory to upload the dataset to postgres.
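
A rough sketch of such a pipeline run, assuming a hypothetical data_indicator "A1" (the exact signatures and return values of the package functions may differ; see the man pages):

library(statbotData)

# build the dataset object from its entry in statbot_input_data.csv
ds <- create_dataset("A1")
# download the raw data in the format specified in the input data
ds <- download_data(ds)
# ... dataset specific preprocessing follows here ...
# the pipeline ends once ds$postgres_export holds the tibble for postgres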

More details on the file types in the pipelines

queries.sql

The queries.sql files contain the Statbot Training Data for each dataset. See the Statbot Training Data section above for a detailed description of these files.

queries.log

These files are the output of a run of statbotData::testrun_queries(ds) on the queries in the queries.sql file of the dataset. A timestamp indicates the date and time of the run. The file also records whether the queries were run on the remote postgres instance (REMOTE) or on the local ds$postgres data frame.
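
For example, after editing queries.sql the log can be regenerated by rerunning the queries on the dataset object (minimal usage sketch):

# rerun the training queries of the dataset and inspect the resulting queries.log
statbotData::testrun_queries(ds)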

metadata_tables.csv

These are metadata for a table in postgres. They are loaded into the postgres metadata table metadata_tables, see section on the Statbot Postgres DB above.

metadata_table_columns.csv

These are metadata for the table columns in postgres. They are loaded into the postgres metadata table metadata_table_columns, see Statbot Postgres DB for details.

Scripts

The code for these chores is always the same and consists of calling the appropriate functions of the statbotData package. It does not depend on the specific pipeline or dataset, but on the situation and the status of the dataset. Therefore this code has been extracted from the pipelines and put into a directory of its own, with a selection of sample scripts that relate to certain scenarios:

Access to the Statbot Postgres DB

In order to run the functions, pipelines and scripts in this repo, access to the Statbot Postgres DB is needed. This access is set up via an environment file:

cp sample.env .env

Afterwards you need to fill in your credentials in the .env file. Otherwise all functions that need postgres access won't work.
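
The statbotData functions read these credentials themselves; the sketch below merely illustrates how such a .env based postgres connection can look in R. The variable names are assumptions and may differ from those in sample.env:

library(DBI)
library(RPostgres)

# load the credentials from .env (variable names are assumptions)
dotenv::load_dot_env(".env")
con <- dbConnect(
  Postgres(),
  host = Sys.getenv("POSTGRES_HOST"),
  dbname = Sys.getenv("POSTGRES_DB"),
  user = Sys.getenv("POSTGRES_USER"),
  password = Sys.getenv("POSTGRES_PASSWORD")
)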

StatbotData package

The statbotData package consists of reusable functions that are used in the pipelines, the app and the scripts. There are man pages for the package that explain the usage and arguments of these functions in more detail. This section just provides a short overview of these functions.

Function to create the dataset class

Functions for data download and preprocessing

Functions to run if ds$postgres_export has been generated

ds$postgres_export is the processed dataset that can be uploaded to the postgres instance (see Statbot Postgres DB). Building it is the goal of each of the pipelines. Certain functions in statbotData perform operations on this tibble, in case the dataset is not yet in the status uploaded or remote:

Functions that run on the remote postgres instance and concern one dataset

Functions that run on the remote postgres instance and concern one postgres table

Functions for listing content of the remote postgres instance

Functions for the mappings of spatial units

Shiny App

The repo also contains an app, app.R. It can be started from RStudio and provides an overview of all elements in the repo:
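
To launch it outside of RStudio, the app can also be started from an R session in the repo root (assuming the shiny package is installed):

# start the overview app from the repository root
shiny::runApp("app.R")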