Adjust postgres DB schema and reduce storage needed

There is a lot potential to improve the database schema in order to reduce storage needs / disk size. This will help to reduce costs. At the moment database size mainly depends on two tables results and tasks. Also the indexes take up a ton of space as well. (See overview below.)

 schema_name |          relname           |  size   |  table_size
-------------+----------------------------+---------+--------------
 public      | results                    | 124 GB  | 133230313472
 public      | results_pkey               | 94 GB   | 100788797440
 public      | tasks                      | 68 GB   |  73426419712
 public      | results_userid             | 63 GB   |  67274162176
 public      | results_projectid          | 50 GB   |  53498740736
 public      | results_taskid             | 49 GB   |  52157546496
 public      | results_groupid            | 32 GB   |  33935441920
 public      | results_timestamp_date_idx | 31 GB   |  33730682880
 public      | tasks_pkey                 | 19 GB   |  20713758720
 public      | tasks_task_id              | 11 GB   |  11902173184
 public      | tasks_projectid            | 7699 MB |   8073166848
 public      | tasks_groupid              | 5203 MB |   5455806464
 public      | groups                     | 116 MB  |    121364480
 public      | groups_pkey                | 61 MB   |     64421888
 public      | groups_goupid              | 37 MB   |     38715392
 public      | groups_projectid           | 21 MB   |     21667840
 public      | users                      | 5328 kB |      5455872
 public      | spatial_ref_sys            | 4584 kB |      4694016
 public      | users_pkey                 | 4472 kB |      4579328
 public      | users_userid               | 4376 kB |      4481024
 public      | projects                   | 2128 kB |      2179072
 public      | results_temp               | 1176 kB |      1204224
 public      | spatial_ref_sys_pkey       | 192 kB  |       196608
 public      | projects_pkey              | 72 kB   |        73728

The current structure of the results table contains highly redundant information. project_id, group_id and user_id, timestamp, start_time, end_time are stored for all tasks of a group although they are identical for all tasks.

CREATE TABLE IF NOT EXISTS results (
    project_id varchar,
    group_id varchar,
    user_id varchar,
    task_id varchar,
    "timestamp" timestamp,
    start_time timestamp DEFAULT NULL,
    end_time timestamp DEFAULT NULL,
    result int,
    PRIMARY KEY (project_id, group_id, task_id, user_id),
    FOREIGN KEY (project_id) REFERENCES projects (project_id),
    FOREIGN KEY (project_id, group_id) REFERENCES groups (project_id, group_id),
    FOREIGN KEY (project_id, group_id, task_id) REFERENCES tasks (project_id, group_id, task_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

We could introduce another table called group_results. This would also be closer to the structure we already use in Firebase. With this table we could also do a lot of things for which we don't need the actual result value. E.g. we could calculate the overall progress of a group, the number of users over time, the total number of results etc. And this table would be ~100 times smaller, assuming that a group contains about 100 tasks. This could somehow like this.

CREATE TABLE IF NOT EXISTS group_results (
    result_id int,
    project_id int,
    group_id int,
    user_id int,
    "timestamp" timestamp,
    session_length int
    result_count int,
    PRIMARY KEY (result_id),
    FOREIGN KEY (project_id) REFERENCES projects (project_id),
    FOREIGN KEY (project_id, group_id) REFERENCES groups (project_id, group_id),
    FOREIGN KEY (project_id, group_id, task_id) REFERENCES tasks (project_id, group_id, task_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

CREATE TABLE IF NOT EXISTS results (
    result_id int,
    task_id int,
    result int,
    PRIMARY KEY (result_id, task_id),
    FOREIGN KEY (result_id) REFERENCES group_results (result_id)
);

Finally, it might be beneficial to use table partitioning. This could improve performance for queries that need to access data only for specific projects. This is a very common task for us as we usually extract and process results on a project basis.

This link has some interesting comments on saving space

[ ] write down how we want to query the database

Proposed steps to address this issues:

[x] 0. Build test dataset -> dev database is test dataset
[x] 1. adjust init_db: all tables get new integer "postgres"id, the other "firebase"id should optimally be only save once here, but should never be referenced directly
[x] 2. check if create Project on workers is affected by new columns (project_id, prob not, but group_id)
[x] 3. in transfer_results/save_results_to_postgres in lowest query: adjust to new schema with a join (probably)
[ ] 4. overhaul export module to fit new data schema
[ ] 5. write Script which transforms old schema to new schema in new db

maby test if indices on multiple columns improve performance, e.g.:

CREATE INDEX tasks_idx ON tasks using btree  (project_id, group_id, ?task_id?);

-> queries should mostly involve all the id columns, right? MultiColumn indices are also usable for only one of the columns, but the leftmost column is always the best one

2end meeting notes:

[x] leave taskid as varchar, no seperate firebase from postgres id in this case
[x] -> this means we dont need to join on task in insert results but instead on groups
[x] remove extra indices -> see table below, mostly multicolumn indexes (the pkeys) are used, rest is not necessary

 relid | indexrelid | schemaname |     relname     |        indexrelname        |  idx_scan  | idx_tup_read | idx_tup_fetch
-------+------------+------------+-----------------+----------------------------+------------+--------------+---------------
 18015 |      18021 | public     | results         | results_pkey               | 1337110083 |  85961952477 |   84673455153
 17971 |      17977 | public     | groups          | groups_pkey                |  789246169 |    863869261 |     863869261
 17963 |      17969 | public     | projects        | projects_pkey              |  786261209 |    791482289 |     787827285
 17986 |      17992 | public     | tasks           | tasks_pkey                 |  749112873 |    750198518 |     750198518
 18007 |     554848 | public     | users           | users_userid               |  732340918 |    732340918 |     732333934
 18015 |      18043 | public     | results         | results_projectid          |     114568 |  59548210026 |   58688639078
 18007 |      18013 | public     | users           | users_pkey                 |      12547 |            0 |             0
 18015 |     549882 | public     | results         | results_timestamp_date_idx |       2199 |   6276491604 |    5728927495
 18015 |      18046 | public     | results         | results_userid             |       1486 |     87994290 |      10731281
 17986 |      18006 | public     | tasks           | tasks_projectid            |       1107 |     53932047 |      53932047
 16694 |      16701 | public     | spatial_ref_sys | spatial_ref_sys_pkey       |        654 |          654 |           654
 17971 |      17985 | public     | groups          | groups_goupid              |        405 |       247659 |        183330
 18015 |      18044 | public     | results         | results_groupid            |         12 |           12 |            12
 17971 |      17984 | public     | groups          | groups_projectid           |         12 |           12 |            12
 18015 |      18045 | public     | results         | results_taskid             |          3 |           27 |            27
 17986 |      18005 | public     | tasks           | tasks_groupid              |          0 |            0 |             0
 17986 |      18004 | public     | tasks           | tasks_task_id              |          0 |            0 |             0
(17 rows)

Tests: Setup new schemas and fill with dev data:

[ ] dont replace ids with ints, but split results into group_results and results
[ ] switch project and group id with int

-> look at difference of size and performance of accesses.

Updating these findings as of today, just to be sure we're chasing the right problem:

Table and index sizes

SELECT
  schema_name,
  relname,
  pg_size_pretty(table_size) AS size,
  table_size
FROM (
       SELECT
         pg_catalog.pg_namespace.nspname           AS schema_name,
         relname,
         pg_relation_size(pg_catalog.pg_class.oid) AS table_size
       FROM pg_catalog.pg_class
         JOIN pg_catalog.pg_namespace ON relnamespace = pg_catalog.pg_namespace.oid
     ) t
WHERE schema_name = 'public' AND table_size > 0 ORDER BY table_size DESC;

gives

 schema_name |          relname           |    size    |  table_size
-------------+----------------------------+------------+--------------
 public      | results                    | 141 GB     | 151915610112
 public      | results_pkey               | 102 GB     | 109276037120
 public      | results_userid             | 78 GB      |  83832397824
 public      | tasks                      | 76 GB      |  81614389248
 public      | results_projectid          | 62 GB      |  66519474176
 public      | results_groupid            | 39 GB      |  42159013888
 public      | results_timestamp_date_idx | 39 GB      |  41553936384
 public      | tasks_pkey                 | 21 GB      |  22915842048
 public      | tasks_projectid            | 8546 MB    |   8960737280
 public      | groups                     | 135 MB     |    141942784
 public      | groups_pkey                | 72 MB      |     75096064
 public      | groups_goupid              | 43 MB      |     45096960
 public      | groups_projectid           | 24 MB      |     25591808
 public      | results_temp_new           | 10 MB      |     10625024
 public      | users                      | 5800 kB    |      5939200
 public      | users_pkey                 | 4904 kB    |      5021696
 public      | spatial_ref_sys            | 4584 kB    |      4694016
 public      | users_userid               | 4392 kB    |      4497408
 public      | projects                   | 2464 kB    |      2523136
 public      | spatial_ref_sys_pkey       | 192 kB     |       196608
 public      | projects_pkey              | 80 kB      |        81920
 public      | results_temp               | 8192 bytes |         8192
(22 rows)

Put another way the results table weighs 140GB + 320GB of indexes on it.

Checking the usage of these, we get:

SELECT psai.* FROM pg_stat_all_indexes psai LEFT JOIN pg_index i ON i.indexrelid =psai.indexrelid WHERE psai.schemaname = 'public' ORDER BY psai.idx_scan desc;

gives

 relid | indexrelid | schemaname |     relname     |        indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+-----------------+----------------------------+----------+--------------+---------------
 17963 |      17969 | public     | projects        | projects_pkey              | 24146006 |     67657203 |      44581343
 17971 |      17977 | public     | groups          | groups_pkey                | 24024285 |     82795251 |      78013996
 18015 |      18021 | public     | results         | results_pkey               | 13490408 |        18344 |          3684
 17986 |      17992 | public     | tasks           | tasks_pkey                 | 13480209 |     13712283 |      13667019
 18007 |     554848 | public     | users           | users_userid               | 13457413 |     13457413 |      13457413
 18015 |      18043 | public     | results         | results_projectid          |    89532 |  27161546942 |   27161531590
 18007 |      18013 | public     | users           | users_pkey                 |     3687 |            0 |             0
 18015 |     549882 | public     | results         | results_timestamp_date_idx |     3600 |     35109505 |      35104591
 16694 |      16701 | public     | spatial_ref_sys | spatial_ref_sys_pkey       |     1800 |         1800 |          1800
 17986 |      18006 | public     | tasks           | tasks_projectid            |      167 |      3947303 |       3924671
 17971 |      17985 | public     | groups          | groups_goupid              |       57 |        27253 |         27099
 18015 |      18044 | public     | results         | results_groupid            |        0 |            0 |             0
 18015 |      18046 | public     | results         | results_userid             |        0 |            0 |             0
 17971 |      17984 | public     | groups          | groups_projectid           |        0 |            0 |             0
(14 rows)

We can drop:

results_userid - 78GB
results_projectid - 62GB
results_groupId - 39GB
result_timestamp_date_idx - 39GB TOTAL: ~220GB off

mapswipe / python-mapswipe-workers

Adjust postgres DB schema and reduce storage needed #529

Table and index sizes