mikeizbicki / pgrollup

easy creation of rollup tables in postgresql (compute count(*) queries in constant time)

2021summer #1

Open mikeizbicki opened 3 years ago

mikeizbicki commented 3 years ago

Our big-picture goal for this project is to make the pgrollup library support more types of materialized views. In particular, I'd like it to support outer joins and subqueries. It's a reasonably complicated software project, however, so there are a number of tasks that I'll have you do to get more familiar with the code base before starting on these larger projects.

Task 1:

Read the following article: https://hashrocket.com/blog/posts/materialized-view-strategies-using-postgresql It describes several different ways of maintaining materialized views in postgres. Notice that they all have significant drawbacks, and pay attention to how the trigger-based views are implemented. The purpose of this library is to automate the trigger-based techniques and make them much more efficient.

Task 2:

There are lots of test cases in the sql folder that you'll eventually have to get familiar with. Make sure that you can run the test cases with the commands:

$ docker-compose up -d --build
$ docker-compose exec db make installcheck

All the tests should pass.

The simplest test cases are in sql/test_types_basic.sql so we'll start looking there. In order to understand the tests, you'll also need to read and understand the assert_rollup function in the file pgrollup--1.0.sql. Basically, whenever a materialized view is created, the pgrollup extension creates a lot of relations. Two of them are {rollup_name}_groundtruth and {rollup_name}, where {rollup_name} is the name of the materialized view. The _groundtruth relation is a view that recomputes the entire relation from scratch every time it is accessed. This is inefficient, but guaranteed to be correct. The {rollup_name} relation is a table that is automatically maintained to have the correct values. It is efficient, but potentially incorrect if the code is wrong. The assert_rollup function checks that these two relations match.

Convince yourself that the materialized views in the test cases are being automatically updated without having to call refresh materialized view.

Task 3:

In this task, we'll look at the actual code generated by the pgrollup library to maintain the materialized view. By default, as soon as the extension is loaded, all materialized views are automatically managed by the library. First, we'll have to disable this behavior. Modify the sql/test_types_basic.sql file so that the following line runs immediately after the pgrollup extension is loaded:

update pgrollup_settings set value='manual' where name='create_rollup';

This will cause the rollups to not be created automatically. If you rerun the test cases, you'll see that the test case for test_types_basic.sql now fails because assert_rollup does not succeed. (Look in the results/ folder or the regression.diffs file to see the failure.)

You can manually create rollups with commands like

select pgrollup_convert('test_rollup1');

called after creating the materialized view.

In order to see the code generated by this command, run the following commands instead:

SET client_min_messages TO NOTICE;
select pgrollup_convert('test_rollup1', dry_run => True);

You should see a NOTICE output containing about 300 lines of SQL code, which is the code generated by the pgrollup library during normal operation. If you look closely, this code will contain lots of triggers (similar to the code you saw in task 1), and the definitions of the relations used by assert_rollup (that you saw in task 2).

Task 4:

This is the first task where you'll actually have to change library code. Currently, the library requires python 3.6+, and we want to make it work with python 3.5. The reason for this is that the docker images for postgres 10 and 11 use python 3.5, and I want the library to work on these older versions of postgres too.

First, edit the file .github/workflows/tests.yml so that the POSTGRES_VERSION variable includes versions 10 and 11. Push to github and notice that these versions fail the test cases because of the old python version.

Then, upgrade the code so that it supports the older version of python. The main feature of python 3.6 that is currently being used is f-strings. Basically, all code that looks something like

var = f'this is a {variable}'

will need to be replaced by something like

var = 'this is a '+variable

F-strings are currently used pretty extensively, so this will require changing quite a bit of code. Make sure to regularly run the test cases (locally, not on GitHub Actions) to ensure that they still pass. This will be our check that you've done all the modifications correctly.
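One caveat: f-strings can contain arbitrary expressions, which don't always translate cleanly to plain concatenation; str.format (available in every Python 3 version) is a safe mechanical replacement. A small illustration (variable names are made up):

```python
# Converting a python 3.6+ f-string to python 3.5-compatible forms.
name = 'test_rollup1'
n = 2

# python 3.6+ f-string (the style currently used in the library):
s_f = f'select * from {name}_{n + 1};'

# python 3.5-compatible replacements:
s_format = 'select * from {}_{};'.format(name, n + 1)        # str.format
s_concat = 'select * from ' + name + '_' + str(n + 1) + ';'  # concatenation

# all three produce identical strings:
assert s_f == s_format == s_concat == 'select * from test_rollup1_3;'
```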

mikeizbicki commented 3 years ago

@valenxie I forgot to mention before that Monday is a holiday in the US, and so we won't be meeting tomorrow. We'll have our next meeting on Tuesday after the paper reading.

valenxie commented 3 years ago

Hi Mike @mikeizbicki, I've done the f-string modifications and all local tests pass, though GitHub Actions is still failing. Any suggestions for debugging? Here is my repo: https://github.com/valenxie/pgrollup

mikeizbicki commented 3 years ago

Next task: do Lark tutorial: https://lark-parser.readthedocs.io/en/latest/json_tutorial.html

mikeizbicki commented 3 years ago

The overall control flow of the library is:

  1. The function parse_create in pgrollup/parser.py is used to extract a dictionary of important info from an input sql statement.
  2. The class Rollup takes this dictionary as input and outputs the SQL code that generates the actual rollup tables.
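To make the two-step flow concrete, here's a toy Python sketch (not the real pgrollup code; the regex, dict keys, and function names are all simplified stand-ins) of parsing a statement into a dictionary and then generating SQL from that dictionary:

```python
import re

def parse_create(sql):
    # hypothetical, greatly simplified version of step 1: extract the view
    # name, select list, and source table from a CREATE MATERIALIZED VIEW
    m = re.search(
        r'create\s+materialized\s+view\s+(\w+)\s+as\s+select\s+(.+?)\s+from\s+(\w+)',
        sql, re.IGNORECASE | re.DOTALL)
    return {
        'rollup_name': m.group(1),
        'columns': [c.strip() for c in m.group(2).split(',')],
        'table': m.group(3),
    }

def generate_rollup_sql(info):
    # hypothetical stand-in for step 2 (the Rollup class): render SQL
    # for the rollup table from the extracted dictionary
    return 'CREATE TABLE {} AS SELECT {} FROM {};'.format(
        info['rollup_name'], ', '.join(info['columns']), info['table'])

info = parse_create('CREATE MATERIALIZED VIEW test_rollup1 AS SELECT count(*) FROM tbl')
print(generate_rollup_sql(info))
```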

Currently, there are bugs in step 1 that cause the code to not handle all select statements. I want us to modify the parsing code so that it uses the parsing code built into postgres, which will let us handle arbitrary select statements. The idea is to use the function raw_parser as the input to the parse_create function instead of the raw sql query. You can get some example output by running the command

select raw_parser('
    SELECT
        sum(num)
    FROM testjoin1
    JOIN testjoin2 USING (id)
    GROUP BY name
;');

in psql.

Your tasks:

  1. Create a new file pgrollup/parser_pg.py
  2. For each test case sqlX in the pgrollup/parser.py file, run the raw_parser command to create a new test case in the new format, and add this test case as a variable with the same name into the pgrollup/parser_pg.py file.
  3. Create a new function called parse_create in the new file. For each test case, you should ensure that the new function generates the same output as the original function. Writing this function will require using the Lark library to parse the output of raw_parser.

valenxie commented 3 years ago

Hi Mike @mikeizbicki, GitHub Actions is still failing but passes if I change the Postgres version to 12 & 13. One of the logs shows:

```
/usr/include/postgresql/10/server/c.h:81:19: fatal error: stdio.h: No such file or directory
 #include <stdio.h>
               ^
compilation terminated.
```

mikeizbicki commented 3 years ago

Thanks, I'll fix it to work on the older versions of postgres from here. Go ahead and submit the pull request without the modified github actions.

(And in the future, you need to use triple backticks to quote multiline text code blocks with proper formatting.)

valenxie commented 3 years ago

Hi Mike @mikeizbicki, so I tried to update the variable inside the dictionary this way:

larg = _getjoins(info['larg'])
rarg = _getjoins(info['rarg'])
larg[1]['join_type'] = _gettype(info)
rarg[0]['join_type'] = _gettype(info)
join_infos.append(larg)
join_infos.append(rarg)
return join_infos

The modification is based on the fact that _getjoins(info['larg'])[0][0]['join_type'] currently returns 'FROM'. However, the code raises KeyError: 0, even though 'join_type' exists as a key in the dict. Would you give me a hint on solving this bug? Thank you.
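(As an aside on the error message itself: KeyError: 0 means an integer key lookup failed, which usually indicates a dict was subscripted as if it were a list, so the failing subscript is likely one of the `[0]`/`[1]` lookups rather than the `'join_type'` lookup. A minimal reproduction, with made-up data:)

```python
# KeyError: 0 arises when the object being subscripted is a dict,
# so 0 is treated as a (missing) key rather than a list index.
info = {'join_type': 'FROM'}

# list index first, then dict key: works fine
assert [info][0]['join_type'] == 'FROM'

# dict indexed with 0: raises KeyError: 0 even though the dict is non-empty
try:
    info[0]
    raised = False
except KeyError as e:
    raised = (e.args[0] == 0)
assert raised
```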

valenxie commented 3 years ago

Hi Mike @mikeizbicki, I think I messed things up when trying to squash my commits. git rebase -i HEAD~5 used to show the commits that I've made, but after I made some wrong edits I ran rm -fr ".git/rebase-merge". Now git rebase -i HEAD~5 returns

pick fd4fcb4 change name to pgrollup
pick e8642c6 add notes
pick 09d6b74 fix tests for pg12; add c function to extract parse tree
pick 1fe25bb slightly improved parsing of group by clause
pick 904f19f adjust test case sensitivity for datasketches

which seem to be commits that you made. I'm not sure what to do at this point. I sent out a pull request that contains parser_pg.py; please look at that first, and I will fix the squashing issue along with the stylistic changes.

valenxie commented 3 years ago

Hi Mike @mikeizbicki, today when I was trying to make the edits, the pgrollup folder in my terminal prompt was followed by (detached HEAD %|REBASE-i), and then I realized my parser_pg.py was completely gone. I know the problem comes from the commit-squashing issue, which I forgot to mention yesterday. I googled to see if there are any working solutions, but I wanted to make the changes only after consulting with you, in case I mess things up again. One of the posts (https://stackoverflow.com/questions/5772192/how-can-i-reconcile-detached-head-with-master-origin) suggests creating a new branch and updating the master branch, but after running git diff master temp I see it returning a few, but not all, of the f-string changes that I've made, for example:

@@ -694,13 +694,13 @@ RETURNS TEXT AS $$
             ''')
         res = plpy.execute(sql)
         t_oid = res.coltypes()[0]
-        sql = 'select typname,typlen from pg_type where oid='+str(t_oid)+' limit 1;' 
+        sql = f'select typname,typlen from pg_type where oid={t_oid} limit 1;'
         row = plpy.execute(sql)[0]
         return row

At this point I don't feel it's proper to just point master to this new branch but I'm not sure what the next step should be.

mikeizbicki commented 3 years ago

Our next task is to create some benchmarks that measure how much overhead the pgrollup library adds to insert statements created on a table. We'll use the pgbench utility for this purpose (see https://www.postgresql.org/docs/13/pgbench.html for docs).

You'll need to create two script files.

  1. The init.sql file will install the extensions, create the data tables, and create the rollups.
  2. The transact.sql file will run a series of inserts and updates onto the original table (not the rollup). This file will be run many times by pgbench and it will measure how long it takes to run each time.

Adding more rollup tables to the init.sql file will result in a slowdown because the database must do more processing to maintain the rollups; our goal is to measure exactly how large that slowdown is. Ultimately, we would like several graphs where the y-axis is transactions per second (TPS, computed by pgbench) and the x-axis is some parameter such as the number of rollup tables or the number of columns in the rollup tables.

Each point in the plot will correspond to an individual run of pgbench on different init.sql/transact.sql files. So eventually we will need an automated system that creates these files for us according to certain parameters. But for now, just hand-code init.sql and transact.sql files based on the tests in sql/test_types_float.sql, and get pgbench to successfully run them.
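Eventually that automated generator might look something like the following sketch (hypothetical table names, column choices, and file layout; the real SQL should be modeled on sql/test_types_float.sql):

```python
# Hypothetical generator for parameterized benchmark files: writes an
# init.sql with `num_rollups` rollup tables over one data table, and a
# transact.sql that inserts rows into the underlying table.

def make_init_sql(num_rollups):
    lines = [
        'CREATE EXTENSION IF NOT EXISTS pgrollup;',
        'CREATE TABLE bench_data (id serial, x float);',
    ]
    for i in range(num_rollups):
        lines.append(
            'CREATE MATERIALIZED VIEW bench_rollup{} AS '
            'SELECT count(*), sum(x) FROM bench_data;'.format(i))
    return '\n'.join(lines)

def make_transact_sql(num_inserts):
    # each pgbench transaction runs these inserts against the base table
    return '\n'.join(
        'INSERT INTO bench_data (x) VALUES (random());'
        for _ in range(num_inserts))

# e.g. write the files that pgbench will consume:
# open('bench/init.sql', 'w').write(make_init_sql(4))
# open('bench/transact.sql', 'w').write(make_transact_sql(10))
```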

You should create a new folder bench and put all your code in this folder. You'll also need to create a Dockerfile and docker-compose.yml file that will run the benchmarks.

We can talk more in detail about this tomorrow.