mikeizbicki opened this issue 3 years ago
@valenxie I forgot to mention before that Monday is a holiday in the US, and so we won't be meeting tomorrow. We'll have our next meeting on Tuesday after the paper reading.
Hi Mike @mikeizbicki, I've finished the f-string modifications and all local tests pass, though the GitHub action is still failing. Any suggestions for debugging? Here is my repo: https://github.com/valenxie/pgrollup
Next task: work through the Lark tutorial: https://lark-parser.readthedocs.io/en/latest/json_tutorial.html
The overall control flow of the library is:

1. `parse_create` in `pgrollup/parser.py` is used to extract a dictionary of important info from an input SQL statement.
2. `Rollup` takes this dictionary as input and outputs the SQL code that generates the actual rollup tables.

Currently, there are bugs in step 1 that cause the code to not handle all select statements. I want us to modify the parsing code so that it uses the parsing code built into postgres, which will let us handle arbitrary select statements. The idea is to use the output of the function `raw_parser` as the input to the `parse_create` function instead of the raw SQL query. You can get some example output by running the command
```sql
select raw_parser('
SELECT
    sum(num)
FROM testjoin1
JOIN testjoin2 USING (id)
GROUP BY name
;');
```
in psql.
Your tasks:

1. Create a new file `pgrollup/parser_pg.py`.
2. For each test case `sqlX` in the `pgrollup/parser.py` file, run the `raw_parser` command to create a new test case in the new format, and add this test case as a variable with the same name into the `pgrollup/parser_pg.py` file.
3. Write a new version of `parse_create` in the new file. For each test case, you should ensure that the new function generates the same output as the original function. Writing this function will require using the Lark library to parse the output of `raw_parser`.
Hi Mike @mikeizbicki, the GitHub action is still failing, but it passes if I change the Postgres version to 12 and 13. One of the logs shows:

```
/usr/include/postgresql/10/server/c.h:81:19: fatal error: stdio.h: No such file or directory
                   ^
compilation terminated.
```
Thanks, I'll fix it to work on the older versions of postgres from here. Go ahead and submit the pull request without the modified github actions.
(And in the future, you need to use triple backticks to quote multiline text code blocks with proper formatting.)
Hi Mike @mikeizbicki, so I tried to update the variable inside the dictionary this way:
```python
larg = _getjoins(info['larg'])
rarg = _getjoins(info['rarg'])
larg[1]['join_type'] = _gettype(info)
rarg[0]['join_type'] = _gettype(info)
join_infos.append(larg)
join_infos.append(rarg)
return join_infos
```
The modification is based on the fact that `_getjoins(info['larg'])[0][0]['join_type']` currently returns `'FROM'`. However, the code raises `KeyError: 0` even though `'join_type'` exists as a key in the dict. Would you give me some hints on solving this bug? Thank you.
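A minimal reproduction of that error type, with a stand-in dict rather than the real parse output:

```python
# Stand-in dict, not the real parse output. Indexing a dict with an
# integer that is not one of its keys raises KeyError (not IndexError),
# which is how a "KeyError: 0" can appear even though 'join_type' is
# present -- the object being indexed is a dict, not a list.
larg = {'join_type': 'FROM'}
try:
    larg[0]['join_type'] = 'INNER'
except KeyError as e:
    print(repr(e))  # KeyError(0)
```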
Hi Mike @mikeizbicki,
I think I've messed things up when trying to squash my commits. `git rebase -i HEAD~5` used to give me the commits that I've made, but since I made some wrong edits I ran `rm -fr ".git/rebase-merge"`. Right now `git rebase -i HEAD~5` returns
```
pick fd4fcb4 change name to pgrollup
pick e8642c6 add notes
pick 09d6b74 fix tests for pg12; add c function to extract parse tree
pick 1fe25bb slightly improved parsing of group by clause
pick 904f19f adjust test case sensitivity for datasketches
```
which seem to be commits that you have made; I'm not sure what to do at this point. I sent out a pull request that contains `parser_pg.py`; please look at that first, and I will fix the squashing issue along with the stylistic changes.
Hi Mike @mikeizbicki,
Today when I was trying to make the edits, the `pgrollup` folder prompt in my terminal was followed by `(detached HEAD %|REBASE-i)`, and then I realized my `parser_pg.py` was completely gone. I know the problem comes from the commit squashing issue which I forgot to mention yesterday, and I googled for working solutions, but I wanted to make the changes after consulting with you in case I mess things up again. One of the posts (https://stackoverflow.com/questions/5772192/how-can-i-reconcile-detached-head-with-master-origin) suggests creating a new branch and updating the master branch, but after running `git diff master temp` I see it returning a few, but not all, of the changes that I've made on f-strings, for example:
```diff
@@ -694,13 +694,13 @@ RETURNS TEXT AS $$
     ''')
     res = plpy.execute(sql)
     t_oid = res.coltypes()[0]
-    sql = 'select typname,typlen from pg_type where oid='+str(t_oid)+' limit 1;'
+    sql = f'select typname,typlen from pg_type where oid={t_oid} limit 1;'
     row = plpy.execute(sql)[0]
     return row
```
At this point I don't feel it's proper to just point master at this new branch, but I'm not sure what the next step should be.
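One common recovery path is to find the lost commit (e.g. with `git reflog`) and branch from it, since interrupted rebases rarely delete objects. A self-contained sketch in a throwaway repo; the file name is a stand-in for `parser_pg.py`, and on a real repo you would take the commit hash from the reflog instead of saving it in a variable:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email you@example.com
git config user.name you
echo "parser code" > parser_pg.py
git add parser_pg.py
git commit -qm "add parser_pg.py"
lost=$(git rev-parse HEAD)            # on a real repo: find this via `git reflog`
# Simulate "losing" the commit: move to history that doesn't contain it
git checkout -q --orphan empty
git rm -qrf .
git commit -qm "empty" --allow-empty
# Recovery: the commit still exists in the object store, so branch from it
git checkout -qb recovery "$lost"
test -f parser_pg.py && echo "recovered"
```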
Our next task is to create some benchmarks that measure how much overhead the `pgrollup` library adds to insert statements run against a table. We'll use the `pgbench` utility for this purpose (see https://www.postgresql.org/docs/13/pgbench.html for docs).
You'll need to create two script files:

- The `init.sql` file will install the extensions, create the data tables, and create the rollups.
- The `transact.sql` file will run a series of inserts and updates on the original table (not the rollup). This file will be run many times by `pgbench`, which will measure how long it takes to run each time.

Adding more rollup tables to the `init.sql` file will result in a slowdown because the database must do more processing to maintain the rollups; our goal is to measure exactly how much that slowdown will be. Ultimately, we would like several graphs where the y-axis is transactions per second (TPS, computed by `pgbench`) and the x-axis is either the number of rollup tables, the number of columns in the rollup tables, etc.
Each point in the plot will correspond to an individual run of `pgbench` on a different pair of `init.sql`/`transact.sql` files. So eventually we will need an automated system for creating these files for us that satisfies certain parameters. But for now, just try to hand-code `init.sql` and `transact.sql` files that are based off of the tests in `sql/test_types_float.sql`, and get `pgbench` to successfully run these files.
You should create a new folder `bench` and put all your code in this folder. You'll also need to create a `Dockerfile` and `docker-compose.yml` file that will run the benchmarks.
We can talk more in detail about this tomorrow.
Our big-picture goal for this project is to make this pgrollup library support more types of materialized views. I'd in particular like to get it to support outer joins and subqueries. It's a reasonably complicated software project, however, so there's a number of tasks that I'll have you do to get more familiar with the code base before starting on these larger projects.
Task 1:
Read the following article: https://hashrocket.com/blog/posts/materialized-view-strategies-using-postgresql It describes several different ways of maintaining materialized views in postgres. Notice that they all have lots of drawbacks, and pay attention to how the trigger based views are implemented. The purpose of this library is to automate the trigger based techniques and make them much more efficient.
Task 2:
There are lots of test cases in the `sql` folder that you'll eventually have to get familiar with. Make sure that you can run the test cases with the commands. All the tests should pass.
The simplest test cases are in `sql/test_types_basic.sql`, so we'll start looking there. In order to understand the tests, you'll also need to read and understand the `assert_rollup` function in the file `pgrollup--1.0.sql`. Basically, whenever a materialized view is created, the `pgrollup` extension creates a lot of relations. Two of them are `{rollup_name}_groundtruth` and `{rollup_name}`, where `{rollup_name}` is the name of the materialized view. The `_groundtruth` relation is a view that recomputes the entire relation from scratch every time it is accessed. This is inefficient, but guaranteed to be correct. The `{rollup_name}` relation is a table that is automatically maintained to have the correct values. It is efficient, but potentially incorrect if the code is wrong. The `assert_rollup` function checks that these two relations match.

Convince yourself that the materialized views in the test cases are being automatically updated without having to call `refresh materialized view`.
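Conceptually, the comparison that `assert_rollup` makes can be written as a symmetric difference. This is a sketch with an illustrative view name `example`, not the function's actual implementation from `pgrollup--1.0.sql`:

```sql
-- Rows in one relation but not the other; an empty result means the
-- incrementally-maintained table matches the recomputed ground truth.
(SELECT * FROM example EXCEPT SELECT * FROM example_groundtruth)
UNION ALL
(SELECT * FROM example_groundtruth EXCEPT SELECT * FROM example);
```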
Task 3:
In this task, we'll look at the actual code generated by the `pgrollup` library in order to maintain the materialized view. By default, as soon as the extension is loaded, all materialized views will automatically be managed by the library. First, we'll have to disable this behavior. Modify the `sql/test_types_basic.sql` file so that the following line is run immediately after the `pgrollup` extension is loaded; this will cause the rollups to not be created automatically.

If you rerun the test cases, you'll see that the test case for `test_types_basic.sql` now fails because `assert_rollup` does not succeed. (Look in the folder `results/` or the file `regressions.diff` to see the failure.) You can manually create rollups with a command called after creating the materialized view.
In order to see the code generated by this command, run the following commands instead:
You should see a `NOTICE` output containing about 300 lines of SQL code, which is the code generated by the `pgrollup` library during normal operation. If you look closely, this code will contain lots of triggers (similar to the code you saw in task 1), and the definitions of the relations used by `assert_rollup` (that you saw in task 2).

Task 4:
This is the first task where you'll actually have to change library code. Currently, the library requires python 3.6+, and we want to make it work with python 3.5. The reason for this is that the docker images for postgres 10 and 11 use python 3.5, and I want the library to work on these older versions of postgres too.
First, edit the file `.github/workflows/tests.yml` so that the `POSTGRES_VERSION` variable includes versions 10 and 11. Push to github and notice that these versions fail the test cases because of the old python version.

Then, upgrade the code so that it supports the older version of python. The main feature of python 3.6 that is currently being used is called f-strings. Basically, all code that uses an f-string will need to be replaced by something like an equivalent `str.format` call.
F-strings are currently used pretty extensively, so it will require that you change quite a bit of code. Make sure to regularly run the test cases (locally, not on github actions) to ensure that they still pass. This will be our check that you've done all the modifications correctly.
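A minimal sketch of the kind of rewrite involved, reusing the query string from the diff earlier in this thread. `str.format` is one python 3.5-compatible replacement (`%`-formatting is another); the snippet is illustrative, not actual pgrollup code:

```python
# Illustrative only -- not actual pgrollup code.
t_oid = 23

# python 3.6+ f-string (what the code currently uses):
new = f'select typname,typlen from pg_type where oid={t_oid} limit 1;'

# python 3.5-compatible replacement using str.format:
old = 'select typname,typlen from pg_type where oid={} limit 1;'.format(t_oid)

assert new == old
print(old)
```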