Closed zbynekwinkler closed 6 years ago
My preferred solution would involve
No magic means slightly more work when implementing, but it results in an approachable code base, allows for easier review (requires less context from the reviewer), and enables straightforward reasoning about performance (big O).
Read-optimized (big O) lets us ignore the performance details of Postgres. For anyone interested, I suggest reading several chapters of PostgreSQL 9.0 High Performance to see everything that can go wrong, for example in query planning, without proper maintenance of the db. The more complex the query, the more can (will?) go wrong.
Storing user actions allows us to restore the current state into whatever schema we choose. It guards us against errors in the code, business logic, etc. A generic table with timestamp, participant, and data (probably hstore) columns would be my preferred way.
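A minimal sketch of what such a table might look like (the table and column names here are illustrative, not a final schema):

```sql
-- Hypothetical generic event log: one row per user action.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE events
( ts           timestamptz  NOT NULL DEFAULT CURRENT_TIMESTAMP
, participant  text         NOT NULL
, payload      hstore       NOT NULL   -- the action and its parameters
);
```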
On the other hand I do not want to throw out the baby with the bath water so let's discuss :smile:.
I've asked @zwn to lead the charge on this one. IRC
Keep scrolling on that IRC link, basically the single events table design sounds sensible to me. The reason we have multiple logging tables (see #1502 for details) is historical accident, not intentional design.
Anyway, @zwn is in charge of this now.
I have gone over all (or maybe just most?) pages we have and noted what data is required where (just reading data, not writing it). Would something like this fit somewhere in the developer documentation? Feel free to comment on/improve the list. It should form a list of requirements for the data layer. This is what should be optimized for reading. Did I miss anything?
I've studied materialized views a bit to see what could be done with them. The problem seems to be that the view cannot yet be refreshed concurrently. This means that the refresh locks the view. If we used materialized views for the homepage tables, we would not be able to query them while the refresh is running. It is the same behavior as using TRUNCATE while rebuilding the tables (https://github.com/gittip/www.gittip.com/issues/1541#issuecomment-26535261). Refreshing the view concurrently is planned only for Postgres 9.4.
Oh well. I guess we wait. Can you tell from past history when 9.4 is likely to be out?
I've been exploring what can be done with triggers and the rules system in 50e298c and e9572b5.
I see several possible ways to implement the log table:
The advantage of doing this at the db level is that handmade changes made through psql are logged too. The disadvantage is that we cannot tie the action to a signed-in user. A further disadvantage is that writing triggers in plpgsql means yet another language to deal with. This could be somewhat mitigated by using plpython, but that is not possible on the current Heroku offering. So I am starting to lean towards the application-level implementation, on the grounds that it is going to be easier to maintain.
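For reference, the db-level variant could look roughly like this generic audit trigger (a sketch only; the table layouts are made up for the example):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE participants (username text PRIMARY KEY, balance numeric DEFAULT 0);
CREATE TABLE events (ts timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP, payload hstore NOT NULL);

-- Log every insert/update on participants, including ones made by hand in psql.
CREATE FUNCTION log_event() RETURNS trigger AS $$
BEGIN
    INSERT INTO events (payload) VALUES (hstore(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER participants_log
    AFTER INSERT OR UPDATE ON participants
    FOR EACH ROW EXECUTE PROCEDURE log_event();
```

Note that such a trigger cannot see which signed-in user made the change, which is the disadvantage mentioned above.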
I'd appreciate any feedback for and/or against any of the variants.
@zwn What does it look like if we invert the pattern and write to the single event table first as the canonical source, and then work outwards from there to update the normalized tables as downstream consumers?
@whit537 I haven't tried to implement it yet. I got somewhat scared by plpgsql and EXECUTE that would have to be used to build and execute the queries. I am much more comfortable in python.
Do you think it is worth exploring in more depth?
Here's a few scaling posts from Facebook:
We decided to pursue the idea of writing to a single master event log, with read-optimized tables computed from it. We're thinking in terms of eventually having Python workers asynchronously update the downstream tables (with UI support for this async updating). As a step along the way, though, we're going to look at instrumenting all the places in the app where we currently write to the database so that they also write to the single master event log as part of the same transaction. Then, once we've lived with this for a while, we'll start inverting the app bit by bit to write to the event log in one transaction and update the computed tables in a separate transaction. IRC
I am thinking about how to structure the log table. From one point of view it would be good to structure it as close to the user as possible, since this gives us the most resilience against bugs. If the user 'bob' says he wants to tip user 'alice' $1, we should record it as such. On the other hand, if we support participant renaming, we will not be able to support a general history page like 'I am bob, show me everything I did on gittip'. That leads us to #835. If we translated 'bob' and 'alice' to their ids and recorded those instead, the records would stay valid no matter what 'bob' and 'alice' are called in the future.
I've decided to make the API user-facing (taking 'alice' and 'bob' as params) but to store the internal ids in the log table. It might make sense to reopen #835.
Also, I'd like to convert the various 'toggle' operations to explicit idempotent set operations. That way we don't have to search through all the previous 'toggles' all the way back to the default at the beginning. Fortunately there aren't that many of them.
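The difference can be illustrated with a toy sketch (the event shapes here are made up for illustration, not Gittip's actual API):

```python
def state_from_toggles(events, default=False):
    # With toggles, the current state depends on every event all the
    # way back to the default at the beginning of the log.
    state = default
    for event in events:
        if event == "toggle":
            state = not state
    return state

def state_from_sets(events, default=False):
    # With explicit, idempotent sets, only the most recent event matters.
    for event in reversed(events):
        if event in (True, False):
            return event
    return default

assert state_from_toggles(["toggle", "toggle", "toggle"]) is True
assert state_from_sets([True, False, True]) is True
```

With set events, replay can stop at the newest matching event instead of walking the whole history.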
Should we do #412 and #224 before or after?
What about branch.sql? Some transforms are not that easy to describe in SQL. What about some branch.py? Instead of doing \i branch.sql in the psql prompt we could do heroku run branch.py.
What would be the best way to log the take_over action? Within takeover a lot of tips manipulation is going on. I am thinking about logging the action piece by piece. So if only one of the elsewheres were taken over, it would result in one log item. If it was the last one, the tips zeroing would happen one by one via the generic set_tip operation, as would setting up the new tips. The same goes for absorptions. This has the advantage that all the tips can be reconstructed in a generic way, no matter whether they come from a takeover or not.
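In other words, a takeover would decompose into ordinary set_tip events, along these lines (a toy sketch with made-up data structures, not the real take_over code):

```python
def set_tip(log, tipper, tippee, amount):
    # Every tip change, takeover-related or not, is one generic event.
    log.append({"action": "set_tip", "tipper": tipper,
                "tippee": tippee, "amount": amount})

def take_over(log, tips, absorber, absorbed):
    # tips maps tippee -> {tipper: amount}.
    for tipper, amount in sorted(tips.get(absorbed, {}).items()):
        set_tip(log, tipper, absorbed, 0)       # zero the old tip...
        set_tip(log, tipper, absorber, amount)  # ...and recreate it on the absorber

log = []
take_over(log, {"alice": {"bob": 1}}, absorber="team", absorbed="alice")
```

Replaying the log then reconstructs the final tips without any takeover-specific logic.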
Should we do #412 and #224 before or after?
After, I think. Those two are not as critical as this ticket.
Some transforms are not that easy to describe in SQL.
E.g.? I haven't felt this limitation yet, but I expect the changes we're making here will be our most significant yet.
What about branch.sql? Some transforms are not that easy to describe in SQL. What about some branch.py? Instead of doing \i branch.sql in the psql prompt we could do heroku run branch.py.
This makes our dev build more complicated, because we can't just run schema.sql. We should take that complication into account and try to avoid a branch.py if possible.
This makes our dev build more complicated, because we can't just run schema.sql. We should take that complication into account and try to avoid a branch.py if possible.
I am so far managing without it.
On the other hand, if we support participant renaming, we will not be able to support a general history page like 'I am bob, show me everything I did on gittip'. That leads us to #835. If we translated 'bob' and 'alice' to their ids and recorded those instead, the records would stay valid no matter what 'bob' and 'alice' are called in the future.
Well, the way we handle this now is with an ON UPDATE CASCADE on participants.username. Whenever that column changes, any column referencing it as a foreign key changes along with it.
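A simplified sketch of that pattern (not the actual Gittip schema):

```sql
CREATE TABLE participants (username text PRIMARY KEY);

CREATE TABLE tips
( tipper  text NOT NULL REFERENCES participants (username) ON UPDATE CASCADE
, tippee  text NOT NULL REFERENCES participants (username) ON UPDATE CASCADE
, amount  numeric(35,2) NOT NULL
);

-- Renaming a participant rewrites the referencing rows automatically:
-- UPDATE participants SET username = 'robert' WHERE username = 'bob';
```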
@whit537 I am wrestling with what to do with stuff like participant.py#L574. It creates a direct link between the inner workings of payday, a fairly general participant class, and the current db schema. I'd rather keep the payday algo within the payday files and run as few queries over the log table as possible (it will keep growing, and unbounded queries will get progressively slower over time).
I am considering an alternative implementation. For now I am going with a current tips table, which follows the current tips view but is an actual table. This would allow us to create a copy of it when payday starts and then run the payday algo against that copy.

It would keep payday restartable the way it is now, but we would not have to search through the whole history table to find out which tips were current when payday started. Breaking the link between how payday works and how the data is logged would allow us to try different things in the future, like running payday from a backup db and bringing over just the results (as was suggested somewhere), and also things like running the algo "from memory" https://github.com/gittip/www.gittip.com/issues/1486#issuecomment-27225090.
@zwn Sounds reasonable on the face of it. Would it be helpful to schedule a Hangout to go over your work on this so far? This is a huge refactor and it probably wouldn't hurt for me to keep up with what you're doing rather than waiting until "the end."
@whit537 I've been really busy in my day job lately so not much work has been done. Also I will not be able to get back to this until this coming weekend. I'd like to push something for review on Sunday.
@zwn Thanks for the heads up. As discussed on IRC let's aim to have this landed by the end of December (firefighting and PR merging notwithstanding).
I'd like to push something for review on Sunday.
Monday afternoon/evenings seem to line up to be online at the same time. Maybe we can plan to review together then?
I have an event next Monday evening at 6:30 that I'll have to leave for at 5:30 or so. Hopefully we can spend some time together before that.
@whit537 What would be the best way to deal with make_participant? It bypasses all the calls I've set up in Participant.py to do the logging :(
I am thinking about overriding Participant:set_attributes to be the entry point for logging changes to the participants table.
Ok, so that won't work. Model:set_attributes is used every time a new instance is created, not only when it is changed/updated. Anyway, the idea stays the same, just using a separate set_attributes_logged.
@whit537 What was the design decision behind creating the username_lower column in the db? I have found #503 but I do not see the reason there. Why didn't we add a unique constraint on lower(username) instead?
@zwn Have you done an ack/grep for username_lower? I think it's mostly used for canonicalizing. I.e., when I hit:
Also, I have a friend in business intelligence at www.pnc.com. His kids are starting to come online as db admins and he's asked if I have anything in Gittip for them to work on to get some experience and references. We're setting up a meeting. Let's see if there's any way we can use them here?
@zwn Re: make_participant, it's hard for me to say without understanding how you've implemented logging. It's designed to be a quick and dirty way to get a participant object fixture for a test case. I guess that's a casualty of moving to this new schema design (log everything low-level, build normalized tables downstream). Can I see some code to wrap my head around this?
One thing I've noticed while playing with the log table is that we write to the db really often. When starting with an empty db and signing in with GitHub, I ended up with 19 rows in the log, 13 of them session_expires. Looking at the code, I see that we indeed write to the db on each and every request from an authenticated user.
Unfortunately it is not that obvious what to do instead. If we get a solution for #715 (like using signed cookies), we can use it to encode the expiration time into the cookie value and renew it only once a day.
Actually it wasn't that hard: since we already have the Participant loaded, it is easy to compare against the stored expires and update it only once a day.
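The idea, as a sketch (the lifetime and threshold values are assumed for illustration, not Gittip's actual settings):

```python
from datetime import datetime, timedelta

SESSION_LIFETIME = timedelta(days=7)    # assumed value, for illustration
RENEWAL_THRESHOLD = timedelta(days=1)   # assumed value, for illustration

def should_renew(stored_expires, now):
    # Skip the UPDATE unless the new expiration would move the stored
    # one forward by at least a day.
    return (now + SESSION_LIFETIME) - stored_expires >= RENEWAL_THRESHOLD

now = datetime(2013, 12, 1, 12, 0)
# Renewed an hour ago: no db write on this request.
assert not should_renew(now + SESSION_LIFETIME - timedelta(hours=1), now)
# Renewed two days ago: write once and be done for another day.
assert should_renew(now + SESSION_LIFETIME - timedelta(days=2), now)
```

This turns a write on every authenticated request into at most one write per day per session.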
Actually it wasn't that hard since we already ...
Cool. I was going to suggest that we could also use redis/memcache for storing sessions instead of Postgres, but if you've solved it w/ Postgres that keeps things simpler.
At the risk of making this issue long and convoluted, I'll add some observations that can help us decide on the best way forward. [After rereading, I think I should get a blog :wink:]
I am exploring the best way to calculate the read-optimized tables from the log of actions. Obviously there are some trade-offs: readability of the code, risk of bugs, execution speed, etc.
The solutions above seem to be ordered from fastest to slowest. But is this true? The running time of https://github.com/gittip/www.gittip.com/issues/1118?source=c#issuecomment-29026186 got me wondering whether this assumption is right and what the difference might be, because the solutions are also ordered by resiliency to bugs, from least resilient to most.
The tl;dr conclusion is that simple aggregation is really fast (one sequential scan through a table). Adding joins, subselects, or partial updates can make it crawl. The times I get on my computer are:
The disadvantage of the solutions with recalculation is that their runtime will grow with the amount of data we have (the transfers table has 133k rows, and one seq scan costs around 80ms). If we expect exponential growth of Gittip in 2014, we could quickly run out of options.
Which brings me back to materialized views. We have Postgres 9.3 available, which has them, but refreshing one takes an exclusive lock. If I had to guess, I'd say that to refresh a materialized view Postgres drops the data and recreates it, so the performance should be similar to option 3 above. How fast that is will mostly depend on how fast the query behind the view is. During that time the view would be locked (as we grow, that could soon be a pretty long time).
Another way to implement option 3 is to do the materialized views "by hand" using the "delete from table; insert into ..." idiom that we are already using to update the homepage tables (#1547).
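A sketch of that idiom, with made-up table and column names (readers keep seeing the old rows until the transaction commits, because a plain DELETE, unlike TRUNCATE, does not block concurrent reads):

```sql
BEGIN;
DELETE FROM homepage_top_givers;        -- hypothetical read-optimized table
INSERT INTO homepage_top_givers
     SELECT tipper, sum(amount) AS total
       FROM current_tips                -- hypothetical source table
   GROUP BY tipper
   ORDER BY total DESC
      LIMIT 100;
COMMIT;
```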
Updating balances in the participants table using precalculated data in a separate balances table (username, balance) takes 1.2s, and I do not see what value that brings. We can easily (and cheaply) do it at read time (32ms):
```sql
select count(*)
  from participants p
  left join balances b on p.username = b.username;
```
Based on the above experiments I am leaning towards approximating column-oriented storage for the read-optimized tables (separate tables for balances, dollars_receiving, dollars_giving, and maybe some others). They can still be updated in place (option 1), but recalculation is also fast, and we keep the option of moving to materialized views if we so choose.
I guess the main reason this gets the best performance is that we don't store the zeros and NULLs and thus deal with less data overall (the balances table has only 2.9k rows, which is the number of active Gittip users of all time, compared to 55k rows in the participants table).
Can you tell from past history when 9.4 is likely to be out?
Looking at the Postgres release timeline, I would say there is one release per year. So 9.4 is likely to be available next fall.
@zwn I'm seeing two concerns in your blog post:
Sounds like (1) is about optimizing space and (2) is about optimizing time.
On the schema (1), some initial googling suggests that NULL values may not actually take up space:
In which case, I think we should stick with multi-column tables, because those are easier to reason about at the application layer (someone working in the simplate layer wants to know that they can do participant.balance and have it "just work").
On the update strategy (2), I'm inclined to do incremental updates (option 1), because it's the fastest. It introduces more bug risk, but part of the point of having the event log is assuring ourselves that we can recover from corruption when we need to. We have had some deep bugs so far (most recently #1704), but I don't have the sense that our software is buggy overall or that we're swamped by lots of really bad bugs. And we are already mitigating our bug risk via better testing, documentation, and code review practices.
It remains to be seen what incremental updates will feel like at the application layer.
I really like what is getting talked about in here. Can I help?
In @zwn's "blog post" above, he describes different mechanisms for updating the fast-read views (I'm not using "view" as a technical term; please don't read a specific implementation into that word) of the data. One of the great things about having a log of actions is that if the fast-read views get corrupted, or even lost completely, we can rebuild them from the log. We will want to implement a slow (maybe batch) mechanism for completely regenerating the fast-read views, even if we don't use it very often. With that safety net in place, we can run fast updates as normal operations, even if there are risks. The fast side wouldn't even have to be in a disk-based database. It could live in Redis or some other fast, but maybe not rock-solid, data store.
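The slow rebuild is essentially a fold over the log from the beginning. A toy sketch (the event shapes are invented for illustration, not the real log schema):

```python
def rebuild_balances(events):
    # Fold over the entire event log to recompute every balance from scratch.
    balances = {}
    for e in events:
        if e["action"] == "deposit":
            balances[e["participant"]] = balances.get(e["participant"], 0) + e["amount"]
        elif e["action"] == "transfer":
            balances[e["tipper"]] = balances.get(e["tipper"], 0) - e["amount"]
            balances[e["tippee"]] = balances.get(e["tippee"], 0) + e["amount"]
    return balances

log = [
    {"action": "deposit", "participant": "bob", "amount": 10},
    {"action": "transfer", "tipper": "bob", "tippee": "alice", "amount": 1},
]
assert rebuild_balances(log) == {"bob": 9, "alice": 1}
```

Run periodically, the same fold can double-check that the incrementally updated fast-read views still agree with the log.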
We could run the slow mechanism periodically to double check that our fast-views remain consistent with the log data.
The next level of sophistication is to have two tiers of fast-read. We would only do this down the road, but having the log in place makes it possible. One tier of fast-read view is updated only occasionally, by a slow, batch, rock-solid process. Then a "what's changed recently" view is layered on top of the slower-moving fast-read view. Most NoSQL datastores are rock solid, screamingly fast, and highly scalable in read-only mode. (They can get into trouble with updates: no promise of consistency.) So, a NoSQL layer for all the data in a fast-read view, with a layer on top (maybe Redis, maybe just another chunk of the same NoSQL engine) for recent changes. Build everything with the understanding that the quick-changing fast-read view will fail on occasion. Be prepared to rebuild it. Since it only contains recent changes, one need only read the recent part of the log to build it. This two-tiered approach requires an abstraction layer for reading data, to correctly mix the recent and older data.
We could run the slow mechanism periodically to double check that our fast-views remain consistent with the log data.
I love this idea. One of the most impressive programming feats I've been privileged to witness was an upgrade to a major new version of a proprietary, internal database that went off without a single hitch. The developer on the project ran both versions in parallel for months, and programmatically compared the logs of both (he added significant logging to the old version to support this) until he was sure that they were performing consistently.
Yay! Someone cares :sunny:. I was thinking about running the slower "rebuild everything" from the self check I've started in #1768. In that branch I am collecting all the knowledge I currently have about the business relationships in our data. This knowledge has been scattered around the code and has not been entirely obvious. Anyone who can review that and tell me whether the tests there make sense would be a great help. Once we have that merged we will be one step closer to taming this beast :wink:
Combining this with different types of stores (Redis, memcached, etc.) would make sense in the future when Gittip is huge. It is good to start thinking about it soon enough that we can tailor the backend the right way.
Note to @whit537 and @seanlinsley, look up event sourcing.
http://blog.balancedpayments.com/the-ledger/ http://blog.balancedpayments.com/state-machines/ http://www.martinfowler.com/apsupp/accounting.pdf Filing it here so we keep everything in one place.
Good call.
Since I was asked for updates on #1993... I have implemented this 2.5 times already but was not happy with the result. I am starting another run; hopefully this time it "works". I am not asking anyone to block on this; I will pick up the changes as I go.
Currently we have a mix of different approaches. Decide on a single way and implement it.
There is a $15 open bounty on this issue. Add to the bounty at Bountysource.