gratipay / gratipay.com

Here lieth a pioneer in open source sustainability. RIP
https://gratipay.news/the-end-cbfba8f50981
MIT License

Clean up db schema #1549

Closed zbynekwinkler closed 6 years ago

zbynekwinkler commented 11 years ago

Currently we have a mix of different approaches. Decide on a single way and implement it.

--- There is a **[$15 open bounty](https://www.bountysource.com/issues/991604-clean-up-db-schema?utm_campaign=plugin&utm_content=tracker%2F85909&utm_medium=issues&utm_source=github)** on this issue. Add to the bounty at [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F85909&utm_medium=issues&utm_source=github).


zbynekwinkler commented 11 years ago

My preferred solution would involve: no magic, read-optimized tables, and storing user actions.

No magic means slightly more work when implementing, but it results in an approachable code base, allows for easier review (requires less context from the reviewer), and enables straightforward reasoning about performance (big O).

Read-optimized tables (big O) let us ignore the performance details of Postgres. For anyone interested, I suggest reading several chapters of PostgreSQL 9.0 High Performance to see everything that can go wrong, for example in query planning without proper maintenance of the db. The more complex the query, the more can (will?) go wrong.

Storing user actions allows us to restore the current state into whatever schema we choose. It guards us against errors in the code, business logic, etc. A generic table with timestamp, participant and data (probably hstore) columns would be my preferred way.
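Something like this is what I have in mind (a minimal sketch; the column names and the hstore payload are just one possibility):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE log
( id          bigserial    PRIMARY KEY
, ts          timestamptz  NOT NULL DEFAULT now()
, participant text         NOT NULL
, data        hstore       NOT NULL
);

-- Example entry: bob sets his tip to alice to $1.
INSERT INTO log (participant, data)
     VALUES ('bob', hstore(ARRAY['action', 'set_tip', 'tippee', 'alice', 'amount', '1.00']));
```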

On the other hand I do not want to throw out the baby with the bath water so let's discuss :smile:.

chadwhitacre commented 11 years ago

I've asked @zwn to lead the charge on this one. IRC

Keep scrolling on that IRC link; basically, the single events table design sounds sensible to me. The reason we have multiple logging tables (see #1502 for details) is historical accident, not intentional design.

Anyway, @zwn is in charge of this now.

zbynekwinkler commented 11 years ago

I have gone over all (or maybe just most?) of the pages we have and noted what data is required where (just data reading, not writing). Would something like this fit somewhere in the developer documentation? Feel free to comment on/improve the list. It should form a list of requirements for the data layer: this is what should be optimized for reading. Did I miss anything?


zbynekwinkler commented 11 years ago

I've studied materialized views a bit to see what could be done with them. The problem seems to be that the view cannot yet be refreshed concurrently, which means the refresh locks the view. If we used materialized views for the homepage tables, we would not be able to query them while the refresh is running. It is the same behavior as using TRUNCATE while rebuilding the tables (https://github.com/gittip/www.gittip.com/issues/1541#issuecomment-26535261). Refreshing the view concurrently is planned only for Postgres 9.4.
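For reference, this is roughly the pattern I was testing (the view name and the query are simplified illustrations, not the real homepage query):

```sql
CREATE MATERIALIZED VIEW homepage_givers AS
    SELECT tipper AS username, sum(amount) AS giving
      FROM tips
  GROUP BY tipper;

-- On Postgres 9.3 this takes an exclusive lock, so reads block until it finishes:
REFRESH MATERIALIZED VIEW homepage_givers;

-- REFRESH MATERIALIZED VIEW CONCURRENTLY (which also needs a unique index on the
-- view) is only planned for 9.4.
```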

chadwhitacre commented 11 years ago

Refreshing the view concurrently is planned only for Postgres 9.4.

Oh well. I guess we wait. Can you tell from past history when 9.4 is likely to be out?

chadwhitacre commented 11 years ago

Filing this here:

http://www.paperplanes.de/2013/10/18/the-smallest-distributed-system.html

zbynekwinkler commented 11 years ago

I've been exploring what can be done with the triggers and rules systems in 50e298c and e9572b5.

I see several possible ways to implement the log table: on the db level (with triggers or rules) or on the application level.

The advantage of doing it on the db level is that hand-made changes done through psql are also logged. The disadvantage is that we cannot tie the action to a signed-in user. A further disadvantage is that writing triggers means dealing with yet another language, plpgsql. This could be somewhat mitigated by using plpython, but that is not possible on the current Heroku offering. So I am starting to lean towards the application-level implementation, on the grounds that it is going to be easier to maintain.
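A rough sketch of the db-level variant, assuming the generic log table and hstore extension sketched earlier (names are illustrative):

```sql
CREATE OR REPLACE FUNCTION log_participant_changes() RETURNS trigger AS $$
BEGIN
    -- Store only the columns whose values actually changed, keyed by username.
    INSERT INTO log (participant, data)
         VALUES (NEW.username, hstore(NEW) - hstore(OLD));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_participants
    AFTER UPDATE ON participants
    FOR EACH ROW EXECUTE PROCEDURE log_participant_changes();
```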

I'd appreciate any feedback for and/or against any of the variants.

chadwhitacre commented 11 years ago

@zwn What does it look like if we invert the pattern and write to the single event table first as the canonical source, and then work outwards from there to update the normalized tables as downstream consumers?

zbynekwinkler commented 11 years ago

@whit537 I haven't tried to implement it yet. I got somewhat scared by plpgsql and EXECUTE that would have to be used to build and execute the queries. I am much more comfortable in python.

Do you think it is worth exploring in more depth?

chadwhitacre commented 11 years ago

IRC

chadwhitacre commented 11 years ago

Here are a few scaling posts from Facebook:

chadwhitacre commented 11 years ago

We decided to pursue the idea of writing to a single master event log, with read-optimized tables computed from there. We're thinking in terms of eventually having Python workers asynchronously updating the downstream tables (with UI support for this async updating). As a step on the way, though, we're going to look at instrumenting all the places in the app where we currently write to the database to write to a single master event log as an additional part of the transaction. Then once we've lived with this for a while we'll start inverting the app bit by bit to write to the event log in one transaction and update computed tables in a separate transaction. IRC
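As a rough sketch of that intermediate step, each existing write would gain a companion insert into the master event log within the same transaction (the column shown and the log table are illustrative, following the sketches above):

```sql
BEGIN;

-- The write the app already does today:
UPDATE participants
   SET statement = 'I build open source tools.'
 WHERE username = 'bob';

-- The new companion write to the master event log:
INSERT INTO log (participant, data)
     VALUES ('bob', hstore('statement', 'I build open source tools.'));

COMMIT;
```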

zbynekwinkler commented 11 years ago

I am thinking about how to structure the log table. From one point of view it would be good to structure it as close to the user as possible, since this gives us the most resilience against bugs. If user 'bob' says he wants to tip user 'alice' $1, we should record it as such. On the other hand, if we support participant renaming, we will not be able to support a general history page like 'I am bob, show me everything I did on gittip'. That leads us to #835. If we translated 'bob' and 'alice' to their ids and recorded those instead, the record would stay valid no matter what 'bob' and 'alice' are called in the future.

I've decided to make the API user-facing (taking 'alice' and 'bob' as params) while storing the internal ids in the log table. It might make sense to reopen #835.

Also I'd like to convert the different 'toggle' operations to explicit, idempotent set operations. That saves us from searching through all the previous 'toggles' back to the default at the beginning. Fortunately there aren't that many of them.
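For illustration (the anonymous flag is just an example column; the point is that a set operation carries its own end state, while a toggle does not):

```sql
-- Toggle: the meaning of the log entry depends on every toggle before it.
UPDATE participants SET anonymous = NOT anonymous WHERE username = 'bob';

-- Idempotent set: the log entry alone says what the state became.
UPDATE participants SET anonymous = true WHERE username = 'bob';
```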

zbynekwinkler commented 11 years ago

Should we do #412 and #224 before or after?

zbynekwinkler commented 11 years ago

What about branch.sql? Some transforms are not that easy to describe in SQL. What about some branch.py? Instead of doing \i branch.sql in the psql prompt we could do heroku run branch.py.

zbynekwinkler commented 11 years ago

What would be the best way to log the take_over action? Within a takeover, a lot of tips manipulation is going on. I am thinking about logging the action piece by piece. So if only one of the elsewheres were taken over, it would result in one log item. If it was the last one, the tips zeroing would happen one by one via the generic set_tip operation, as would setting up the new tips. The same goes for absorptions. This has the advantage of being able to reconstruct all the tips in a generic way, whether or not they come from a takeover.

chadwhitacre commented 11 years ago

Should we do #412 and #224 before or after?

After, I think. Those two are not as critical as this ticket.

chadwhitacre commented 11 years ago

Some transforms are not that easy to describe in SQL.

E.g.? I haven't felt this limitation yet, but I expect the changes we're making here will be our most significant yet.

chadwhitacre commented 11 years ago

What about branch.sql? Some transforms are not that easy to describe in SQL. What about some branch.py? Instead of doing \i branch.sql in the psql prompt we could do heroku run branch.py.

This makes our dev build more complicated, because we can't just run schema.sql. We should take that complication into account and try to avoid a branch.py if possible.

zbynekwinkler commented 11 years ago

This makes our dev build more complicated, because we can't just run schema.sql. We should take that complication into account and try to avoid a branch.py if possible.

I am so far managing without it.

chadwhitacre commented 11 years ago

On the other hand, if we support participant renaming, we will not be able to support a general history page like 'I am bob, show me everything I did on gittip'. That leads us to #835. If we translated 'bob' and 'alice' to their ids and recorded those instead, the record would stay valid no matter what 'bob' and 'alice' are called in the future.

Well, the way we handle this now is with an ON UPDATE CASCADE on participants.username. Whenever that column changes, any column referencing it as a foreign key also changes.
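Sketched out (trimmed down; the real table has more columns and constraints):

```sql
CREATE TABLE tips
( id     serial         PRIMARY KEY
, ctime  timestamptz    NOT NULL DEFAULT now()
, tipper text           NOT NULL REFERENCES participants (username)
                                 ON UPDATE CASCADE ON DELETE RESTRICT
, tippee text           NOT NULL REFERENCES participants (username)
                                 ON UPDATE CASCADE ON DELETE RESTRICT
, amount numeric(35, 2) NOT NULL
);

-- Renaming a participant rewrites every referencing row automatically:
UPDATE participants SET username = 'robert' WHERE username = 'bob';
```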

zbynekwinkler commented 10 years ago

@whit537 I am wrestling with what to do with stuff like participant.py#L574. It creates a direct link between the inner workings of payday, a fairly general participant class, and the current db schema. I'd rather keep the payday algo within the payday files and run as few queries over the log table as possible (it will keep growing, and unbounded queries will get progressively slower over time).

I am considering an alternative implementation. For now I am going with a 'current tips' table, which mirrors the existing tips view but is an actual table. This would allow us to create a copy of it when payday starts and then run the payday algo against that copy.

That would keep payday restartable the way it is now, but we would not have to search through the whole history table to find out which tips were current when the payday started. Breaking the link between how payday works and how the data is logged would let us try different things in the future, like running payday from a backup db and bringing over just the results (as was suggested somewhere), and also things like running the algo "from memory" (https://github.com/gittip/www.gittip.com/issues/1486#issuecomment-27225090).
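Roughly what I have in mind (assuming the 'current tips' table described above; names are illustrative):

```sql
-- Snapshot the tips in force when payday starts and run the algo against the copy.
CREATE TABLE payday_tips AS
    SELECT tipper, tippee, amount
      FROM current_tips
     WHERE amount > 0;
```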

chadwhitacre commented 10 years ago

@zwn Sounds reasonable on the face of it. Would it be helpful to schedule a Hangout to go over your work on this so far? This is a huge refactor and it probably wouldn't hurt for me to keep up with what you're doing rather than waiting until "the end."

zbynekwinkler commented 10 years ago

@whit537 I've been really busy in my day job lately so not much work has been done. Also I will not be able to get back to this until this coming weekend. I'd like to push something for review on Sunday.

chadwhitacre commented 10 years ago

@zwn Thanks for the heads up. As discussed on IRC let's aim to have this landed by the end of December (firefighting and PR merging notwithstanding).

chadwhitacre commented 10 years ago

I'd like to push something for review on Sunday.

Monday afternoons/evenings seem to line up for us to be online at the same time. Maybe we can plan to review together then?

chadwhitacre commented 10 years ago

I have an event next Monday evening at 6:30 that I'll have to leave for at 5:30 or so. Hopefully we can spend some time together before that.

zbynekwinkler commented 10 years ago

@whit537 What would be the best way to deal with make_participant? It bypasses all the calls I've set up in Participant.py to do the logging :(

zbynekwinkler commented 10 years ago

I am thinking about overriding Participant:set_attributes to be the entry point to logging changes to the participant table.

zbynekwinkler commented 10 years ago

OK, so that won't work. Model:set_attributes is used every time a new instance is created, not only when it is changed/updated. Anyway, the idea stays the same, just using set_attributes_logged instead.

zbynekwinkler commented 10 years ago

@whit537 What was the design decision behind creating the username_lower column in the db? I have found #503 but I do not see the reason there. Why didn't we add a unique constraint on lower(username) instead?
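The alternative I have in mind would be something like:

```sql
-- No extra column, just a functional unique index:
CREATE UNIQUE INDEX participants_username_lower_key
    ON participants (lower(username));

-- Canonicalizing lookups then go through the same expression:
SELECT username FROM participants WHERE lower(username) = lower('ZwN');
```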

chadwhitacre commented 10 years ago

@zwn Have you done an ack/grep for username_lower? I think it's mostly used for canonicalizing. I.e., when I hit:

https://www.gittip.com/ZwN/

chadwhitacre commented 10 years ago

Also, I have a friend in business intelligence at www.pnc.com. His kids are starting to come online as db admins and he's asked if I have anything in Gittip for them to work on to get some experience and references. We're setting up a meeting. Let's see if there's any way we can use them here?

chadwhitacre commented 10 years ago

@zwn Re: make_participant, it's hard for me to say without understanding how you've implemented logging. It's designed to be a quick and dirty way to get a participant object fixture for a test case. I guess that's a casualty of moving to this new schema design (log everything low-level, build normalized tables downstream). Can I see some code to wrap my head around this?

zbynekwinkler commented 10 years ago

One thing I've noticed while playing with the log table is that we write to the db really often. Starting with an empty db and signing in with GitHub, I ended up with 19 rows in the log, 13 of them session_expires updates. Looking at the code, I see that we indeed write to the db on each and every request from an authenticated user.

zbynekwinkler commented 10 years ago

Unfortunately it is not that obvious what to do instead. If we get a solution for #715 (like using signed cookies) we can use it to encode the expiration time into the cookie value and renew it only once a day.

zbynekwinkler commented 10 years ago

Actually it wasn't that hard. Since we already have the Participant loaded, it is easy to compare against the stored expires and update only once a day.
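Expressed purely in SQL the idea is roughly this (the intervals are illustrative, not our actual session lifetime; in the app we compare against the already-loaded value first):

```sql
-- Renew the session at most once a day instead of on every request.
UPDATE participants
   SET session_expires = now() + interval '7 days'
 WHERE username = 'bob'
   AND session_expires < now() + interval '6 days';
```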

chadwhitacre commented 10 years ago

Actually it wasn't that hard since we already ...

Cool. I was going to suggest that we could also use redis/memcache for storing sessions instead of Postgres, but if you've solved it w/ Postgres that keeps things simpler.

zbynekwinkler commented 10 years ago

At the risk of making this issue long and convoluted, I'll add some observations that can help us decide on the best way forward. [After rereading I think I should get a blog :wink:]

I am exploring the best way to calculate the read-optimized tables from the log of actions. Obviously there are some trade-offs: readability of the code, risk of bugs, execution speed, etc.

  1. Do only the smallest necessary updates to the read tables - i.e. when doing a transfer of money between users take the current balances and decrement one and increment the other.
  2. Recalculate the affected data from scratch - for our example with the transfer that would mean recalculating the two balances from all the current info about transfers (inside gittip) and exchanges (outside gittip) for the two users.
  3. Recalculate the type of data that was affected - again with the transfer that would mean to recalculate all the balances of all the users.

The solutions above seem to be ordered from the fastest to the slowest. But is this true? The running time of https://github.com/gittip/www.gittip.com/issues/1118?source=c#issuecomment-29026186 got me wondering whether this assumption is right and how big the difference might be, because the solutions are also ordered by resiliency to bugs, from the least resilient to the most.
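To make the comparison concrete, here are rough sketches of options 1 and 2 for the transfer example (simplified; the real recalculation would follow the #1118 query, including sign conventions and fees):

```sql
-- Option 1: smallest necessary update; touch only the two rows involved.
UPDATE participants SET balance = balance - 1.00 WHERE username = 'bob';
UPDATE participants SET balance = balance + 1.00 WHERE username = 'alice';

-- Option 2: recalculate the two affected balances from scratch, from all
-- transfers (inside gittip) and exchanges (outside gittip) for the two users.
UPDATE participants p
   SET balance = coalesce((SELECT sum(amount) FROM exchanges WHERE participant = p.username), 0)
               + coalesce((SELECT sum(amount) FROM transfers WHERE tippee = p.username), 0)
               - coalesce((SELECT sum(amount) FROM transfers WHERE tipper = p.username), 0)
 WHERE p.username IN ('bob', 'alice');
```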

The tl;dr conclusion is that simple aggregation is really fast (doing one sequential scan through a table). Adding joins, subselects or partial updates can make it crawl. The times I get on my computer are:

  1. 15ms
  2. 127ms with query from #1118 run twice, targeted by username
  3. different times for different implementations
    • using subquery (update balance set (select where subquery)) 5s
    • using with statement to precalculate and selecting from it to update 1.6s
    • using the query from #1118 and storing the result as a new table (username, balance) 260ms
    • doing (delete from; insert into) 160ms

The disadvantage of the solutions involving recalculation is that their runtime is going to increase as we accumulate more data (the transfers table has 133k rows and one seq scan costs around 80ms). If we expect exponential growth of gittip in 2014, we could quickly run out of options.

Which brings me back to materialized views. We have Postgres 9.3 available, which has them, but refreshing one needs an exclusive lock. If I had to guess, I would say that to refresh a materialized view Postgres drops the data and recreates it, so the performance could be similar to option 3 above. How fast it is most likely depends on how fast the query behind the view is. During the refresh the view would be locked (when we grow, that could soon be a pretty long time).

Another way to implement option 3 is to do the materialized views "by hand" using the idiom "DELETE FROM table; INSERT INTO table ...", which we are already using to update the homepage tables (#1547).
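Applied to, say, a separate balances table, that idiom looks roughly like this (simplified; signs and fees would need to follow the #1118 query):

```sql
BEGIN;

DELETE FROM balances;

INSERT INTO balances (username, balance)
     SELECT username, sum(amount)
       FROM (SELECT participant AS username, amount FROM exchanges
              UNION ALL
             SELECT tippee AS username, amount FROM transfers
              UNION ALL
             SELECT tipper AS username, -amount FROM transfers
            ) AS movements
   GROUP BY username;

COMMIT;
```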

Updating balances in the participants table using precalculated data in a separate balances table (username, balance) takes 1.2s, and I do not see what value it brings. We can easily (and cheaply) do it at read time (32ms):

```sql
select count(*)
from participants p
left join balances b on p.username = b.username;
```

Based on the above experiments, I am leaning towards approximating column-store engines for the read-optimized tables (separate tables for balances, dollars_receiving, dollars_giving and maybe some others). They can still be updated in place (option 1), but the recalculation is also fast, and we keep the option of moving to materialized views if we so choose.

I guess the main reason this gives the best performance is that we are not storing the zeros and nulls and are thus dealing with less data overall (the balances table has only 2.9k rows - that is, the number of active gittip users of all time - compared to 55k rows in the participants table).

zbynekwinkler commented 10 years ago

Can you tell from past history when 9.4 is likely to be out?

Looking at the Postgres release timeline, I would say that there is one release per year. So 9.4 is likely to be available next fall.

chadwhitacre commented 10 years ago

@zwn I'm seeing two concerns in your blog post:

  1. Schema for read-optimized tables (one big participants table vs. separate tables).
  2. Update strategy for read-optimized tables (incremental updates vs. recalculations).

Sounds like (1) is about optimizing space and (2) is about optimizing time.

On the schema (1), some initial googling suggests that NULL values may not actually take up space:

In which case, I think we should stick with multi-column tables, because those are easier to reason about at the application layer (someone working in the simplate layer wants to know that they can do participant.balance and have it "just work").

On the update strategy (2), I'm inclined to do incremental updates (option 1), because it's the fastest. It introduces more bug risk, but part of the point of having the event log is that we assure ourselves we can recover from corruption when we need to. We have had some deep bugs so far (most recently #1704), but I don't have the sense that our software is buggy overall or that we're swamped by lots of really bad bugs. And we are already mitigating our bug risk via better testing, documentation, and code review practices.

It remains to be seen what incremental updates will feel like at the application layer.

bruceadams commented 10 years ago

I really like what is getting talked about in here. Can I help?

In @zwn's "blog post" above, he describes different mechanisms for updating the fast-read views (I'm not using "view" as a technical term; please don't read a specific implementation into that word) of the data. One of the great things about having a log of actions is that if the fast-read views get corrupted, or even lost completely, we can rebuild them from the log. We will want to implement a slow (maybe batch) mechanism for completely regenerating the fast-read views, even if we don't use it very often. With that safety net in place, we can run fast updates as normal operations, even if there are risks. The fast side wouldn't even have to be in a disk-based database. It could live in Redis or some other fast, but maybe not rock solid, data store.

We could run the slow mechanism periodically to double check that our fast-views remain consistent with the log data.

The next level of sophistication is to have two tiers of fast-read. We would only do this down the road, but having the log in place makes it possible. One tier of fast-read view is updated only occasionally, by a slow, batch, rock-solid process. Then a "what's changed recently" view is layered on top of that slower-moving fast-read view. Most NoSQL datastores are rock solid, screamingly fast, and highly scalable in a read-only mode. (They can get into trouble with updates, with no promise of consistency.) So: a NoSQL layer for all the data in a fast-read view, with a layer on top (maybe Redis, maybe just another chunk of the same NoSQL engine) for recent changes. Build everything with the understanding that the quickly changing fast-read view will fail on occasion. Be prepared to rebuild it. Since it only contains recent changes, one need only read the recent part of the logs to build it. This two-tiered approach requires an abstraction layer for reading data that correctly mixes the recent and older data.

chadwhitacre commented 10 years ago

We could run the slow mechanism periodically to double check that our fast-views remain consistent with the log data.

I love this idea. One of the most impressive programming feats I've been privileged to witness was an upgrade to a major new version of a proprietary, internal database that went off without a single hitch. The developer on the project ran both versions in parallel for months, and programmatically compared the logs of both (he added significant logging to the old version to support this) until he was sure that they were performing consistently.

zbynekwinkler commented 10 years ago

Yay! Someone cares :sunny:. I was thinking about running the slower "rebuild everything" from the self check I've started in #1768. In that branch I am collecting all the knowledge I currently have about the business relationships in the data we have. This knowledge has been scattered around the code and has not been entirely obvious. Anyone who can review that and tell me whether the tests there make sense would be a great help. Once we have that merged we will be one step closer to taming this beast :wink:

The combination with different types of stores (redis, memcached etc.) would make sense in the future when gittip is huge. It is good to start thinking about it soon enough so we can tailor the backend the right way.

joeyespo commented 10 years ago

Note to @whit537 and @seanlinsley, look up event sourcing.

Here's a good place to start. (And a longer version).

zbynekwinkler commented 10 years ago

http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

chadwhitacre commented 10 years ago

Good find, @zwn. We're on the right track.

[Edit to add:]

http://docs.datomic.com/architecture.html

zbynekwinkler commented 10 years ago

http://blog.balancedpayments.com/the-ledger/
http://blog.balancedpayments.com/state-machines/
http://www.martinfowler.com/apsupp/accounting.pdf

Filing these here so we keep everything in one place.

chadwhitacre commented 10 years ago

Good call.

zbynekwinkler commented 10 years ago

Since I was asked for updates on #1993... I have implemented it 2.5 times now already, but was not happy with the result. I am starting another run; hopefully this time it "works". I am not asking anyone to block on this; I will pick up the changes as I go.