Clean up db schema - Githubissues

zbynekwinkler commented 11 years ago

Currently we have a mix of different approaches. Decide on a single way and implement it.

--- There is a **[$15 open bounty](https://www.bountysource.com/issues/991604-clean-up-db-schema?utm_campaign=plugin&utm_content=tracker%2F85909&utm_medium=issues&utm_source=github)** on this issue. Add to the bounty at [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F85909&utm_medium=issues&utm_source=github).

There is a $15 open bounty on this issue. Add to the bounty at Bountysource.

chadwhitacre commented 10 years ago

The eating of an elephant begins with the first step. Can we start with something really simple?

create an events table with two columns: id and event (json).
update one method to also write to the events table. Maybe models.participant.Participant.set_tip_to?

chadwhitacre commented 10 years ago

The eating of an elephant begins with the first step.

:elephant: :runner:

zbynekwinkler commented 10 years ago

I am starting with something simple and believe me, set_tip_to is not it. Tips handling is easily the most complicated stuff there is in gittip. I am starting the participants table attributes.

chadwhitacre commented 10 years ago

@zwn Are you trying to read and write from the events table? I was suggesting we start by simply logging to it.

chadwhitacre commented 10 years ago

Tips handling is easily the most complicated stuff there is in gittip.

Hey! What about take_over? <:-)

chadwhitacre commented 10 years ago

Suggestion from @adewes at PyCon 2014 sprint is that we think in terms of three separate use cases:

Higher-order tables are source of truth. So in the context of is_suspicious (#2261) the participants.is_suspicious field would be the source of truth.
User-facing activity stream is implemented as a non-critical event log. Basically what we're doing with the event table as of #2006.
WAL archiving is used for data corruption recovery and replay.

chadwhitacre commented 10 years ago

I want to revisit materialized views, which surfaced above and then again in the so-called "blog post."

chadwhitacre commented 10 years ago

Our goal is to have both fine-grained data and high-level information. We have two basic strategies:

granular source of truth + views
coarse source of truth + event log

The granular source of truth is what we've been doing with tips, communities, and team memberships. We insert a new record every time someone updates a tip or team or community membership. Then we have views to see who is tipping whom right now (etc.). The problem is that views are expensive.

The coarse source of truth is what we've been doing with some fields of the participants table, logging changes into a secondary events table. We were doing this in Postgres using RULEs up until #2006, when we replaced that with application-layer logging into a single events via add_event.

chadwhitacre commented 10 years ago

A big underlying question here is the extent to which we're implementing logic in Postgres vs. Python. From @zwn:

"You have used perhaps all the standard and nonstandard postgres features there are" (src).
"I'd would pick straight SQL over all the cool hacks only the pros know anytime" (src).

chadwhitacre commented 10 years ago

I like implementing schema in Postgres. It's a powerful and mature system.

chadwhitacre commented 10 years ago

Here is how to manually implement materialized views in Postgres:

http://www.pgcon.org/2008/schedule/attachments/63_BSDCan2008-MaterializedViews.pdf

Here are ancillary documents:

http://www.pgcon.org/2008/schedule/attachments/64_BSDCan2008-MaterializedViews-paper.pdf — slideshow summarizing the above
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views — prior art that the above is based on
http://wiki.postgresql.org/wiki/Materialized_Views — wiki page where I discovered the above

chadwhitacre commented 10 years ago

Hashing out coarse vs. granular approach here at PyCon sprint with @adewes @boxhead2000 ...

chadwhitacre commented 10 years ago

Q: How about archiving? Five years in we have TONS of old granular data and we want to compact.

A: As long as our refresh function is incremental and not snapshotting then we should be able to archive as needed.

chadwhitacre commented 10 years ago

The biggest drawback to implementing materialized views in Postgres is that @zwn @adewes et al. don't want logic in Postgres, they want it in Python.

chadwhitacre commented 10 years ago

So what if we implemented materialized views in Python?

seanlinsley commented 10 years ago

I'm +1 on materialized views in Postgres, FWIW

chadwhitacre commented 10 years ago

The fundamental pattern is to insert into a granular table, and then trigger a refresh function to update a higher-order table. This may cascade.

So for example with communities we would have something like:

INSERT into communities
Trigger a refresh of current_communities
Trigger a refresh of community_summaries

chadwhitacre commented 10 years ago

The granular approach gives us many logging tables (essentially) that we'd want to collate into a stream or events view.

chadwhitacre commented 10 years ago

Seems like implementing at the application layer would give us flexibility to swap out the underlying datastore as we scale—premature optimization, or wise architectural decision?

chadwhitacre commented 10 years ago

I'm going to look at implementing view materialization in Postgres for teams.

seanlinsley commented 10 years ago

Seems like implementing at the application layer would give us flexibility to swap out the underlying datastore as we scale

I'm not convinced there's a better database to move to, even if we did need to "scale"

chadwhitacre commented 10 years ago

I think we can get a long way with Postgres.

adewes commented 10 years ago

OK so following our discussion, here's how I would organize the DB schema. Please add your comments and remarks, I'm just familiarizing myself with the codebase and I probably don't see all the use cases, so please correct me if I got it wrong.

In general, I would try to keep the current state of objects (i.e. participants, communities, teams and their respective data and membership) and the history of this state (i.e. when has a user joined a given community) in separate tables. In my opinion, this has several advantages:

1) Updating state and aggregating data (i.e. how many users are in a given community) becomes more straightforward 2) No materialized or calculated views are necessary for generating state information 3) The number of rows in the state table does not grow linearly with the number of user actions (e.g. each time a user joins a community) but only with the actual number of objects and/or relationships. 4) Event logging can be done in a more straightforward way using Postgres triggers, without modifying the Python code or application logic (see below)

So here's the proposed layout (all arrows represent m:n relations):

              +––––––––––––––––––+                
      +–––––––+participant_teams +–+              
      |       +––––––––––––––––––+ |              
      |                            |              
      |                            |              
+–––––+–––––+                  +–––++             
|participant|                  |team+–––+         
+––––––––+––+                  +––––+   |         
         |                              |         
         |                              |         
         |                              |         
 +–––––––+–––––––––––––––+    +–––––––––+––––––––+
 |participant_communities|    | team_communities |
 +––––––––––+––––––––––––+    +–––––––––+––––––––+
            |                           |         
            |                           |         
            |                           |         
            |       +–––––––––+         |         
            +–––––––+community+–––––––––+         
                    +–––––––––+

The M2M tables (participant_communities, team_communities, participant_teams) would roughly look like this:

    create table participant_communities (
    participant_id unsigned int REFERENCES participants,
    community_id unsigned int REFERENCES communities,
    suspicious boolean,
    state unsigned int, # e.g. 0 = pending, 1 = active, 2 = banned, ...
    ctime datetime,
    mtime datetime
    #possibly more status information
    )

participant_id and community_id are foreign keys pointing to the primary keys of the communities and participants tables, respectively. If a participant joins a community, the SQL operation would look like this:

insert into participant_communities (participant_id,community_id,suspicious) values (1,100, suspicious = false)

To get all participants in a given community xxx, the query would be

select * from participant where participant_id in (select participant_id from participant_communities where community_id = xxx)

To get all communities for a given participant yyy

select * form communities where community_id in (select community_id from participant_communities where participant_id = yyy)

For the other relations between participants and teams as well as teams and communities it could work in the same way. Instead of using numerical ids for referencing to other tables one could use strings as well (e.g. usernames), this would make updates of these values more difficult though.

Event Logging

For logging event data, we could create automatic Postgres triggers that create rows in an event table each time a row in another table is inserted, updated or deleted.

CREATE TRIGGER log_community_updates
    BEFORE UPDATE ON user_communities
    FOR EACH ROW
    EXECUTE PROCEDURE log_event('community');

#Not checked if this works actually...
CREATE OR REPLACE FUNCTION log_event() RETURNS TRIGGER AS $log_event$
    BEGIN
        --
        -- Create a row in event_log to reflect the operation performed on the state table,
        -- make use of the special variable TG_OP to work out the operation.
        --
        IF (TG_OP = 'DELETE') THEN
            INSERT INTO event_log SELECT 'delete', now(), row_to_json(OLD.*);
            RETURN OLD;
        ELSIF (TG_OP = 'UPDATE') THEN
            INSERT INTO event_log SELECT 'update', now(), row_to_json(NEW.*);
            RETURN NEW;
        ELSIF (TG_OP = 'INSERT') THEN
            INSERT INTO event_log SELECT 'insert', now(), row_to_json(NEW.*);
            RETURN NEW;
        END IF;
        RETURN NULL;
    END;
$log_event$ LANGUAGE plpgsql;

the log_event function receives the old and new row values (http://www.postgresql.org/docs/9.2/static/plpgsql-trigger.html) and can insert them in a logging table. We could also define a separate logging function for each table or even operation, in order to e.g. store table-specific information in the logging table or have different logging tables for different state tables. This approach has the advantage of completely eliminating all calls to create_event in the Python code as well.

What do you think?

chadwhitacre commented 10 years ago

Another blog post! :dancer:

!m @adewes

clone1018 commented 10 years ago

For logging event data, we could create automatic Postgres triggers that create rows in an event table each time a row in another table is inserted, updated or deleted.

Yes! Yes! Yes!

arothenberg commented 10 years ago

Someone was using my github as a private repo(with my permission) so I just changed my name and kicked him off.

Impressive - adewes. I'd hire you if I had a business. It looks like a good ole normalized rdbs schema.

My 2 cents(I don't know how the $$$contributions are given so take this with a grain of salt) - The application works well right now so there is no pressing need to reinvent it. If it's possible, I would first try and rename the membership table to Team. Membership is not a very accurate name. But since there are only currently a few tables it's not crucial that the schema be perfect. It works and that's more important.

One thing I don't see is groups addressed. Is there a further purpose in identifying a participant as a group other than as a permission for team creation or as just a descriptor? iow are there tables/process/functions that are specific to group(not team) participants? Also, are you tracking or planning to track group(not team) members? If so then adewes schema should fold groups into Team and flag the record as "group" or "team".

Anyway, it was great meeting all of you and it is very impressive that you have a working crowd source project up. I can't imagine how hard it would be to get something like this going. I know I couldn't.

adewes commented 10 years ago

@arothenberg @whit537 Yeah would be interesting to discuss how groups, teams and participants are related to each other. So far my understanding is the following:

-Participants can either be "singular" (i.e. representing one person) or "plural" (i.e. representing a group of persons). In the latter case they will be considered as a "group" and be able to add other participants to the group, thus creating a team. Is that correct @whit537 ?

arothenberg commented 10 years ago

Since I'm not very informed about this - this should be my last post.

Anyway, as per groups, my understanding is that it is a permissions flag as you stated. But in reality a group is also composed of multiple people(some may even be participants?) and contributions to groups may be handled differently. Your schema does not address the contributions and how they are allocated/recorded. What I'm trying to convey is that the business logic could alter your schema. From a superficial reading I think your schema looks perfect. But ours is a superficial gleaning of the Business logic. @whit537 would be the one to weigh in on this.

Take care and good luck.

zbynekwinkler commented 10 years ago

For logging event data, we could create automatic Postgres triggers that create rows in an event table each time a row in another table is inserted, updated or deleted.

Yes! Yes! Yes!

No, no, no :wink: With a structure like that everyone is going to be afraid to even touch the db because of the hidden stuff. And when they do, they break it. Using the add_event call is simple enough, isn't it? Also you don't have the right kind of info at the db level (like who is doing the action). BTW: I am really sorry I am not able to participate more at this time. At least I try to monitor what is happening.

chadwhitacre commented 10 years ago

So far my understanding is the following: [...] Is that correct @whit537 ?

@adewes Yup! Some participants are groups. Some groups are teams.

chadwhitacre commented 10 years ago

With a structure like that everyone is going to be afraid to even touch the db because of the hidden stuff. And when they do, they break it.

If someone is afraid to touch the db that could be a sign that they're self-limited as a developer. Data is at the heart of software, and Postgres is a robust, mature, well-understood system for managing data. Why not use it? Why implement core data logic in Python instead of in Postgres?

That's question number one, and I don't think we've properly addressed it head-on yet. I see us as having taken one small step down each path, app-centric and db-centric. Should we take two steps down each path? What can we say based on our current level of experience with each pattern?

Question number two is whether our source of truth is granular or coarse. Do we update the participants table and then log to events (coarse)? Or do we insert into a low-level table and then bubble that up into a participants view/table (granular)?

Question number three is how questions number one and two are related. :-)

chadwhitacre commented 10 years ago

Having lived with the add_event pattern for a while now, I prefer the multiple log tables approach. When a separate add_event call is required ("coarse-grained source of truth"), then it introduces a chance for drift between how we log the event and the change we actually made. By contrast, if the fine-grained log is our source of truth, then there is no chance for drift.

I'm looking at the current update_email implementation on #2752. We log a set action with a current_email value, but what about the nonce and ctime? Maybe we don't care to keep those values around. I don't like the idea that we lose that info. Mightn't we want that for debugging? And more to the point: I want to change how we're doing email verification, and with the add_event pattern I have to make sure to modify the add_event call to stay in sync with the actual event itself.

chadwhitacre commented 10 years ago

@Changaco in IRC:

I'm quite happy that we removed the multiple log tables, it lightens the DB schema, however we could use DB triggers to populate the events table in some cases if we don't want/need to do it in the python code.

chadwhitacre commented 6 years ago

Closing in light of our decision to shut down Gratipay.

Thank you all for a great run, and I'm sorry it didn't work out! 😞 💃

adewes commented 6 years ago

@whit537 really sorry to hear this, it was a great project and as well as team behind it! I was very happy meeting you at the PyCon three years ago and I wish you good luck for your next project whatever that may be!

gratipay / gratipay.com

Clean up db schema #1549

Event Logging