python-discord / site

pythondiscord.com - A Django and Bulma web application.
https://pythondiscord.com
MIT License
638 stars 127 forks source link

Redesign infractions schema #752

Open MarkKoz opened 4 years ago

MarkKoz commented 4 years ago

Current problems

The current design has some limitations. It cannot store a separate actor, timestamp, or reason for pardoning infractions. For the reason field, the current workaround used by Big Brother is to prepend the reason with 'Watched: ' or 'Unwatched: ' to distinguish them. Furthermore, the design cannot keep track of edits made to infractions.

New Design

Action table

Create a new table to store "action" records. Actions can include making an infraction active (i.e. creating it), making it inactive, voiding it, editing it, etc. Each record will have its own actor, reason, and timestamp. The reason field may be optional for certain actions, but that is to be determined.

Edits

Storing edits will make the table behave like an audit table of sorts. It will track the entire history of an infraction along with who is responsible for the changes.

status field

It may be redundant since the status could be determined by doing a query for the most recent action. Furthermore, an expired infraction could be determined by comparing expired_at to the current time. However, in both cases, using the status field is faster and more convenient.

Voiding

It is undesirable for some infractions to be displayed due to a mistake in creating them among other reasons. The void feature will give such infractions a special status to distinguish them from normal inactive infractions, which will enable clients (i.e. the bot) to normally hide the infractions from infraction totals, infraction searches, etc. See python-discord/api#6.

Schema

Still a WIP. The user table is just a dummy to show the relationships.

bild

Edit History

I'm not sure how to store edit history given the different fields with various data types. Maybe serialising them to strings is fine, since that will be done eventually anyway when the client displays the history. Each field edit could be a separate record and associated with the same action or all changes could be stored as JSON in a single record. The latter would not necessarily require a separate table.

Implementation

I realise this has to be done in Django, but I would like to enforce more constraints on the database level where possible so I wrote some rough SQL here. Will also need a partial unique index to ensure only one active and inactive action per infraction. There's some things that don't seem possible to enforce on the db level e.g. disallow an inactive action if the infraction is a note or warning.

Django added enumeration types in 3.0, but, as seen in the SQL, I'd like to explore going further and using enums at the database level too. I found a blog post which has implemented this but I have not looked into it further.

MarkKoz commented 4 years ago

The edit history is proving to be complicated to design a schema for. Maybe not worth doing - no one was really requesting this feature anyway, just thought I may be able to squeeze it in.

scragly commented 3 years ago

After having a thought about this proposal, I came up with the following possible implementation (some details pending).

Schema image

Each action taken related to an infraction will be given a new entry in infraction_activity with a relevant action type, such as: creation, pardon, expiry_edit, reason_edit, voiding, unvoiding

As we don't store expiry values in activities, we could for the sake of clarity append a before and after value to the end of the activity reason, such as: "appeal approved, mute dropped to 1 hour [expires_at: 86400 -> 3600]"

I'd probably prefer a more elegant solution regarding this so I'll think more on that, but it's not really useful to create an entire column just for that one type to use, hence the avoidance.

Everything else though can have it's before/after state inferred based on historical entries, avoiding the need for additional fields.

The columns active and void could potentially be merged as a status column with a tribool (true/false/null), but considerations need to be made regarding index performance as the two fields are treated quite differently:

Performance testing may be in order to find which is best between:

MarkKoz commented 3 years ago

Regarding active, we do need a partial unique index on that column together with the user to ensure they only have one active infraction of that type. Granted, this could be implemented at the API-level, but I think it's more efficient to rely on a constraint failure than to pre-emptively query to ensure there isn't an active infraction already. Also, doing stuff at the db-level is nice when possible, even if we don't necessarily interact with the DB outside of DRF (select queries for stats excluded).

Furthermore, keeping void and active separate means we need to either:

  1. Set void to false when active is true
  2. Add void to the partial unique index as well to exclude voided infractions

Keep these points in mind when gauging performance.

jchristgit commented 1 year ago

This is a very cool idea, Mark. Thank you for creating it.

I realise this has to be done in Django, but I would like to enforce more constraints on the database level where possible so I wrote some rough SQL here. Will also need a partial unique index to ensure only one active and inactive action per infraction. There's some things that don't seem possible to enforce on the db level e.g. disallow an inactive action if the infraction is a note or warning.

I think the recent Django versions allow you to manage this as part of the Django models themselves.

A small question: Do we need a primary key on the action table? I know it's good practice, but it's a waste of space, or is it? I don't see a need to address them directly.

jchristgit commented 1 year ago

@MarkKoz do you still want to tackle this issue or would you prefer if someone else took it over?

shtlrs commented 1 year ago

@MarkKoz I'll gladly have a go at this if you decide to not do it yourself.

MarkKoz commented 1 year ago

If you're keen to work on it, you're welcome to. However, I do recall having some strong opinions on how to approach this, so I'd like to try to stay in the loop. We should decide on a final schema design before implementation begins.

shtlrs commented 1 year ago

Sure thing.

I have some questions that come off the bat

It may be redundant since the status could be determined by doing a query for the most recent action. Furthermore, an expired infraction could be determined by comparing expired_at to the current time. However, in both cases, using the status field is faster and more convenient.

Why would we query the most recent action here ? Wouldn't it depend also on the type of action executing ?

Meaning, if I mute someone, the mute infraction is now active. As long as the scheduled pardoning of it or a manual one (and potentially a voiding) do not take place, the status won't change with updates to: the reason/expiration date.

Edit History

I'm not sure how to store edit history given the different fields with various data types. Maybe serialising them to strings is fine, since that will be done eventually anyway when the client displays the history. Each field edit could be a separate record and associated with the same action or all changes could be stored as JSON in a single record. The latter would not necessarily require a separate table.

Could you elaborate more on this ? I'd actually want to question the need of a history table ? Why not just have the actions table serve as a history that'll be queried by the bot. We could define the ordering (recent or oldest first) and then display them in a paginated embed like we do for infractions.

For the voiding: --> Do we only allow the person who issued the infraction to void it, or can anyone with a role >= mod do it ? --> Do we want to add the ability to unvoid an infraction ? Or is is an action that cannot be undone ?

Schema

I think the one scragly mentioned looks good, with some potential rewording of table/column names. I'm in favor of merging the active & void columns into a status one.

For the indexes, would it be a good option to have 2

  1. One unique index on infration type, infraction status & user
  2. A partial index for the unvoided infractions

So maybe something like

CREATE UNIQUE INDEX unique_active_infractions
ON infractions (type, user)
WHERE status = 'active';

And

CREATE INDEX unvoided_infractions
ON infractions (user)
WHERE status <> 'void';
MarkKoz commented 1 year ago

Why would we query the most recent action here ? Wouldn't it depend also on the type of action executing ?

Yes. What I likely meant is to query for the most recent action filtered by type. To check if an infraction is active, we can query for actions filtered by that infraction's ID and by action types that would affect the status. For example, if we just queried the most recent action, we may get an action for editing the description, and we'd be unable to deduce the status. Thus, we'd exclude the "edit description" action from our query (and any other irrelevant actions).

In retrospect, we likely need the status column anyway for use in indices.

I'd actually want to question the need of a history table ? Why not just have the actions table serve as a history that'll be queried by the bot.

I believe what I meant is storing the diff somehow. The infractions_activity table can tell us the history of actions that occurred, but not precisely how the data was modified (e.g. the value of a description before and after the action). This was perhaps an ambitious goal, and there wasn't an ask for this feature from moderators. If we want to explore this feature further we should make sure there is actually a demand for it.

For the voiding: --> Do we only allow the person who issued the infraction to void it, or can anyone with a role >= mod do it ? --> Do we want to add the ability to unvoid an infraction ? Or is is an action that cannot be undone ?

How do you foresee the answering to these questions affecting the schema design? I only think this would affect constraints. In any case, these are questions for the moderation team to answer. We should work with the moderation lead to gather requirements from the team.

For the indexes, would it be a good option to have 2

That makes sense. Keeping them separate makes things awkward since they sort of depend on each other, as I pointed out in a previous comment. Performance tests would be nice if you want to tackle that, but I don't think it's strictly necessary to have that data to proceed.

shtlrs commented 1 year ago

Let me start off by explictly saying that I think an expires_at column is also needed for the actions table. The current timestamp column should only state when the action was executed for historical purposes, and we obviously need to record the nex expiration date, if provided.

We could also add a jump_url like we currently have for infractions, in case someone needs more context on what happened, etc.

In retrospect, we likely need the status column anyway for use in indices.

Alright, we're both on the same page for this then.

I believe what I meant is storing the diff somehow. The infractions_activity table can tell us the history of actions that occurred, but not precisely how the data was modified (e.g. the value of a description before and after the action). This was perhaps an ambitious goal, and there wasn't an ask for this feature from moderators. If we want to explore this feature further we should make sure there is actually a demand for it.

I'm not sure that's quite needed, since the actions themselves will hold that information and can be easily understood when looked at in a timely manner. Here's how I see it

Column Action1 Action2 Action3
action create update void
actor shtlrs MarkKoz mbaruh
reason reason1 reason2 reason3
expiry exp1 exp1 N/A
timestamp t1 t2 t3

So I think it's understandablel that I first created the infraction with reason1 with an expiration of exp1 at t1 Then you only updated the reason to reason2. Then boris voided it for reason3

At the end, it's like you said, we'd need to check with the mod team. (@mbaruh I'm tagging you since you're the lead)

How do you foresee the answering to these questions affecting the schema design? I only think this would affect constraints. In any case, these are questions for the moderation team to answer. We should work with the moderation lead to gather requirements from the team.

They don't really influence the schema. It would just be a CHECK constraint like you said. So we'll discuss this with the mod team as well like you suggested.

That makes sense. Keeping them separate makes things awkward since they sort of depend on each other, as I pointed out in a previous comment. Performance tests would be nice if you want to tackle that, but I don't think it's strictly necessary to have that data to proceed.

Did you mean that it doesn't make sense ? Since later on you state that keeping them separate is awkard. In any case, this won't affect the main schema in the end, so we can start with the implementation once we agree on it then we can add necessary indexes for optimization.

mbaruh commented 1 year ago

If this is alive again I'll try to stay in the loop as well. Either way please check with me before starting to work on this so I can make sure the plans address the needs of the mod team.

mbaruh commented 1 year ago

I'm still reading, but before anything I'd like to put aside the matter of voiding. It's a new type of action that requires separate discussion (it was discussed several times in the past and has been a bit controversial), and seems out of scope for this redesign. With the right design, adding a new type of action should be fairly simple anyway.

mbaruh commented 1 year ago

A few questions that come up to me while looking at the schema:

To address both of these it might make sense to have an array field in the infraction table which lists all of the actions in sequencial order.

shtlrs commented 1 year ago
  • When the bot queries an infraction, will it automatically get all of the actions? in the current design, without changing the serializer the bot would need to query once for the infraction, and a second time for the infraction actions.

That actually just depends on what we want to do, it's a matter of changing the serializer here. I imagined it'll only query the infraction, and each infraction that has gone through edits will have a flag saying it did. And then we can just query all the actions of that infraction by its id. We could make it always bring them along of course, but we don't have where to display them and I don't think we'd want to change the UX.

  • How do you know the order of the actions? do you need to sort by timestamp in either the site or the bot?

The order is indeed determined by the timestamp. I don't think sorting on either end will make a difference since i don't image an infraction having more than 10 actions at the extreme worst case. It will basically be an option in the bot command (that default to either oldest_first or newest_first ), and we can decide on where the sorting happens later.

It's somewhat unclear to me how the action is going to be stored. For example if we set a new expiration date, we'll have an EXPIRATION_EDIT action (or some other appropriate name), but where does the action say what the new expiration is? In Scragly's comment it seems like it'll be stored in the reason. I wonder if we have a better way? We could add two nullable fields to the action for the editable fields (reason and expiration), but then we run into the issue of not being able to use the value of null for anything else, e.g making the infraction permanent. The JSON idea in the OP might be the way to go.

In my last comment (the one before this), I did say that we'd need the expires_at column to be added to the actions table. Some columns will always be filled with non null values, like the actor, timestamp and the action. The values that have been edited will be clear in the row (it's either reason or expiration date anyways) since the ones that didn't will be null. I don't see how this will make an infraction permanement as that's determined by the expiration date of the infraction, and not the action itself. So for example, the actions table will look like this

Action Actor Reason Expiration Timestamp
create shtlrs reason1 exp1 t1
edit MarkKoz null exp2 t2
edit mbaruh reason2 null t3

And the infraction would look like

Actor Reason Expiration Edited
shtlrs reason1 exp1 false
Actor Reason Expiration Edited
shtlrs reason1 exp2 true
Actor Reason Expiration Edited
shtlrs reason2 exp2 true

The actor will remain the same obviously, and the person who made the edit will be displayed in the action, not the infraction.

There have been some changes to the schema since this issue was opened, so we should of course maintain feature parity (e.g the new URL field).

Sure, that's what I had also proposed as well.

mbaruh commented 1 year ago

it's a matter of changing the serializer

It can be a matter of changing the serializer, I was wondering if we could do it differently. But alright, I understand your answer.

The values that have been edited will be clear in the row (it's either reason or expiration date anyways) since the ones that didn't will be null

This doesn't align with the example you gave afterwards. In the example each action contains the reason and expiration so far, rather than the new values they were changed to. Meaning that if you didn't edit the reason, then the action will contain the reason of the previous action. This will work, but:

  1. Somewhat duplicates information, since if you never edit the reason then every action will contain the same reason even though nothing happened to it.
  2. The values that have been edited are not clear, since you have to compare the values to the ones in the previous action to know what was edited.
  3. It seems a little murky in terms of logic. The actions table represents what happened, but you're mixing in the state of the infraction so far (and only for some of the fields, since the "active" state can also change).

If you were to go with the idea you wrote in the quote, you would come across the issue of permanent infractions. As an aside, you don't need an "edited" column, since it's already covered by the type of action.

shtlrs commented 1 year ago

This doesn't align with the example you gave afterwards. In the example each action contains the reason and expiration so far, rather than the new values they were changed to. Meaning that if you didn't edit the reason, then the action will contain the reason of the previous action.

Forgive me, but I'm not sure how you're seeing that. Here's the action table Action Actor Reason Expiration Timestamp
create shtlrs reason1 exp1 t1
edit MarkKoz null exp2 t2
edit mbaruh reason2 null t3
edit wookie reason3 exp3 t4

The values contained here are the values they were changed to.

  1. In the first row, which is the create action, reason1 and exp1 are the values first introduced by the invoker.

  2. In the second row, mark didn't change the reason, which is why it's null. However, he did change the expiration date, which is why we see it as exp2

  3. In the third rown, you only changed the reason, thus the reason2 value. And since the expiration was left untouched, it remains at null.

  4. I've added another row where wookie changes both reason and expiration, who now take the values reason3 and exp3 respectively.

  1. The values that have been edited are not clear, since you have to compare the values to the ones in the previous action to know what was edited.

Sure, might not be super clear as I had imagined that being done bot side. Which is the point Mark and I discussed. So let me ask the question again: Do we want a diff feature here (e.g. display a before & after state) ? Or just state what changed ? Because I had imagined sthg like (Excuse the poor style as I was going for speed here) image image image

  1. It seems a little murky in terms of logic. The actions table represents what happened, but you're mixing in the state of the infraction so far (and only for some of the fields, since the "active" state can also change).

Where do you see this mixing of the state ?

As an aside, you don't need an "edited" column, since it's already covered by the type of action.

How is it covered by the type of action ? Maybe I explained this wrong. But what I was trying to say is that, when listing the infractions, we don't know whether infraction X has actions related to it other than the created one (Unless we bring all actions through the serializer, and add an edited key to the returned json based on whether the length of the list of actions is > 1) . If an infraction has been just been issued, it won't be edited. Now, if someone does edit it, that column needs to change, and remain like that forever. That way we can list the actions (bot-side) when we want extra details on what happened.

mbaruh commented 1 year ago

Ok, looks like I mixed up the table of actions and the table after that which showed the infraction after the corresponding action. Forget most of what I said.

But we're going back to the reason why I said we can't use this approach: A null expiration already has a meaning. It means the infraction was made permanent. So in the table in your latest comment, the third row is invalid.

shtlrs commented 1 year ago

It's only null in the actions table, which has no influence on the expiration date in the infractions table.

It's null just to say that it hasn't changed.

Maybe it's a poor value that I've chosen , as a permanent infraction has a null value and then we won't distinguish between both.

mbaruh commented 1 year ago

Yes, that's the problem. You can't distinguish between a non-edit and an edit to permanent that way.

mbaruh commented 1 year ago

And it does have an effect on the infractions table, if what you send from the bot is a POST request of a new action.

shtlrs commented 1 year ago

And it does have an effect on the infractions table, if what you send from the bot is a POST request of a new action.

Can you elaborate? I was mostly thinking of a trigger at the db level upon insertion / update to the infraction table.

It's just a raw idea, not sure of how feasible it is for our case or whether we even want to do it that way.

shtlrs commented 1 year ago

The follwing are highlights of the different approaches Zig and I had discussed, which started here. @mbaruh will correct me if i'm wrong on any of these points.

Having null values in the actions table isn't a good idea. As that leads to a loss of context/meaning. Example: We make a permanent infraction, we make it non permanent (we set the value for expiration_date), and then make it permanent again. This will result in the last edit having a null value for the expiration_date --> We don't whether that means it hasn't been changed or whatever it became permanent.

The following are the approaches to solve this

  1. Have a column of type json (let's call it changed_fields for now) that will represent the fields that have been changed, example if we change the reason only, the value would be {reason: "new reason"}, if we change the expiration_date, the value would be {expiration_date: "new_value_here" } , or both obviously. The casting to the right types in the infraction table will happen at the endpoint level.

  2. We change the definition of what permanent means, meaning that a perma ban is one that has a timestamp of 0 (epoch), or one that's so far in the future to be even reachable.

  3. Add another column to the actions table that when not null, it'll give a meaning that the action infraction is permanent or not

  4. Change the structure of the actions table to include a before and after state, but also a correlation_id of a sort, meaning that if someone updates the expiration_date and the reason it'll result into 2 different entries in the actions table (and thus 2 different action ids), but will have the same correlation_id that'll be later used for aggregation. This implies that the whole update on the infraction and infraction_action table will be transactional.

The cleanest one in terms of implementation but also the less complicated is n°1. We will need to agree on one of the options here, or any other option we haven't mentioned/thought of before we can move forward with this.

mbaruh commented 1 year ago

Can you elaborate? I was mostly thinking of a trigger at the db level upon insertion / update to the infraction table.

We also discussed this in chat, and seemed to agree that it'll be more natural to do something like editing and deactivation by POSTing a new action, which will make the site make the right change in the infractions table.

mbaruh commented 1 year ago

I also see that the topic of whether to use an ID as the primary key in the actions table, or some combination such as infraction ID and timestamp.

One thing I envisioned we would be able to do with these changes is to resend an earlier version of the infraction. For example, if we added internal context to the infraction reason, we would be able to use !infraction resend to resend the original reason without the added context, by specifying the action which created the desirable version. If it's possible to implement conveniently without having action IDs then great, but it's something to keep in mind.

mbaruh commented 1 year ago

re https://github.com/python-discord/site/issues/752#issuecomment-1482122887 After some more conversation it might make more sense to go with option 3. It will give us more control over the information and make migrations saner.

MarkKoz commented 1 year ago

@shtlrs

For the indexes, would it be a good option to have 2

Did you mean that it doesn't make sense ? Since later on you state that keeping them separate is awkard. In any case, this won't affect the main schema in the end, so we can start with the implementation once we agree on it then we can add necessary indexes for optimization.

Sorry, I quoted the wrong thing. I meant to quote the following:

I'm in favor of merging the active & void columns into a status one.

MarkKoz commented 1 year ago

After some more conversation it might make more sense to go with option 3. It will give us more control over the information and make migrations saner.

I'm also in favour of option 3. I'm not totally against JSON, but it's something I'd have to give more thought to. Initial impression is that using JSON data can be an anti-pattern for relational database depending on how we need to use/query that data.

I'm assuming this new field will be a boolean field, e.g. is_permanent (if you had a different idea for modelling this please correct me). There are a few awkward things about it:

  1. How do we deal with updating a permanent infraction to be non-permanent? Is this field always going to be null in such case, or will we require it to be set to false? The former is awkward because then the field is only ever null or true, and the latter is awkward because the presence of an expiry timestamp in the action is enough to infer that it's no longer permanent.
  2. We'll need extra validation to disallow an expiration timestamp in the action when this new field is non-null. It's mildly confusing that our design would allow such combination to occur. Not a big deal since we already have similar validations where we disallow certain infraction types to be permanent.
mbaruh commented 1 year ago

@MarkKoz An alternative could be an expiry_edited field. It might be a bit redundant when the expiry is edited to a new date, but when the expiry is set to null, this field would then be true if the null means "permanent" or false if it was just not edited at all. But if we go with this direction it would probably be one of those three options, as awkward as they may be.

shtlrs commented 1 year ago

@mbaruh

but when the expiry is set to null, this field would then be true if the null means "permanent" or false if it was just not edited at all.

I'm not sure I follow what you mean by if the null means "permanent" here, isn't that what null always signifies for an expiry ?

How would this work when the expiry changes, but to a valide date time ? It would be confusing for it to be true since that's the value we'd use for permanent infraction changes as well if I understood you correctly. And having it to false or null is also somehow not super clear.

It seems our only problem comes from the fact that null has a meaning for the expiration_date.

How terrible of an idea would it be to simply make a before and after columns just for that field ? I know that this doesn't propagate well with any additional field that we might be adding in the future when these field could have particular meaning when assigned the value null.