ubiquity / ubiquibot

Putting the 'A' in 'DAO'
https://github.com/marketplace/ubiquibot
MIT License
17 stars 59 forks source link

Database Schema Design: October 2023 #787

Open 0x4007 opened 11 months ago

0x4007 commented 11 months ago

Overview

Schema

my_modified_backup.sql.zip

supabase com_dashboard_project_wfzpewmlyiozupulbuur_database_schemas (1)

access

Access control now has a column for JSON arrays, so that any project can set access control for fully customizable labels.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
user_id integer NOT NULL
location_id integer NOT NULL
multiplier smallint DEFAULT '1'::smallint NOT NULL
labels json

credits

Permits are optional. Perhaps we can assume XP calculations for credits without permits. Alternatively we need to think about other ways to compensate. For Gnosis Safe "permits", what data can do store instead of permits? rfc @rndquu

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
updated timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
amount numeric NOT NULL
permit_id integer

debits

No remarks

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
amount numeric NOT NULL
location_id integer

location

Represents where the event occurred. org/repo/issue/comment are all optional fields, because I anticipate that we will track only the most granular level of detail. For example, if we are tracking a comment, we will not need to track the issue or repository.

We are using node_ids because GitHub does not allow direct lookup of resources using numerical IDs. Instead, we would have to "search" for those results.

I'm unsure if it makes the most sense to consolidate to a single column for the node_id or if we should keep them separate. I'm leaning towards keeping them separate, because it will be easier to query for specific events. For example, if we want to see all events that occurred on a specific repository, we can query for node_id_repository.

I included a URL for easy auditing while we are in development.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
updated timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
node_id_organization text
node_id_repository text
node_id_issue text
node_id_comment text
url text

logs

I did not put significant thought into this, but I figured we could use it for generic remarks/comments in the database and dump everything into log_entry.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
log_entry text
location_id integer

partners

location_id is intended to save the partner organization.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
wallet_id integer
location_id integer NOT NULL

permits

Foreign keys to recognized tokens, partners, and users (beneficiary)

Just realized but not sure if I need to store the network for the transactions as well or if we should derive from token_id.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
amount numeric
nonce numeric NOT NULL
deadline numeric NOT NULL
signature character(132) NOT NULL
token_id integer
partner_id integer
beneficiary_id integer NOT NULL
transaction character(66)
location_id integer

tokens

Tokens take a lot of space. Let's make a recognized table of tokens. Perhaps we should also store the token symbol since they are unlikely to change.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
network smallint NOT NULL
address character(42) NOT NULL

users

I am debating on if we should store all their GitHub data. We don't have any planned use for it (other than generating embeddings for their solved issues) but even that isn't included in the original user schema which just had their profile information.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
wallet_id integer

wallets

No remarks.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
address character(42)
CONSTRAINT wallets_address_check CHECK ((length(address) = 42))

settlements

No remarks.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
updated timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
user_id integer NOT NULL
location_id integer
credit_id integer
debit_id integer

Old Spec (For Reference)

Tables

I think it might make sense to do them in reverse order. Here I listed parents then children, but creating the children tables in order to link to the parent tables might be easiest for implementation.

Implementation Notes

0x4007 commented 11 months ago

Debating on this but still thinking about when we eventually make a command to allow repository maintainers to override the contributor's XP: we will need to create a new column in the settlements table. xpId or something. It will be a foreign key to an "XP Modifiers" table. There we can store all the details around that modification. For example: id, created time, updated time, amount.

We can exclude modifiers from the global leaderboard calculation (globalXP) so that its more difficult to exploit (i.e. collusion between an org and a contributor). We could even make a whistleblower incentive to make sure that people police each other for exploiting the XP system possibly.

Since it would be available in the settlements table, that means the penalties parent table can associate the user, orgId, repoId, issueId and commentURL to see exactly what happened.

wannacfuture commented 11 months ago

/start

ubiquibot[bot] commented 11 months ago

Too many assigned issues, you have reached your max of 2

seprintour commented 11 months ago

/start

ubiquibot[bot] commented 11 months ago

Deadline Fri, 22 Sep 2023 05:23:00 UTC
Registered Wallet 0x3623338046b101ecEc741De9C3594CC2176f39E5

Tips:

seprintour commented 11 months ago

@pavlovcik There was already a Permit table, guess we should use that instead..

It has all of the fields here and more

0x4007 commented 11 months ago

@pavlovcik There was already a Permit table, guess we should use that instead..

It has all of the fields here and more

Thanks for highlighting this. The only concern I have with the current implementation is that it has an issue ID associated with every permit, but I want to allow the flexibility for us to issue permits without necessarily associating them to issue IDs.

I believe that my specification allows for more flexibility, but I also do not claim to be a database architecture expert.


I also just realized that ChatGPT jumbled up some of the rows. It left some permit specific data inside of the credits table.

seprintour commented 11 months ago

I also just realized that ChatGPT jumbled up some of the rows. It left some permit specific data inside of the credits table.

Yea, network, signature and owner is permit data.

Previous permit table contains repository data and issue id to pinpoint the exact origin of the permit. Are you saying we don't need that anymore.

Because none of the tables here refers to a repository (origin) of the payment

seprintour commented 11 months ago

Rfc

0x4007 commented 11 months ago

The parent table does. https://github.com/ubiquity/ubiquibot/issues/767

seprintour commented 11 months ago

The parent table does. #767

Okay then, this issue entail just creating the tables, moving permit data to the new table. Is creating functions to add these data to DB included? @pavlovcik

0x4007 commented 11 months ago

The bot should be stable after the database is changed out so yes it includes modifying all the functions that interact with the database.

0x4007 commented 11 months ago

/ask how long do you estimate that this task would take, in hours?

ubiquibot[bot] commented 11 months ago
! Error: This model's maximum context length is 16385 tokens. However, you requested 20407 tokens (4023 in the messages, 16384 in the completion). Please reduce the length of the messages or completion.
0x4007 commented 11 months ago
! Error: This model's maximum context length is 16385 tokens. However, you requested 20407 tokens (4023 in the messages, 16384 in the completion). Please reduce the length of the messages or completion.

@Keyrxng time for compression/prioritization? Not a great first real world attempt lol.

Prioritization order:

  1. Current issue specification
  2. Linked issue specification (in order of linked, the first link taking higher priority than the next link)
  3. Current issue conversation
  4. Linked issue conversations (same ordering system)

We should use a tokenization estimator to know how much we should exclude.

seprintour commented 11 months ago

@pavlovcik I'm adding penalties to this too?

I see the penalty issue was rolled onto this

0x4007 commented 11 months ago

Most likely but let me just get a little time to refine this spec. Sorry I realized after filing the issue that there were some mistakes in my specification. I might break this up into separate tasks.

0x4007 commented 11 months ago

/ask how long do you estimate that this task would take, in hours?

ubiquibot[bot] commented 11 months ago
! Error: This model's maximum context length is 16385 tokens. However, you requested 21257 tokens (4873 in the messages, 16384 in the completion). Please reduce the length of the messages or completion.
0x4007 commented 11 months ago

@seprintour I broke it down into four subtasks

seprintour commented 11 months ago

@seprintour I broke it down into four subtasks

Okay, credit depends on permit (because permit is going to be modified and it needs to be linked to the permit id after modification)

Settlements depends on credit and debit (If its not merged in this order, the supabase migration will fail to run automatically)

0x4007 commented 11 months ago

I noted that as well on the Tables section.

seprintour commented 11 months ago
  • Debits Table Schema #809

I also think the amount for debit and credit should be a float or big integer if we are converting decimal to bigint because permit amount is not always in whole numbers

0x4007 commented 11 months ago

Lets chat under that issue instead.

Keyrxng commented 11 months ago
4023 in the messages, 16384 in the completion

Same as other shout out, you've set the output to be the maximum token amount for that model. max_tokens represents output so you gotta drop the token limit to half or 2/3.

the actual context fed into GPT was only 4k tokens

0x4007 commented 11 months ago

I set to 8000 max tokens in the org config and it silent crashes

0x4007 commented 11 months ago

supabase com_dashboard_project_wfzpewmlyiozupulbuur_database_schemas (1)


just realized that ids are useless. need node_ids and graphql api

0x4007 commented 11 months ago
image
0x4007 commented 11 months ago
Screenshot 2023-09-27 at 09 12 18
0x4007 commented 11 months ago

I finished the schema but i still need to verify the node id situation. I might need to change the ids under 'location' table to track references on github e.g. org/repo/issue/comment ids final-database.zip

image
0x4007 commented 11 months ago

access

Access control now has a column for JSON arrays, so that any project can set access control for fully customizable labels.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
user_id integer NOT NULL
location_id integer NOT NULL
multiplier smallint DEFAULT '1'::smallint NOT NULL
labels json

credits

Permits are optional. Perhaps we can assume XP calculations for credits without permits. Alternatively we need to think about other ways to compensate. For Gnosis Safe "permits", what data can do store instead of permits? rfc @rndquu

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
updated timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
amount numeric NOT NULL
permit_id integer

debits

No remarks

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
amount numeric NOT NULL
location_id integer

location

Represents where the event occurred. org/repo/issue/comment are all optional fields, because I anticipate that we will track only the most granular level of detail. For example, if we are tracking a comment, we will not need to track the issue or repository.

We are using node_ids because GitHub does not allow direct lookup of resources using numerical IDs. Instead, we would have to "search" for those results.

I'm unsure if it makes the most sense to consolidate to a single column for the node_id or if we should keep them separate. I'm leaning towards keeping them separate, because it will be easier to query for specific events. For example, if we want to see all events that occurred on a specific repository, we can query for node_id_repository.

I included a URL for easy auditing while we are in development.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
updated timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
node_id_organization text
node_id_repository text
node_id_issue text
node_id_comment text
url text

logs

I did not put significant thought into this, but I figured we could use it for generic remarks/comments in the database and dump everything into log_entry.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
log_entry text
location_id integer

partners

location_id is intended to save the partner organization.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
wallet_id integer
location_id integer NOT NULL

permits

Foreign keys to recognized tokens, partners, and users (beneficiary)

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
amount numeric
nonce numeric NOT NULL
deadline numeric NOT NULL
signature character(132) NOT NULL
token_id integer
partner_id integer
beneficiary_id integer NOT NULL
transaction character(66)
location_id integer

tokens

Tokens take a lot of space. Let's make a recognized table of tokens. Perhaps we should also store the token symbol since they are unlikely to change.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
network smallint NOT NULL
address character(42) NOT NULL

users

I am debating on if we should store all their GitHub data. We don't have any planned use for it (other than generating embeddings for their solved issues) but even that isn't included in the original user schema which just had their profile information.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
wallet_id integer

wallets

No remarks.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT now() NOT NULL
updated timestamp with time zone DEFAULT now() NOT NULL
address character(42)
CONSTRAINT wallets_address_check CHECK ((length(address) = 42))

settlements

No remarks.

Column Name Type/Constraints
id integer NOT NULL
created timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
updated timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL
user_id integer NOT NULL
location_id integer
credit_id integer
debit_id integer

supabase com_dashboard_project_wfzpewmlyiozupulbuur_database_schemas (1)

0x4007 commented 11 months ago

supabase com_dashboard_project_wfzpewmlyiozupulbuur_database_schemas (2) 1695844219.zip