ubiquity-os-marketplace / generate-vector-embeddings

0 stars 6 forks source link

Fix: Database Schema #9

Closed sshivaditya2019 closed 1 week ago

sshivaditya2019 commented 2 weeks ago

Resolves #8

0x4007 commented 2 weeks ago

Why did you switch to main from development that doesn't seem right

sshivaditya2019 commented 2 weeks ago

Why did you switch to main from development that doesn't seem right

Fixed

sshivaditya2019 commented 1 week ago

@0x4007 Updated the schema, comments use voyageai for embeddings now.

github-actions[bot] commented 1 week ago

Unused files (4)

src/handlers/add-issue.ts, src/handlers/delete-issues.ts, src/handlers/issue-deduplication.ts, src/handlers/update-issue.ts

sshivaditya2019 commented 1 week ago

@0x4007 Could you please check the updated changes ? Have removed OpenAI adapter, corrected the syntax and updated the dimension for the embedding column ?

0x4007 commented 1 week ago

You'll need to cherry pick changes (easy to do with a git UI) and deliver to each task separately. The reason why we do this is for audits and review. When there is a problem some time in the future, the developers git blame and review the full conversation and pull for context on debugging. Combining multiple deliverables in a single pull complicates this process and is forbidden.

Make sure your CI is passing and make sure to post QA links (show it working on your own org/repos)

sshivaditya2019 commented 1 week ago

You'll need to cherry pick changes (easy to do with a git UI) and deliver to each task separately. The reason why we do this is for audits and review. When there is a problem some time in the future, the developers git blame and review the full conversation and pull for context on debugging. Combining multiple deliverables in a single pull complicates this process and is forbidden.

Make sure your CI is passing and make sure to post QA links (show it working on your own org/repos)

I have removed the changes for the other task. CI Should be passing now.

https://github.com/user-attachments/assets/b79bdf2b-58eb-4618-b9eb-ce2376dcf313

0x4007 commented 1 week ago

Where's the issue body? You should probably make another table and capture them there. Also might need to associate the comment to the issue with a foreign key

sshivaditya2019 commented 1 week ago

Where's the issue body? You should probably make another table and capture them there. Also might need to associate the comment to the issue with a foreign key

Right now issue body is in the payload object, this is jsonb, which can be queried using sql.

image

sshivaditya2019 commented 1 week ago

So to sum up, you require two tables one for comments and one for issues.

The schema for comments is id, plaintext, embedding, payloadobject, type, created_at, modified_at, issue_id and the schema for issues is id, plaintext, embedding, payloadobject, type, created_at, modified_at

Is that right ?

0x4007 commented 1 week ago

I'm bad at deciding this sort of thing. Let's go with your suggestion for now except the column should be renamed just to payload or context

sshivaditya2019 commented 1 week ago

@0x4007 Have Added Two Separate Tables as per the schema mentioned in the previous comment.

https://github.com/user-attachments/assets/ceb63573-4a88-47fb-8c7a-f26bd3adef80

0x4007 commented 1 week ago

@0x4007 Have Added Two Separate Tables as per the schema mentioned in the previous comment.

https://github.com/user-attachments/assets/ceb63573-4a88-47fb-8c7a-f26bd3adef80

Seems mostly good but I didn't see all the headers on the first table. Does it have the payload? Just try and normalize the columns as much as you can.

Otherwise not sure why you added the type column if they are separated by type per table.

sshivaditya2019 commented 1 week ago

@0x4007 Have Added Two Separate Tables as per the schema mentioned in the previous comment.

Screen.Recording.2024-09-13.at.1.01.47.AM.mov

Seems mostly good but I didn't see all the headers on the first table. Does it have the payload? Just try and normalize the columns as much as you can.

Otherwise not sure why you added the type column if they are separated by type per table.

Yes, it includes a payload. I've retained the type column to distinguish between bot comments and bot issues. However, I can remove it if needed, as implementing that feature might be beyond the scope of this issue.

0x4007 commented 1 week ago

Yes I think its unnecessary if they are separated by type on different tables.

sshivaditya2019 commented 1 week ago

Removed the type from schema. Payload is stored for both comments and issues.

https://github.com/user-attachments/assets/0a230a7e-69a5-4843-b33b-c03b1d1ce620

0x4007 commented 1 week ago

Thanks for the thorough QA. You don't need to make a new video on every change! But generally when opening a pull or making major changes a video is useful.

The last idea I have (sorry for the last second changes) is to have two columns for the text plaintext and markdown

This is so we can easily do testing in the near future to compare the performance of the plaintext and markdown versions of each comment when reasoning with the LLMs. However the new ChatGPT model o1 just came out today and is supposed to be very good at reasoning, with built-in chain-of-thought reasoning capabilities. This makes me more optimistic about working with the raw markdown source code, as it provides more context (i.e. blockquotes)

Once that is implemented, you don't need to make a QA video. Just let me know and we can merge. Do that for both tables please.

sshivaditya2019 commented 1 week ago

@0x4007 I have added markdown and plaintext column. o1 has a great reasoning capabilities, I think ChatGPT-Plus members have access to it already, probably will take a lot of time to be GA and be available on API.

0x4007 commented 1 week ago

I think ChatGPT-Plus members have access to it already, probably will take a lot of time to be GA and be available on API.

I think that only tier5 subscribers can use right now via API. I believe that we are tier4.