Open kamaalsultan opened 1 year ago
Similar issue (Test Issue_1) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/1. Similarity is about 100%
Similar issue (Test issue_2) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/2. Similarity is about 100%
Similar issue (Test issue_3) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/3. Similarity is about 100%
Similar issue (Test issue_4) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/4. Similarity is about 100%
Similar issue (Test issue_5) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/5. Similarity is about 100%
Similar issue (TI_6) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/6. Similarity is about 100%
Similar issue (TI_7) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/7. Similarity is about 100%
Similar issue (TI_*) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/8. Similarity is about 100%
Similar issue (ti10) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/10. Similarity is about 100%
Similar issue (TI_9) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/9. Similarity is about 100%
Similar issue (t11) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/11. Similarity is about 100%
Similar issue (t12) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/12. Similarity is about 100%
Similar issue (13) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/13. Similarity is about 100%
Similar issue (Test Issue_14) found at https://github.com/ByteBallet/santa-bringyouwishes/issues/14. Similarity is about 100%
We could skip the vector database to simplify things. Here's how:
1. A new issue is posted.
2. We ask ChatGPT to extract a word list of the most "important" (i.e. unique adjectives?) words. For example, when I wanted to find this issue, I searched for the term "duplicate".
3. We search the repository for all issues with the important words.
4. We go from the highest issue number (most recent) and read the specification. If >80% confidence, stop the search and link back to it with a warning saying that it's likely to be a duplicate.

This approach might be a little more brittle (rate limits), but we won't have to worry about the database, which should make implementation and maintenance much easier.
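The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Issue`, `find_duplicate`, and `word_overlap` are hypothetical names, the keyword extractor stands in for the ChatGPT call, the search callable stands in for a repository issue search, and the word-overlap score is a toy placeholder for whatever confidence estimate the bot would really use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Issue:
    # Hypothetical minimal issue record; a real bot would use the API payload.
    number: int
    title: str
    body: str


def word_overlap(a: Issue, b: Issue) -> float:
    """Toy confidence score: Jaccard overlap of the words in two issues.

    Placeholder only; the proposal would ask ChatGPT for a confidence instead.
    """
    words_a = set((a.title + " " + a.body).lower().split())
    words_b = set((b.title + " " + b.body).lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def find_duplicate(
    new_issue: Issue,
    extract_keywords: Callable[[Issue], List[str]],
    search_issues: Callable[[List[str]], Iterable[Issue]],
    confidence: Callable[[Issue, Issue], float] = word_overlap,
    threshold: float = 0.8,
) -> Optional[Tuple[Issue, float]]:
    """Return the most recent likely duplicate and its score, or None."""
    keywords = extract_keywords(new_issue)  # step 2: "important" words
    candidates = [
        c for c in search_issues(keywords)  # step 3: keyword search
        if c.number != new_issue.number
    ]
    # Step 4: walk from the highest issue number (most recent) downwards
    # and stop at the first candidate above the threshold.
    for candidate in sorted(candidates, key=lambda c: c.number, reverse=True):
        score = confidence(new_issue, candidate)
        if score > threshold:
            return candidate, score
    return None
```

Passing the extractor, search, and scoring functions in as callables keeps the rate-limited calls (ChatGPT and the repository search) swappable, so the control flow can be exercised without network access.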
Original Specification (for reference only, do not use)
Overview
The idea is to have:
- an event handler for new issue creation that passes the issue specification to ChatGPT (asynchronously/slowly is fine)
- a cache "vector database" of issue similarity, scoped to the same repository only
- a ubiquibot-config property, issue-similarity-confidence-threshold: float

With an issue similarity confidence threshold of e.g. 0.8, the bot acts once it is 80% confident that an issue is redundant. It will then post a comment explaining that it is X% confident that this is a redundant issue and backlink the redundant issue in question.
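A minimal sketch of how the threshold property and the resulting comment might fit together. Only the `issue-similarity-confidence-threshold` key comes from the specification; the function names, the 0.8 default, the plain-dict config, and the comment wording are assumptions made for illustration.

```python
from typing import Any, Mapping, Optional

# Property name taken from the specification above.
CONFIG_KEY = "issue-similarity-confidence-threshold"


def read_threshold(config: Mapping[str, Any], default: float = 0.8) -> float:
    """Read the threshold from a config mapping (hypothetical helper).

    Assumes the config has already been parsed into a dict; the 0.8
    default mirrors the example value in the specification.
    """
    value = float(config.get(CONFIG_KEY, default))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{CONFIG_KEY} must be between 0 and 1, got {value}")
    return value


def redundancy_comment(
    confidence: float, threshold: float, duplicate_url: str
) -> Optional[str]:
    """Build the comment body, or return None when below the threshold."""
    if confidence < threshold:
        return None
    # "X% confident" wording follows the specification; exact text is assumed.
    return (
        f"I am {confidence:.0%} confident that this is a redundant issue. "
        f"See {duplicate_url}."
    )
```

For example, with a threshold of 0.8 a confidence of 0.85 produces a comment, while 0.5 produces `None` and the bot stays silent.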
Remarks
This should be broken down further, but I will put a broad Time: <1 Week for anybody particularly motivated to get started with it. I'm not sure about the relationship between ChatGPT and the vector database, but from what I understand, a vector database is good for evaluating similarity between things and ChatGPT is a good interpretation engine.
Context Duplicate