ubiquity-os-marketplace / generate-vector-embeddings

Issue Dedupe #6

Closed 0x4007 closed 1 week ago

0x4007 commented 2 weeks ago

On issue.created and issue.edited we should check if there are any open issues that are x% similar within the repository. There should be two configurable percentages:

  1. A warning threshold, perhaps 75% similar as a default
  2. "Match" threshold, 95% similar default.
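
A minimal sketch of how the two thresholds could be applied, assuming similarity is measured as cosine similarity between embedding vectors. All names here (`DedupeThresholds`, `cosineSimilarity`, `classifySimilarity`) are hypothetical and not taken from the plugin's code.

```typescript
// Hypothetical names throughout; the defaults mirror the percentages proposed above.
type DedupeVerdict = "match" | "warning" | "none";

interface DedupeThresholds {
  warning: number; // proposed default: 0.75 (75% similar)
  match: number; // proposed default: 0.95 (95% similar)
}

const DEFAULT_THRESHOLDS: DedupeThresholds = { warning: 0.75, match: 0.95 };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as not similar to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Map a similarity score onto the two configurable thresholds.
function classifySimilarity(
  similarity: number,
  thresholds: DedupeThresholds = DEFAULT_THRESHOLDS
): DedupeVerdict {
  if (similarity >= thresholds.match) return "match";
  if (similarity >= thresholds.warning) return "warning";
  return "none";
}
```

With the proposed defaults, a score of 0.8 would only trigger a warning, while 0.96 would count as a match.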

It isn't entirely clear to me if this should be a separate plug-in, but this does require coupling to the same vector embeddings database. Another idea is to pass in the authentication details in each vector embeddings related plugin's config.

0x4007 commented 2 weeks ago

@sshivaditya2019 rfc

sshivaditya2019 commented 2 weeks ago

This could be managed with a separate plugin if necessary. If issue_content_embeddings is enabled, we can use similarity search; otherwise, we can use other libraries like spaCy. The decision on how to design this depends on whether we are comfortable with increasing coupling between the two plugins or components.

sshivaditya2019 commented 2 weeks ago

/start

ubiquity-os[bot] commented 2 weeks ago

Deadline: Sun, Sep 15, 5:54 AM UTC
Registered Wallet: 0xDAba6e01D15Db560b88C8F426b016801f79e1F69
Tips:
  - Use /wallet 0x0000...0000 if you want to update your registered payment wallet address.
  - Be sure to open a draft pull request as soon as possible to communicate updates on your progress.
  - Be sure to provide timely updates to us when requested, or you will be automatically unassigned from the task.

0x4007 commented 2 weeks ago

Reducing coupling is preferred just as long as it doesn't make the setup overly complicated.

sshivaditya2019 commented 2 weeks ago

Is it possible to retrieve the currently active plugins from within a plugin?

0x4007 commented 2 weeks ago

> Is it possible to retrieve the currently active plugins from within a plugin?

@gentlementlegen @Keyrxng RFC

I'm assuming by parsing the current config. I'm pretty sure we have a method in our SDK for this @whilefoo do you know?

Keyrxng commented 2 weeks ago

> Is it possible to retrieve the currently active plugins from within a plugin?

Yeah, by parsing the private config file, but I'm not sure why you'd need that to dedupe issues. All plugins right now are independent of each other, if that was the concern.

I also don't understand why decoupling is a best practice for plugins: why enable plugin chaining if we should avoid it? But some plugins are going to be "coupled" and we can't help that, i.e. any embeddings-related feature.


> If issue_content_embeddings is enabled

issue_comment_embeddings is a core plugin I think (we are covering the OpenAI bill, right?), which means it'll always be active. If it's not a core plugin and/or we are not covering the bill, then the partner needs to be able to define their own storage. If this happens, you'd need to parse the config file for the supabase_key used in the embeddings plugin (there is no guarantee that the env vars are the same across partner plugins, although they'd need to have a centralized embeddings storage for us to use).

All plugins which rely on org-wide embeddings are going to need a centralized storage space per partner, or we need to improve our DB handling. Otherwise any AI feature dependent on these is going to need workarounds.


Using the current embeddings doesn't really fit unless there is a specific embedding just for the issue body (not tied to a comment). If those are specifically indexable, then similarity search across those embeddings should do the trick. We'd probably also need to include something that easily indicates it's an issue and not a PR, as I don't think PRs should be considered for deduping.


I think this task could be handled like so:

If building a new plugin:

Time estimate is a little much I think but see how it goes.

0x4007 commented 2 weeks ago

You should check the original conversation and pull instead of speculating on how it's implemented.

As I understand, each vector embedding has an ID (issue id, or comment id)

GraphQL query of contributors "closed as complete" issues' IDs

We just check those IDs' embeddings.
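
A rough sketch of that lookup, assuming each stored embedding carries the ID of the issue or comment it was generated from. The record shape (`EmbeddingRecord`, `sourceId`, `kind`) and the helper `selectByIds` are hypothetical and only illustrate the idea; the real storage schema may differ.

```typescript
// Hypothetical record shape: one embedding per issue or comment, keyed by its source ID.
interface EmbeddingRecord {
  sourceId: string; // issue id or comment id
  kind: "issue" | "comment"; // assumed metadata, not confirmed by the thread
  vector: number[];
}

// Keep only the embeddings whose source ID appears in the queried ID list.
function selectByIds(
  records: readonly EmbeddingRecord[],
  ids: readonly string[]
): EmbeddingRecord[] {
  const wanted = new Set(ids);
  return records.filter((record) => wanted.has(record.sourceId));
}

// Example: the ID list would come from a separate query (e.g. closed-as-complete issues).
const exampleRecords: EmbeddingRecord[] = [
  { sourceId: "issue-1", kind: "issue", vector: [0.1, 0.9] },
  { sourceId: "comment-7", kind: "comment", vector: [0.4, 0.6] },
  { sourceId: "issue-2", kind: "issue", vector: [0.8, 0.2] },
];
const selected = selectByIds(exampleRecords, ["issue-1", "issue-2"]);
```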

Keyrxng commented 2 weeks ago

> You should check the original conversation

> similar within the repository.

My mistake, I misread this as within the org, and I didn't think there would be any reliance on GraphQL or REST. I thought we'd cover all issues, not just closed as complete, and the embeddings would have relevant metadata to help filtering. When working with embeddings previously, such as chunking PDFs, they'd have relevant metadata for each chunk, and I was considering each of ours as one chunk.

I guess centralizing the embedding db was also sort of mentioned too so yeah my bad I'll hide the comment and go get another coffee 😂

whilefoo commented 2 weeks ago

> > Is it possible to retrieve the currently active plugins from within a plugin?
>
> @gentlementlegen @Keyrxng RFC
>
> I'm assuming by parsing the current config. I'm pretty sure we have a method in our SDK for this @whilefoo do you know?

No we don't have that method.

Since both plugins would share the same database, maybe it would be better to keep it as one plugin?

sshivaditya2019 commented 2 weeks ago

@0x4007 I think it would be better to make this an extension of issue-comment-embeddings, something that can be enabled if required.

To get the facts straight:

Please let me know if there are any errors or if further adjustments are needed.

0x4007 commented 2 weeks ago

All correct

Keyrxng commented 2 weeks ago

Should this plugin just be rebranded to embeddings-plugin, so we can keep all embedding-related features nice and compact? All related features require the same vector DB, or will require multiple vector DBs depending on how other embedding types are stored.

A rebranding makes sense because issue_comment_embeddings suggests it's only creating them, whereas embeddings-plugin sounds like it has many use cases.

Features such as issueDedupe and others could be enabled/disabled per the plugin settings according to what a partner wants from their embeddings.
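
As an illustration of per-feature toggles, here is a sketch of how such settings could be resolved with defaults. The setting names (`issueDedupe`, `enabled`, `warningThreshold`, `matchThreshold`) and the choice to default the feature to disabled are assumptions for the example, not the plugin's actual configuration schema.

```typescript
// Hypothetical settings shape for a combined embeddings plugin.
interface IssueDedupeSettings {
  enabled: boolean;
  warningThreshold: number; // fraction in (0, 1]
  matchThreshold: number; // fraction in (0, 1], not below warningThreshold
}

interface EmbeddingsPluginSettings {
  issueDedupe: IssueDedupeSettings;
}

interface PartialPluginSettings {
  issueDedupe?: Partial<IssueDedupeSettings>;
}

// Assumed defaults: feature off unless a partner opts in; thresholds from the issue spec.
const DEFAULT_SETTINGS: EmbeddingsPluginSettings = {
  issueDedupe: { enabled: false, warningThreshold: 0.75, matchThreshold: 0.95 },
};

// Merge partner-supplied settings over the defaults and sanity-check the thresholds.
function resolveSettings(input: PartialPluginSettings = {}): EmbeddingsPluginSettings {
  const issueDedupe: IssueDedupeSettings = {
    ...DEFAULT_SETTINGS.issueDedupe,
    ...input.issueDedupe,
  };
  const { warningThreshold, matchThreshold } = issueDedupe;
  if (!(warningThreshold > 0 && warningThreshold <= matchThreshold && matchThreshold <= 1)) {
    throw new Error("expected 0 < warningThreshold <= matchThreshold <= 1");
  }
  return { issueDedupe };
}
```

A partner who only wants to switch the feature on would pass `{ issueDedupe: { enabled: true } }` and inherit the default thresholds.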

0x4007 commented 2 weeks ago

It's done

sshivaditya2019 commented 2 weeks ago

I think this is blocked by #8. Once we are able to nail down a schema, this should be good to go.

0x4007 commented 1 week ago

I decided on a schema there

ubiquity-os[bot] commented 1 week ago

@sshivaditya2019, this task has been idle for a while. Please provide an update.

sshivaditya2019 commented 1 week ago

> @sshivaditya2019, this task has been idle for a while. Please provide an update.

Combining in PR #9

sshivaditya2019 commented 1 week ago

@0x4007 Does the ubiquibot-kernel support the issue.created, issue.edited and issue.deleted event emitter types? I am getting TransformDecodeCheckError: Unable to decode value as it does not match the expected schema for that.

I think this is an issue with ubiquibot-kernel, as requests are not being passed to the worker.

0x4007 commented 1 week ago

> > @sshivaditya2019, this task has been idle for a while. Please provide an update.
>
> Combining in PR #9

Pulls need to be separated.

> @0x4007 Does the ubiquibot-kernel support the issue.created, issue.edited and issue.deleted event emitter types? I am getting TransformDecodeCheckError: Unable to decode value as it does not match the expected schema for that.
>
> I think this is an issue with ubiquibot-kernel, as requests are not being passed to the worker.

Try issues plural instead. Check the type definitions.
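
For context, GitHub's webhook event for issues is named `issues`, and its actions include `opened`, `edited` and `deleted` (there is no `created` action for issues). The sketch below shows one way to guard against misspelled event names; the `issues.<action>` key format and the helper `isSupportedEvent` are assumptions for illustration, so the kernel's own type definitions remain the source of truth.

```typescript
// Assumed event keys in "<event>.<action>" form; check the kernel's type definitions.
const SUPPORTED_EVENTS = ["issues.opened", "issues.edited", "issues.deleted"] as const;

type SupportedEvent = (typeof SUPPORTED_EVENTS)[number];

// Type guard: narrows an arbitrary string to one of the supported event keys.
function isSupportedEvent(name: string): name is SupportedEvent {
  return SUPPORTED_EVENTS.some((event) => event === name);
}
```

With this guard, `issue.created` (singular, and not a real action) is rejected up front instead of failing later at schema decoding.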

ubiquity-os[bot] commented 1 week ago

[ 609.534 WXDAI ]

@sshivaditya2019
Contributions Overview
| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Task | 1 | 600 |
| Issue | Comment | 6 | 9.534 |
| Review | Comment | 12 | 0 |
Conversation Incentives
Comment Formatting Relevance Reward
This could be managed with a separate plugin if necessary. If &#…
2.79
content:
  p:
    symbols:
      \b\w+\b:
        count: 48
        multiplier: 0.1
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 1
        multiplier: 0.1
    score: 1
multiplier: 1
0.8 2.232
Is it possible to retrieve the currently active plugins from wit…
0.88
content:
  p:
    symbols:
      \b\w+\b:
        count: 13
        multiplier: 0.1
    score: 1
multiplier: 1
0.3 0.264
@0x4007 I think it would be better to make this as an extensio…
5.09
content:
  p:
    symbols:
      \b\w+\b:
        count: 76
        multiplier: 0.1
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 11
        multiplier: 0.1
    score: 1
  ul:
    symbols:
      \b\w+\b:
        count: 1
        multiplier: 0.1
    score: 1
  li:
    symbols:
      \b\w+\b:
        count: 3
        multiplier: 0.1
    score: 1
multiplier: 1
0.9 4.581
I think this blocked by #8, once we are able to nail down a sche…
1.33
content:
  p:
    symbols:
      \b\w+\b:
        count: 21
        multiplier: 0.1
    score: 1
multiplier: 1
0.6 0.798
Combining in PR #9
0.32
content:
  p:
    symbols:
      \b\w+\b:
        count: 4
        multiplier: 0.1
    score: 1
multiplier: 1
0.2 0.064
@0x4007 Does the `ubiquibot-kernel`, support `issue.…
3.19
content:
  p:
    symbols:
      \b\w+\b:
        count: 29
        multiplier: 0.1
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 23
        multiplier: 0.1
    score: 1
multiplier: 1
0.5 1.595
Resolves #6
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0
    score: 1
multiplier: 0
0.1 -
@0x4007 This function handles only the deduplication process, if…
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 43
        multiplier: 0.2
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
A cosine similarity of 0.75 appears quite close for identifying …
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 58
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
Removed.
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 1
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
Removed Labels. Labels will not be added on issue close
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 10
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
So, for both `Match` and `Warning` threshold, th…
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 16
        multiplier: 0.2
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
Added, Now it fetches the values from the `context.config …
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 8
        multiplier: 0.2
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
@0x4007 I have tried to make a few examples, let me know if hav…
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 31
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
Added They will display the cosine similarity in percentage afte…
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 30
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
95%: - [First Comment](https://github.com/sshivaditya2019/test-…
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 22
        multiplier: 0.2
    score: 1
  ul:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.2
    score: 1
  li:
    symbols:
      \b\w+\b:
        count: 4
        multiplier: 0.2
    score: 1
  a:
    symbols:
      \b\w+\b:
        count: 11
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
Fixed that, it now returns the similar issue in both `MATCH&…
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 23
        multiplier: 0.2
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.2
    score: 1
  ul:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.2
    score: 1
  li:
    symbols:
      \b\w+\b:
        count: 4
        multiplier: 0.2
    score: 1
  a:
    symbols:
      \b\w+\b:
        count: 8
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
That's the first issue of that type, so its expected to not have…
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 63
        multiplier: 0.2
    score: 1
multiplier: 0
1 -
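
The figures above are consistent with each comment's reward being its formatting score multiplied by its relevance, summed per contributor. This relationship is inferred from the numbers in this comment, not from the bot's source code, and the names below are hypothetical. A quick check against the six issue comments by @sshivaditya2019:

```typescript
// Inferred relationship (an assumption): reward = formatting score x relevance.
interface CommentScore {
  formatting: number;
  relevance: number;
}

// Round to three decimals, matching the precision shown in the report.
const round3 = (value: number): number => Math.round(value * 1000) / 1000;

function commentReward(score: CommentScore): number {
  return round3(score.formatting * score.relevance);
}

// Formatting and relevance values copied from the six issue comments listed above.
const issueCommentScores: CommentScore[] = [
  { formatting: 2.79, relevance: 0.8 },
  { formatting: 0.88, relevance: 0.3 },
  { formatting: 5.09, relevance: 0.9 },
  { formatting: 1.33, relevance: 0.6 },
  { formatting: 0.32, relevance: 0.2 },
  { formatting: 3.19, relevance: 0.5 },
];

const issueCommentTotal = round3(
  issueCommentScores.reduce((sum, score) => sum + commentReward(score), 0)
);

// Adding the 600 task reward gives the overall figure for this contributor.
const overallTotal = round3(600 + issueCommentTotal);
```

The computed values line up with the 9.534 shown for issue comments and the 609.534 WXDAI total.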

[ 41.438 WXDAI ]

@0x4007
Contributions Overview
| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Specification | 1 | 24.5 |
| Issue | Comment | 8 | 5.388 |
| Review | Comment | 12 | 11.55 |
Conversation Incentives
Comment Formatting Relevance Reward
On `issue.created` and `issue.edited` we should …
24.5
content:
  p:
    symbols:
      \b\w+\b:
        count: 24
        multiplier: 0.1
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 5
        multiplier: 0.1
    score: 5
  ol:
    symbols:
      \b\w+\b:
        count: 1
        multiplier: 0.1
    score: 0
  li:
    symbols:
      \b\w+\b:
        count: 93
        multiplier: 0.1
    score: 1
  ul:
    symbols:
      \b\w+\b:
        count: 1
        multiplier: 0.1
    score: 0
multiplier: 3
1 24.5
@sshivaditya2019 rfc
0.36
content:
  p:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.2
    score: 1
multiplier: 1
0.1 0.036
Reducing coupling is preferred just as long as it doesn't make t…
2.11
content:
  p:
    symbols:
      \b\w+\b:
        count: 16
        multiplier: 0.2
    score: 1
multiplier: 1
0.2 0.422
@gentlementlegen @Keyrxng RFC I'm assuming by parsing the curren…
3.4
content:
  p:
    symbols:
      \b\w+\b:
        count: 28
        multiplier: 0.2
    score: 1
multiplier: 1
0.3 1.02
You should check the original conversation and pull instead of s…
5.08
content:
  p:
    symbols:
      \b\w+\b:
        count: 45
        multiplier: 0.2
    score: 1
multiplier: 1
0.7 3.556
All correct
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.2
    score: 1
multiplier: 1
- -
It's done
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 3
        multiplier: 0.2
    score: 1
multiplier: 1
- -
I decided on a schema there
0
content:
  p:
    symbols:
      \b\w+\b:
        count: 6
        multiplier: 0.2
    score: 1
multiplier: 1
- -
Pulls need to be separated. Try issues plural instead. Check th…
1.77
content:
  p:
    symbols:
      \b\w+\b:
        count: 13
        multiplier: 0.2
    score: 1
multiplier: 1
0.2 0.354
- Adding labels is out of scope. Don't do that. Close it as unpl…
2.45
content:
  ul:
    symbols:
      \b\w+\b:
        count: 1
        multiplier: 0.1
    score: 1
  li:
    symbols:
      \b\w+\b:
        count: 41
        multiplier: 0.1
    score: 1
multiplier: 1
1 2.45
Cool just needs configuration and I can merge.
0.59
content:
  p:
    symbols:
      \b\w+\b:
        count: 8
        multiplier: 0.1
    score: 1
multiplier: 1
1 0.59
I'm assuming it all works. Code looks good.
0.65
content:
  p:
    symbols:
      \b\w+\b:
        count: 9
        multiplier: 0.1
    score: 1
multiplier: 1
1 0.65
Needs to comment the similar issue link
0.52
content:
  p:
    symbols:
      \b\w+\b:
        count: 7
        multiplier: 0.1
    score: 1
multiplier: 1
1 0.52
It needs to always let the user know which it thinks is similar.…
1.65
content:
  p:
    symbols:
      \b\w+\b:
        count: 27
        multiplier: 0.1
    score: 1
multiplier: 1
1 1.65
Why did you do 50%?
0.39
content:
  p:
    symbols:
      \b\w+\b:
        count: 5
        multiplier: 0.1
    score: 1
multiplier: 1
1 0.39
No labels
0.18
content:
  p:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.1
    score: 1
multiplier: 1
1 0.18
What is going on here? Reopening issues is out of scope.
0.77
content:
  p:
    symbols:
      \b\w+\b:
        count: 11
        multiplier: 0.1
    score: 1
multiplier: 1
1 0.77
These should be configurable values. Can you see how configurati…
1.06
content:
  p:
    symbols:
      \b\w+\b:
        count: 16
        multiplier: 0.1
    score: 1
multiplier: 1
1 1.06
Can you link your issue where you tested so we can see the resul…
0.94
content:
  p:
    symbols:
      \b\w+\b:
        count: 14
        multiplier: 0.1
    score: 1
multiplier: 1
1 0.94
Okay it seems like you aren't following the spec again. Needs t…
1.65
content:
  p:
    symbols:
      \b\w+\b:
        count: 27
        multiplier: 0.1
    score: 1
multiplier: 1
1 1.65
Doesn't look like it in the [first one ](https://github.com/sshi…
0.7
content:
  p:
    symbols:
      \b\w+\b:
        count: 7
        multiplier: 0.1
    score: 1
  a:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.1
    score: 1
multiplier: 1
1 0.7

[ 16.765 WXDAI ]

@Keyrxng
Contributions Overview
| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 3 | 16.765 |
Conversation Incentives
Comment Formatting Relevance Reward
Yeah by parsing the private config file but I'm not sure why you…
15.61
content:
  p:
    symbols:
      \b\w+\b:
        count: 355
        multiplier: 0.1
    score: 1
  hr:
    symbols:
      \b\w+\b:
        count: 3
        multiplier: 0.1
    score: 0
  code:
    symbols:
      \b\w+\b:
        count: 4
        multiplier: 0.1
    score: 1
  ul:
    symbols:
      \b\w+\b:
        count: 2
        multiplier: 0.1
    score: 1
  li:
    symbols:
      \b\w+\b:
        count: 5
        multiplier: 0.1
    score: 1
multiplier: 1
0.8 12.488
My mistake I misread this as within the org and I didn't think t…
4.97
content:
  p:
    symbols:
      \b\w+\b:
        count: 99
        multiplier: 0.1
    score: 1
multiplier: 1
0.2 0.994
Should this plugin just be rebranded to `embeddings-plugin&#…
4.69
content:
  p:
    symbols:
      \b\w+\b:
        count: 82
        multiplier: 0.1
    score: 1
  code:
    symbols:
      \b\w+\b:
        count: 6
        multiplier: 0.1
    score: 1
multiplier: 1
0.7 3.283

[ 0.32 WXDAI ]

@whilefoo
Contributions Overview
| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 1 | 0.32 |
Conversation Incentives
Comment Formatting Relevance Reward
No we don't have that method. Since both plugins would share th…
0.4
content:
  p:
    symbols:
      \b\w+\b:
        count: 26
        multiplier: 0.1
    score: 1
multiplier: 0.25
0.8 0.32
0x4007 commented 1 week ago

@ubiquibot/software-development can somebody install this