ubiquity-os-marketplace / text-vector-embeddings

Similar Issues Adjustments #25

Closed 0x4007 closed 1 month ago

0x4007 commented 1 month ago

> This issue seems to be similar to the following issue(s):

This ran extremely fast.

@sshivaditya2019:

  1. Can you make sure that this includes the www. prefix so it does not "reference" or "link back" to the similar issue? I'm also curious why it thinks it's a similar issue.
  2. Also, I think it should probably only check within the same repo instead of the entire organization.
  3. Rounding the percentage seems useful. Fractions of a percent don't matter to us.
  4. Style adjustments:

[!NOTE]

Similar Issues:


Originally posted by @0x4007 in https://github.com/ubiquity-os/plugins-wishlist/issues/52#issuecomment-2386585263
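
Items 1 and 3 above could look roughly like the following. This is a minimal sketch, assuming each match carries a github.com issue URL and a similarity score between 0 and 1; the helper names `toNonLinkingUrl` and `formatSimilarity` are hypothetical and not taken from the plugin.

```typescript
// Hypothetical helpers; the names and the 0..1 score range are assumptions.

// Rewrite the host to www.github.com, as requested in item 1, so the
// posted link is not treated as a reference back to the similar issue.
function toNonLinkingUrl(issueUrl: string): string {
  return issueUrl.replace(/^https:\/\/github\.com\//, "https://www.github.com/");
}

// Round a similarity score in [0, 1] to a whole percentage (item 3).
function formatSimilarity(score: number): string {
  return Math.round(score * 100) + "%";
}

const exampleUrl = toNonLinkingUrl("https://github.com/ubiquity-os/plugin-template/issues/24");
const examplePercent = formatSimilarity(0.7734);
```

URLs that do not start with https://github.com/ pass through unchanged.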

ubiquity-os[bot] commented 1 month ago

This issue seems to be similar to the following issue(s):

0x4007 commented 1 month ago

So this should be within the repo only. It appears to be network-wide, which is not very useful. The above comment is becoming recursive.

sshivaditya2019 commented 1 month ago

@0x4007 One possible reason it's stacking could be a low match threshold, as almost all of them are between 75 and 80. Increasing the threshold to 85 might resolve this. I checked the embeddings for the issues, and they are indeed similar. We could use n-grams for text matching between similar issues, which might help fix this recursion problem.

0x4007 commented 1 month ago

> a low match threshold as almost all of them are between 75 to 80.

I think the implementation needs adjustment because, in the context of any idea ever, perhaps the issues are alike, but within the context of a repo they are quite different. Maybe we should use a log function.

Applying a logarithmic function could be useful in this context to adjust the similarity scores, especially if you're trying to emphasize differences at certain ranges. Here's why:

  1. Score Distribution: If most of your similarity scores cluster between 75 and 80, applying a log function could spread out these values more evenly. This makes subtle differences more distinguishable, helping you avoid stacking similar issues that aren't truly alike.

  2. Context Sensitivity: Logarithmic scaling can help distinguish finer details when values are high, which aligns with your concern that issues might be similar in general but differ in specific contexts within the repo.

You might experiment with different bases (e.g., natural log or log base 10) to see how it impacts your thresholds and similarity evaluation. Applying this transformation before filtering or adjusting the threshold could yield a more nuanced matching process.
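
One way to check whether a log transform actually separates scores in the 75 to 80 band is to compare the gap before and after. The sketch below assumes scores are rescaled to (0, 1); `negLogComplement` is an assumed alternative that is not mentioned in the thread. On this range the plain natural log widens the gap only slightly (its slope 1/s is at most about 1.33), while -ln(1 - s) widens it by at least a factor of 4.

```typescript
// Sketch only: scores are assumed to be cosine similarities in (0, 1).
type Transform = (score: number) => number;

const plainLog: Transform = (s) => Math.log(s);
// Assumed alternative (not from the thread): stretches scores near 1.
const negLogComplement: Transform = (s) => -Math.log(1 - s);

// Gap between two scores after applying a transform.
function gap(transform: Transform, low: number, high: number): number {
  return transform(high) - transform(low);
}

const rawGap = 0.8 - 0.75;                               // 0.05
const logGap = gap(plainLog, 0.75, 0.8);                 // about 0.065
const complementGap = gap(negLogComplement, 0.75, 0.8);  // about 0.223
```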

sshivaditya2019 commented 1 month ago

I have implemented cosine similarity followed by edit distance ranking now. Haven't done QA yet.
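
A minimal sketch of what "cosine similarity followed by edit distance ranking" could look like. The `Candidate` shape, the function names, and the choice to compare titles are assumptions for illustration, not the plugin's actual code; the 0.85 default mirrors the threshold suggested earlier in the thread.

```typescript
// Sketch under assumptions: shape, names, and threshold are hypothetical.
interface Candidate {
  title: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Levenshtein distance computed with a rolling row.
function editDistance(s: string, t: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= t.length; j++) previous.push(j);
  for (let i = 1; i <= s.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      current[j] = Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost);
    }
    previous = current;
  }
  return previous[t.length];
}

// Keep candidates above the cosine threshold, then order the survivors
// by edit distance between titles (closest text first).
function rankSimilar(query: Candidate, candidates: Candidate[], threshold = 0.85): Candidate[] {
  return candidates
    .filter((c) => cosineSimilarity(query.embedding, c.embedding) >= threshold)
    .sort((x, y) => editDistance(query.title, x.title) - editDistance(query.title, y.title));
}
```

The embedding filter decides what counts as a match at all; the edit distance only reorders the matches, so a textually different issue with a close embedding is kept but ranked lower.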

ubiquity-os[bot] commented 1 month ago

! Fetching all pull requests failed!

ubiquity-os[bot] commented 1 month ago

! Fetching all pull requests failed!

sshivaditya2019 commented 1 month ago

/start

gentlementlegen commented 1 month ago

/start

ubiquity-os[bot] commented 1 month ago

! You have reached your max task limit. Please close out some tasks before assigning new ones.

gentlementlegen commented 1 month ago

The fact that /start didn't respond is due to https://github.com/ubiquity-os/ubiquity-os-kernel/issues/120. However, the cause of the failure to retrieve pull requests is unknown, so we should have a look.


In response to this, I enabled observability on Cloudflare and improved the logs. Let's see if that happens again.

ubiquity-os[bot] commented 1 month ago

@sshivaditya2019 the deadline is at Wed, Oct 2, 3:52 AM UTC

0x4007 commented 1 month ago

@gentlementlegen It is also possible that KV is hitting its limits, given all the recent warnings. In that case I guess the requests would be rate limited with a 429 response.

gentlementlegen commented 1 month ago

It's a GitHub API request, so it is possible that the API got rate limited as well, but that is less likely because we are authenticated, as far as I know.

0x4007 commented 1 month ago

Had a good result here which is aligned with my vision. I am thinking of adjusting the UI/UX again.

Let's edit and append to the issue specification so that the UI is minimal. Below is the source code and at the bottom is the rendered form. Notice that for the footnotes I prefix all the numbers with 0 so they don't collide with normal use. As it turns out, GitHub stringifies them and then uses them as a unique key, so it actually doesn't matter what we use for the ID (numbers, letters, emoji) as long as they match in the footnote.

###### Similar [^01^]

[^01^]: [Near Instant GitHub Actions Cold Boot Times](https://github.com/ubiquity-os/plugin-template/issues/24) 77%

Similar [^01^]

[^01^]: Near Instant GitHub Actions Cold Boot Times 77%
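
The footnote layout above could be generated along these lines. This is a sketch only: the `SimilarIssue` shape and the function name `buildSimilarFootnotes` are hypothetical, and only the output format is taken from the source shown in the comment.

```typescript
// Hypothetical shape and name; only the output format comes from the thread.
interface SimilarIssue {
  title: string;
  url: string;
  similarity: number; // assumed to be in [0, 1]
}

function buildSimilarFootnotes(issues: SimilarIssue[]): string {
  // Zero-prefixed IDs ("01", "02", ...) so they do not collide with ordinary footnotes.
  const ids = issues.map((_, index) => (index + 1 < 10 ? "0" : "") + (index + 1));
  const references = ids.map((id) => "[^" + id + "^]").join(" ");
  const definitions = issues.map(
    (issue, index) =>
      "[^" + ids[index] + "^]: [" + issue.title + "](" + issue.url + ") " +
      Math.round(issue.similarity * 100) + "%"
  );
  return ["###### Similar " + references, ""].concat(definitions).join("\n");
}
```

Appending the returned string to the issue body, rather than posting it as a new comment, matches the "append to the specification" idea described above.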

sshivaditya2019 commented 1 month ago

I think this is an older version of the plugin. The newer plugin has an updated UI.

0x4007 commented 1 month ago

I know, I'm saying that we should actually do it differently. We should append this to the specification because it will look a lot cleaner.

ubiquity-os[bot] commented 1 month ago

 [ 78.9625 WXDAI ] 

@sshivaditya2019
Contributions Overview
View | Contribution | Count | Reward
Issue | Task | 1 | 75
Issue | Comment | 3 | 3.9625
Review | Comment | 6 | 0
Conversation Incentives
Comment | Formatting | Relevance | Reward
@0x4007 One possible reason it's stacking could be a low match t…
3.29
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 61
  wordValue: 0.1
  result: 3.29
Relevance: 0.85 | Reward: 2.7965
I have implemented cosine similarity followed by edit distance r…
1.06
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 16
  wordValue: 0.1
  result: 1.06
Relevance: 0.7 | Reward: 0.742
I think this a older version of the plugin. The newer plugin has…
1.06
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 16
  wordValue: 0.1
  result: 1.06
Relevance: 0.4 | Reward: 0.424
Resolves #25 - Developed a new similarity search function, use…
1.5
content:
  content:
    p:
      score: 0
      elementCount: 1
    ul:
      score: 1
      elementCount: 1
    li:
      score: 0.5
      elementCount: 1
  result: 1.5
regex:
  wordCount: 14
  wordValue: 0
  result: 0
Relevance: 0.7 | Reward: -
QA Using Euclidean Distance Using 90% similarity for match and …
20.37
content:
  content:
    p:
      score: 0
      elementCount: 5
    ul:
      score: 1
      elementCount: 1
    li:
      score: 0.5
      elementCount: 3
    a:
      score: 5
      elementCount: 3
  result: 17.5
regex:
  wordCount: 23
  wordValue: 0.2
  result: 2.87
Relevance: 0.6 | Reward: -
It represents the straight-line distance between any two points …
11.29
content:
  content:
    p:
      score: 0
      elementCount: 2
  result: 0
regex:
  wordCount: 115
  wordValue: 0.2
  result: 11.29
Relevance: 0.5 | Reward: -
I experimented with Manhattan distance (also known as Taxi Cab D…
13.2
content:
  content:
    p:
      score: 0
      elementCount: 2
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 79
  wordValue: 0.2
  result: 8.2
Relevance: 0.5 | Reward: -
Updated the new UI for similar issues message. It would not crea…
8.7
content:
  content:
    p:
      score: 0
      elementCount: 2
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 31
  wordValue: 0.2
  result: 3.7
Relevance: 0.8 | Reward: -
Sorry, I meant to say that it wouldn’t create a comment; instead…
2.66
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 21
  wordValue: 0.2
  result: 2.66
Relevance: 0.3 | Reward: -

 [ 68.5665 WXDAI ] 

@0x4007
Contributions Overview
View | Contribution | Count | Reward
Issue | Specification | 1 | 22.23
Issue | Comment | 5 | 37.5955
Review | Comment | 10 | 8.741
Conversation Incentives
Comment | Formatting | Relevance | Reward
> This issue seems to be similar to the following issue(s):…
7.41
content:
  content:
    p:
      score: 0
      elementCount: 7
    ol:
      score: 1
      elementCount: 1
    li:
      score: 0.5
      elementCount: 4
    hr:
      score: 0
      elementCount: 2
    em:
      score: 0
      elementCount: 1
  result: 3
regex:
  wordCount: 86
  wordValue: 0.1
  result: 4.41
Relevance: 1 | Reward: 22.23
So this should be within repo only, it appears to be network wid…
2.98
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 24
  wordValue: 0.2
  result: 2.98
Relevance: 0.85 | Reward: 2.533
I think the implementation needs adjustment because in the conte…
4.21
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 36
  wordValue: 0.2
  result: 4.21
Relevance: 0.85 | Reward: 3.5785
@gentlementlegen It is also possible that KV is hitting its limi…
3.4
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 28
  wordValue: 0.2
  result: 3.4
Relevance: 0.7 | Reward: 2.38
Had a good [result here](https://github.com/ubiquity-os/plugin-t…
22.37
content:
  content:
    p:
      score: 0
      elementCount: 3
    a:
      score: 5
      elementCount: 2
    pre:
      score: 0
      elementCount: 1
    h6:
      score: 1
      elementCount: 1
  result: 11
regex:
  wordCount: 116
  wordValue: 0.2
  result: 11.37
Relevance: 0.9 | Reward: 21.233
I know, I'm saying that we should actually do it differently. We…
8.19
content:
  content:
    p:
      score: 0
      elementCount: 1
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 26
  wordValue: 0.2
  result: 3.19
Relevance: 0.9 | Reward: 7.871
```suggestionconst modifiedUrl = issue.node.…
0.25
content:
  content:
    pre:
      score: 0
      elementCount: 1
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 3
  wordValue: 0.1
  result: 0.25
Relevance: 0.9 | Reward: 0.225
```suggestionconst body = "\n###### Similar " + …
0
content:
  content:
    pre:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 0
  wordValue: 0.1
  result: 0
Relevance: 0.8 | Reward: -
Never shorten words in identifiers to reduce cognitive overhead.…
1.06
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 16
  wordValue: 0.1
  result: 1.06
Relevance: 0.7 | Reward: 0.742
```suggestionconst footnoteLinks = [...Array(++f…
0
content:
  content:
    pre:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 0
  wordValue: 0.1
  result: 0
Relevance: 0.8 | Reward: -
```suggestion```
0
content:
  content:
    pre:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 0
  wordValue: 0.1
  result: 0
Relevance: 0.4 | Reward: -
```suggestionlet finalIndex = 0;```
0
content:
  content:
    pre:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 0
  wordValue: 0.1
  result: 0
Relevance: 0.6 | Reward: -
Why did you use Euclidean Distance? I don't have a lot of deep …
8.29
content:
  content:
    p:
      score: 0
      elementCount: 4
    a:
      score: 5
      elementCount: 1
    hr:
      score: 0
      elementCount: 1
  result: 5
regex:
  wordCount: 61
  wordValue: 0.1
  result: 3.29
Relevance: 0.6 | Reward: 6.974
4o:
0.1
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 1
  wordValue: 0.1
  result: 0.1
Relevance: 0.1 | Reward: 0.01
What does that mean? I can edit anybody's specification with col…
1.59
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 26
  wordValue: 0.1
  result: 1.59
Relevance: 0.4 | Reward: 0.636
I made changes here but why is there no build CI?
0.77
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 11
  wordValue: 0.1
  result: 0.77
Relevance: 0.2 | Reward: 0.154

 [ 3.596 WXDAI ] 

@gentlementlegen
Contributions Overview
View | Contribution | Count | Reward
Issue | Comment | 2 | 3.596
Conversation Incentives
Comment | Formatting | Relevance | Reward
So the fact that `/start` didn't respond is due to https…
2.92
content:
  content:
    p:
      score: 0
      elementCount: 2
    hr:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 53
  wordValue: 0.1
  result: 2.92
Relevance: 0.8 | Reward: 2.336
It's a GitHub API request so it is possible that the API got rat…
1.8
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 30
  wordValue: 0.1
  result: 1.8
Relevance: 0.7 | Reward: 1.26

0x4007 commented 1 month ago

Ah, this is still using the bad conversation rewards algorithm. It would be nice to push the new algorithm to main, @gentlementlegen.

gentlementlegen commented 1 month ago

@0x4007 It seems to use the updated version? What is wrong with it?

0x4007 commented 1 month ago

https://github.com/ubiquity-os-marketplace/generate-vector-embeddings/issues/25#issuecomment-2388870939 caught my eye as being too high but I'm reviewing the statistics:

content:
  content: # strange there's content.content
    p:
      score: 0
      elementCount: 3
    a:
      score: 5 # looks like it's counting the footnotes as links. I only see three related to the footnotes (these should be hardcoded to be removed), but there is one unaccounted for that I can't find?
      elementCount: 2
    pre:
      score: 0
      elementCount: 1
    h6:
      score: 1
      elementCount: 1
  result: 11

regex:
  wordCount: 116
  wordValue: 0.2
  result: 11.37
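
For reference, the per-comment figures in the reward report above are consistent with the formatting result being added to the word result after the word result is scaled by relevance. That formula is inferred from the numbers in this thread, not confirmed from the rewards plugin's source, and `commentReward` is a hypothetical name.

```typescript
// Inferred from the reported figures; not taken from the plugin source.
function commentReward(formattingResult: number, wordResult: number, relevance: number): number {
  return formattingResult + wordResult * relevance;
}

// The comment under review: content.result 11, regex.result 11.37, relevance 0.9.
const reviewedReward = commentReward(11, 11.37, 0.9); // close to the reported 21.233
```

Under this reading, the two links alone contribute 10 of the 11 formatting points, which is why miscounted footnote links inflate the total.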

I think we have an unaddressed scenario in dealing with footnotes, so we'll need to make a new task. I also have a feeling that it is counting words within the code block, which we should not do. The analytics overview should indicate when a tag's words are being ignored. It's also strange to me that it parsed the block as pre instead of code; perhaps it's because I didn't include the syntax highlighting header?

Regarding config, I think wordValue should probably default to 0.1, including for the author. Not sure why it's 0.2!

  1. Ignore links related to footnotes
  2. Do not include footnotes in word count credit
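
The two rules above, plus the code block concern, could be prototyped on the raw markdown before any scoring. This is a rough sketch: the regular expressions and the name `countCreditableWords` are assumptions, not the rewards plugin's actual parser, which works on rendered HTML elements.

```typescript
// Sketch only: regexes and function name are assumptions.
function countCreditableWords(markdown: string): number {
  const stripped = markdown
    // Drop fenced code blocks (three backticks) so code is not counted as prose.
    .replace(/`{3}[\s\S]*?`{3}/g, " ")
    // Drop footnote definitions such as "[^01^]: some text".
    .replace(/^\[\^[^\]]+\]:.*$/gm, " ")
    // Drop inline footnote references such as "[^01^]".
    .replace(/\[\^[^\]]+\]/g, " ");
  const words = stripped.match(/[A-Za-z0-9']+/g);
  return words ? words.length : 0;
}
```

Because the footnote definitions are removed as whole lines, the links inside them disappear as well, which would cover rule 1 if link counting ran on the same stripped text.
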
0x4007 commented 1 month ago

https://github.com/Dicklesworthstone/fast_vector_similarity some interesting algorithms that are relevant @sshivaditya2019