ubiquity-os-marketplace / conversation-rewards


Refinements #97

Open 0x4007 opened 1 month ago

0x4007 commented 1 month ago

Qualitative and quantitative analysis have unexpected results compared to how I implemented it in v1. Research, and refine.

I think we need to tweak the qualitative analysis. Somehow I got 0 relevance on my comments which didn't seem to be the case before with gpt3.5 10x samples.

Also I should be getting img credit.

Seems like there are problems with quantitative analysis as well.

Originally posted by @0x4007 in https://github.com/ubiquibot/command-start-stop/issues/14#issuecomment-2308672581

0x4007 commented 1 month ago

Made it a week to guarantee a good job. This is a core feature that needs to work at least as well as before.

0x4007 commented 3 weeks ago

Seems like a test with known samples is a good next step here.

sshivaditya2019 commented 2 weeks ago

/start

ubiquity-os[bot] commented 2 weeks ago
Deadline: Fri, Sep 13, 11:02 AM UTC
Registered Wallet: 0xDAba6e01D15Db560b88C8F426b016801f79e1F69
Tips:

  • Use /wallet 0x0000...0000 if you want to update your registered payment wallet address.
  • Be sure to open a draft pull request as soon as possible to communicate updates on your progress.
  • Be sure to provide timely updates to us when requested, or you will be automatically unassigned from the task.
sshivaditya2019 commented 2 weeks ago

@0x4007

0x4007 commented 2 weeks ago

@0x4007

  • Could you share some sample comments which previously had high relevance scores?

Unfortunately you or I would just have to manually check old completed tasks and see their rewards. None in particular come to mind, but I would pay attention to those posted by "ubiquibot" instead of "ubiquity-os" as those used an older version of conversation rewards that seemed more accurate.

  • Also, could you point me to the section of the code that gives credit for images? If it does exist.

It is under the "formatting score" or "quantitative scoring" section. You might be able to search for these keywords in the codebase. I am on mobile, so pointing to code is not feasible. @gentlementlegen perhaps you can help with this point.

gentlementlegen commented 2 weeks ago

We don't give credit for the image itself but apply a different value and multiplier based on the HTML elements, like <img/>. The configuration object is located here, and the multiplier would be applied here.
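For illustration only, that kind of per-element configuration might look roughly like the following TypeScript sketch; the property names and values here are assumptions, not the actual configuration object:

// Hypothetical sketch: each HTML element gets its own score/multiplier,
// so an <img/> earns credit through its element entry rather than a
// dedicated "image credit" feature. Names and values are illustrative.
const htmlElementScoring: Record<string, { score: number; countWords: boolean }> = {
  img: { score: 5, countWords: false },
  code: { score: 1, countWords: false },
  a: { score: 1, countWords: true },
  li: { score: 0.5, countWords: true },
};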

ubiquity-os[bot] commented 2 weeks ago

@sshivaditya2019, this task has been idle for a while. Please provide an update.

0x4007 commented 2 weeks ago

@sshivaditya2019, this task has been idle for a while. Please provide an update.

@gentlementlegen Really nice to see this finally working as expected. Except the revision hash in the metadata is undefined. This should be fixed!

sshivaditya2019 commented 1 week ago

@0x4007

I need to evaluate the relevance of GitHub contributors' comments to a specific issue specification. Specifically, I'm interested in how much each comment helps to further define the issue specification or contributes new information or research relevant to the issue. Please provide a float between 0 and 1 to represent the degree of relevance. A score of 1 indicates that the comment is entirely relevant and adds significant value to the issue, whereas a score of 0 indicates no relevance or added value. A stringified JSON is given below that contains the specification and contributors' comments. Each comment in the JSON has a unique ID and comment content.

{ "specification": "Do not show warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." }, { "id": 3, "comment": "Updating the password recovery process could help in user management." } ] }

To what degree are each of the comments in the conversation relevant and valuable to further defining the issue specification? Please reply with ONLY a JSON where each key is the comment ID given in JSON above, and the value is a float number between 0 and 1 corresponding to the comment. The float number should represent the degree of relevance and added value of the comment to the issue. The total number of properties in your JSON response should equal exactly 3.

I tried with this prompt for the models gpt-4o, gpt-3.5-turbo, and ChatGPT; almost all of the models give the same relevance values. I think the problem is that there isn't enough context. On its own, a comment might not seem relevant to the issue description and details.

I would suggest a better approach would be to reduce the temperature and top_p values; perhaps a better way would be to evaluate all the comments together in a single block instead of testing them in isolation. The following are the results I got from GPT-4o; GPT-3.5-Turbo returned the same values:

{
  "1": 0.1,
  "2": 0.2,
  "3": 0.0
}

0x4007 commented 1 week ago

I would suggest a better approach would be to reduce the temperature and top_p values,

Great idea except if temp is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives)

perhaps a better way would be to evaluate all the comments together in a single block instead of testing them in isolation.

I'm pretty sure it's implemented this way. I know for a fact in my original implementation I had them all evaluate in one shot.

sshivaditya2019 commented 1 week ago

I would suggest a better approach would be to reduce the temperature and top_p values,

Great idea except if temp is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives)

perhaps a better way would be to evaluate all the comments together in a single block instead of testing them in isolation.

I'm pretty sure it's implemented this way. I know for a fact in my original implementation I had them all evaluate in one shot.

So I tested a few examples; a temperature value of 0.2 works fine for now with the GPT-4o model. I don't think the current implementation does that: the prompt expects a { specification: issue, comments: comments } object, with the comments being of type { id: number; comment: string }[]. I can probably rewrite that part, if that's fine.
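For reference, the shape being described could be typed roughly like this; it is a sketch, the field names follow the comment above, and the call parameters are assumptions from the experiment, not the plugin's actual code:

interface CommentToEvaluate {
  id: number;
  comment: string;
}

interface RelevancePromptPayload {
  specification: string; // the issue specification / body
  comments: CommentToEvaluate[]; // all comments evaluated together in one block
}

// Assumed sampling parameters used in the experiment described above.
const evaluationParams = { model: "gpt-4o", temperature: 0.2, top_p: 1 };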

0x4007 commented 1 week ago

I would suggest a better approach would be reduce the temperature and top_p values,

Great idea except if temp is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives)

perhaps a better way would be evaluate all the comments together in a single block instead of testing them in isolation.

I'm pretty sure it's implemented this way. I know for a fact in my original implementation I had them all evaluate in one shot.

@gentlementlegen

gentlementlegen commented 1 week ago

Depends what is meant by "all together". Right now it is all together, but per user, not all comments from the whole issue / PR in one block. The original implementation was the same except that there was a batch of 10 attempts averaged.

0x4007 commented 1 week ago

I see. Given the extended context windows of the latest models, perhaps we should do it all in one shot?

gentlementlegen commented 1 week ago

If that enhances precision and gives more context for better results, that is nice; however, I wonder if we would easily burst through the max tokens doing so for long reviews.

0x4007 commented 1 week ago

Context windows are so long these days that I am pretty sure it will be fine.

sshivaditya2019 commented 1 week ago

@0x4007 I can rewrite the code for the original implementation, but a better way would be to provide an entire conversation dump and the user's conversations as well. I tried this:

I need to evaluate the relevance of GitHub contributors' comments to a specific issue specification. Specifically, I'm interested in how much each comment helps to further define the issue specification or contributes new information or research relevant to the issue. The following is the stringified dump of the entire issue and the conversation chain, with each conversation delimited by <----->

The bot shows a warning message on tasks that were opened recently, as seen here. It should only display this warning above a certain threshold, which comes from the configuration. A possible cause would be that value missing in the current configuration. If that is the case, the default threshold should be set, and definitely above 0 days. Tasks to be carried out: display a warning message when the task is above the configuration threshold do not display the warning if it is under that threshold change the configuration to accept a string representing a duration instead of a time stamp related tests <----->@Keyrxng would this be the fix for it? ubiquity/work.ubq.fi#74 (comment)<----->yeah @gentlementlegen aae3cba resolved here in #10<---->Ok cool thank you, I will close this one after we got your fix merged.<----->So the font is wrong be sure to match the styles of the old assign message exactly.<----->I triple checked I cannot see any font difference. Maybe there is a difference on mobile devices if you are using one?<---->registered wallet seems to wrap only when the warning is visible for some reason, nothing springs to mind on how to fix that<---->Use <code> not <samp><---->@0x4007 Thanks for the screenshot, it seems to be more specific to mobile for the font. Maybe a font fallback happening there.<---->Been reading through the GFM and CommonMark specs, GH docs etc but haven't been able to find any other workarounds other than <samp>. We'll just have to live with the whitespace if using <code> blocks so which of the below options is best, included <samp> for easy comparison. I can't find any other workarounds with these tags or others such as <kbd> etc.<---->If its only a problem on mobile then perhaps <samp> is the best decision!<---->Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. If its only a problem on mobile then perhaps is the best decision! Yeah it's only mobile that <samp> seems to have slightly different rendering Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. I agree it looks best but no it is somewhat different although personally I think <samp> is the way to go because it'll most commonly be seen via desktop vs mobile and I'd rather no whitespace than to accommodate the lesser used platform. Sounds like option 3 is preferred and that's how the code is right now so no changes needed in this regard<---->@gentlementlegen you want to add it into the spec/title here and I'll handle both?<---->@Keyrxng Done. Also, I guess the font issue should be carried out in a different issue most likely, if there is not one created already.

Please provide a float between 0 and 1 to represent the degree of relevance. A score of 1 indicates that the comment is entirely relevant and adds significant value to the issue, whereas a score of 0 indicates no relevance or added value. A stringified JSON is given below that contains the specification and contributors' comments. Each comment in the JSON has a unique ID and comment content.
{ "specification": "Do not show warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." }, { "id": 3, "comment": "Updating the password recovery process could help in user management." } ] } To what degree are each of the comments in the conversation relevant and valuable to further defining the issue specification? Please reply with ONLY a JSON where each key is the comment ID given in JSON above, and the value is a float number between 0 and 1 corresponding to the comment. The float number should represent the degree of relevance and added value of the comment to the issue. The total number of properties in your JSON response should equal exactly 3.

For GPT-4o:

{
  "1": 0.5,
  "2": 0.0,
  "3": 0.0
}

For GPT-3.5 Turbo:

{
  "1": 0.5,
  "2": 0.3,
  "3": 0.0
}

For Claude 3.5 Sonnet:

{
  "1": 0.2,
  "2": 0.4,
  "3": 0.0
}

I think GPT 4o is not a good model choice for this task.

0x4007 commented 1 week ago

Claude says the second comment is the most on topic, which the GPTs disagree with. That's a red flag. Also, there is a way to instruct GPT to return a JSON using headers, or some property name in the SDK. Don't specify it in the prompt. I generally think that GPT is smarter, but for massive context I've seen Claude do better on some tasks, specifically related to coding.

sshivaditya2019 commented 1 week ago

Claude says the second comment is the most on topic, which the GPTs disagree with. That's a red flag. Also, there is a way to instruct GPT to return a JSON using headers, or some property name in the SDK. Don't specify it in the prompt. I generally think that GPT is smarter, but for massive context I've seen Claude do better on some tasks, specifically related to coding.

"response_format" : {
                "type": "json_object" 
},

This will ensure it always returns valid JSON. I think type also supports json_schema, so it would return JSON objects matching a particular schema as well. I think this is supported only by GPT-4o for now.
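For example, with the official openai Node SDK this would look roughly like the following sketch; the model, prompt, and payload are placeholders, not the plugin's actual call:

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const payload = {
  specification: "Do not show warning message on tasks that were not created a long time ago",
  comments: [{ id: 1, comment: "Ok cool thank you, I will close this one after we got your fix merged." }],
};

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  temperature: 0.2,
  response_format: { type: "json_object" }, // guarantees syntactically valid JSON output
  messages: [
    { role: "system", content: "Reply with ONLY a JSON object mapping each comment ID to a relevance score between 0 and 1." },
    { role: "user", content: JSON.stringify(payload) },
  ],
});

const relevance: Record<string, number> = JSON.parse(completion.choices[0].message.content ?? "{}");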

For general tasks, OpenAI models work fine, but I think for this task, if we are not using GPT-3.5 Turbo, we should be using Claude's Sonnet or Opus models.

0x4007 commented 1 week ago

we should be using Claude's Sonnet or Opus models.

Unfortunately we need @gentlementlegen's playground feature to fine-tune the conversation rewards config and the LLM details.

They have been cooking it up for a few days now.

sshivaditya2019 commented 1 week ago

we should be using Claude's Sonnet or Opus models.

Unfortunately we need @gentlementlegen's playground feature to fine-tune the conversation rewards config and the LLM details.

They have been cooking it up for a few days now.

There are some non-LLM approaches as well, but I think they would return the same relevance scores; intrinsically, the comments are not very relevant to the issue topic or body. For example, "If it's only a problem on mobile then perhaps <samp> is the best decision!" while the topic of the issue was "Do not show warning message on tasks that were not created a long time ago."

0x4007 commented 1 week ago

Then we need to adjust our prompt

sshivaditya2019 commented 1 week ago

Possible new prompt; I have to distill it further. It works fine with GPT-4o:


Instruction: 
Go through all the comments first keep them in memory, then say "I KNOW" and read the following prompt

OUTPUT FORMAT:
{ID: CONNECTION SCORE} For Each record, based on the average value from the CONNECTION SCORE from ALL COMMENTS, TITLE and BODY, one for each comment under evaluation
Global Context:
Specification: "Do not show a warning message on tasks that were not created a long time ago."
ALL COMMENTS:
Comment ID 1: "Ok cool thank you, I will close this one after we got your fix merged."
Comment ID 2: "So the font is wrong be sure to match the styles of the old assign message exactly."
Comment ID 3: "I triple checked I cannot see any font difference. Maybe there is a difference on mobile devices if you are using one?"
Comment ID 4: "Registered wallet seems to wrap only when the warning is visible for some reason, nothing springs to mind on how to fix that."
Comment ID 5: "Use <code> not <samp>."
Comment ID 6: "@0x4007 Thanks for the screenshot, it seems to be more specific to mobile for the font. Maybe a font fallback happening there."
Comment ID 7: "Been reading through the GFM and CommonMark specs, GH docs etc but haven't been able to find any other workarounds other than <samp>. We'll just have to live with the whitespace if using <code> blocks so which of the below options is best, included <samp> for easy comparison. I can't find any other workarounds with these tags or others such as <kbd> etc."
Comment ID 8: "If it's only a problem on mobile then perhaps <samp> is the best decision!"
Comment ID 9: "Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it."
Comment ID 10: "Yeah it's only mobile that <samp> seems to have slightly different rendering Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. I agree it looks best but no it is somewhat different although personally I think <samp> is the way to go because it'll most commonly be seen via desktop vs mobile and I'd rather no whitespace than to accommodate the lesser used platform. Sounds like option 3 is preferred and that's how the code is right now so no changes needed in this regard."
Comment ID 11: "@gentlementlegen you want to add it into the spec/title here and I'll handle both?"
Comment ID 12: "@Keyrxng Done. Also, I guess the font issue should be carried out in a different issue most likely, if there is not one created already."

IMPORTANT CONTEXT:
You have now seen all the comments made by other users. Keeping them in mind, think about the ways the comments to be evaluated may be connected to them. The comments related to a comment under evaluation might come after or before it in the list of all comments, but they will be there in ALL COMMENTS. COULD BE BEFORE OR AFTER, so you have to diligently search through all the comments in ALL COMMENTS.

START EVALUATING:
Comments for Evaluation: { "specification": "Do not show a warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." }, { "id": 3, "comment": "I think this should because of a hardware bug or something in the OS." } ] }

POST EVALUATION:
THE RESULT FROM THIS SHOULD BE ONLY THE SCORES BASED ON THE FLOATING POINT VALUE CONNECTING HOW CLOSE THE COMMENT IS FROM ALL COMMENTS AND TITLE AND BODY.

From ALL COMMENTS, find the comments most relevant to each comment under evaluation and rank the top 3 in descending order, for each comment. Though you need to fill in details, in conjunction with the title and spec info and ALL COMMENTS, it should be somewhat relevant to the overall topic. The comment should be closely related to something mentioned and discussed in ALL COMMENTS and should be relevant to the central topic (TITLE AND ISSUE SPEC).

Now assign them scores, a float value ranging from 0 to 1, where 0 is spam (lowest value) and 1 is something that's very relevant (highest value). Here relevance can mean a variety of things: it could be a fix to the issue, it could be a bug in the solution, it could be a SPAM message, or it could be a comment that on its own does not carry weight but, when CHECKED IN ALL COMMENTS, may be a crucial piece of information for debugging and solving the ticket. If YOU THINK IT'S NOT RELATED to ALL COMMENTS or TITLE OR ISSUE SPEC, then give it a 0 SCORE.

OUTPUT:
RETURN ONLY A JSON with the ID and the connection score (FLOATING POINT VALUE) with ALL COMMENTS TITLE AND BODY for each comment under evaluation.  RETURN ONLY ONE CONNECTION SCORE VALUE for each comment

0x4007 commented 1 week ago

Regarding your prompt, I feel like it could be useful context for the LLM to know who is posting which comment. The issue author is most likely to remain on topic and to guide the conversation I think. We can eventually iterate the prompt to accommodate this?

Another idea to enhance the accuracy is to generate embeddings and then sort how on topic each comment is, and then finally assigning a descending floating point score to each by asking the LLM to do so.

Basically the embeddings can sort the order and then the LLM can finely score each, and it knows that it must be in descending order.

According to the embedding criteria I posted about we could also use anomaly detection and basically set them aside as off topic, and not score them.

I also feel like we could replace word count with the summarized points of comments. The idea is that it can potentially cut through the noise (verbose explanations) and credit for the salient points/ideas only. This needs more research though.
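As a rough sketch of the embeddings-first ordering idea, the ranking step could look like this; the embedding model and helper names are assumptions for illustration, not the plugin's code:

// Rank comments by cosine similarity to the spec, then hand the sorted list
// to the LLM for fine-grained (descending) scores. Illustrative only.
import OpenAI from "openai";

const openai = new OpenAI();

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b));
}

async function rankCommentsBySimilarity(specification: string, comments: string[]) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [specification, ...comments],
  });
  const [specEmbedding, ...commentEmbeddings] = data.map((item) => item.embedding);
  return comments
    .map((comment, i) => ({ comment, similarity: cosineSimilarity(specEmbedding, commentEmbeddings[i]) }))
    .sort((a, b) => b.similarity - a.similarity); // most on-topic first; outliers could be flagged as off topic
}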

sshivaditya2019 commented 1 week ago

Regarding your prompt, I feel like it could be useful context for the LLM to know who is posting which comment. The issue author is most likely to remain on topic and to guide the conversation I think. We can eventually iterate the prompt to accommodate this?

Another idea to enhance the accuracy is to generate embeddings and then sort how on topic each comment is, and then finally assigning a descending floating point score to each by asking the LLM to do so.

Basically the embeddings can sort the order and then the LLM can finely score each, and it knows that it must be in descending order.

According to the embedding criteria I posted about we could also use anomaly detection and basically set them aside as off topic, and not score them.

I also feel like we could replace word count with the summarized points of comments. The idea is that it can potentially cut through the noise (verbose explanations) and credit for the salient points/ideas only. This needs more research though.

Voyage AI has rerankers; we could try ranking each comment against (ALL COMMENTS + Title + Body) and assigning percentile-based relevance scores. I think this is one possible solution, but it does not account for verbose vs. concise explanations.
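Without tying it to any particular reranker SDK, converting raw reranker scores into percentile-based relevance scores could be sketched like this (illustrative only):

// Turn raw reranker scores (however obtained) into percentile-based
// relevance scores in [0, 1]. Not tied to any specific reranker API.
function toPercentileScores(scores: number[]): number[] {
  const sorted = [...scores].sort((a, b) => a - b);
  return scores.map((score) => sorted.filter((s) => s <= score).length / sorted.length);
}

// Example: toPercentileScores([0.91, 0.12, 0.55]) -> [1, 0.33..., 0.66...]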

ubiquity-os[bot] commented 1 week ago

@sshivaditya2019, this task has been idle for a while. Please provide an update.

gentlementlegen commented 1 week ago

I feel like some strange stuff is going on with the reminders; there was activity within the last 3.5 days on that issue. @Keyrxng rfc

sshivaditya2019 commented 1 week ago

This prompt works fine; I have tried it with multiple examples:

Instruction: 
Go through all the comments first keep them in memory, then start with the following prompt

OUTPUT FORMAT:
{ID: CONNECTION SCORE} For Each record, based on the average value from the CONNECTION SCORE from ALL COMMENTS, TITLE and BODY, one for each comment under evaluation
Global Context:
Specification: "Do not show a warning message on tasks that were not created a long time ago."
ALL COMMENTS:
Comment ID 1: "Ok cool thank you, I will close this one after we got your fix merged."
Author: @gentlementlegen

Comment ID 2: "So the font is wrong be sure to match the styles of the old assign message exactly."
Author: @0x4007

Comment ID 3: "I triple checked I cannot see any font difference. Maybe there is a difference on mobile devices if you are using one?"
Author: @gentlementlegen

Comment ID 4: "Registered wallet seems to wrap only when the warning is visible for some reason, nothing springs to mind on how to fix that."
Author: @Keyrxng

Comment ID 5: "Use <code> not <samp>."
Author: @0x4007

Comment ID 6: "@0x4007 Thanks for the screenshot, it seems to be more specific to mobile for the font. Maybe a font fallback happening there."
Author: @gentlementlegen

Comment ID 7: "Been reading through the GFM and CommonMark specs, GH docs etc but haven't been able to find any other workarounds other than <samp>. We'll just have to live with the whitespace if using <code> blocks so which of the below options is best, included <samp> for easy comparison. I can't find any other workarounds with these tags or others such as <kbd> etc."
Author: @Keyrxng

Comment ID 8: "If it's only a problem on mobile then perhaps <samp> is the best decision!"
Author: @0x4007

Comment ID 9: "Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it."
Author: @gentlementlegen

Comment ID 10: "Yeah it's only mobile that <samp> seems to have slightly different rendering Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. I agree it looks best but no it is somewhat different although personally I think <samp> is the way to go because it'll most commonly be seen via desktop vs mobile and I'd rather no whitespace than to accommodate the lesser used platform. Sounds like option 3 is preferred and that's how the code is right now so no changes needed in this regard."
Author: @Keyrxng

Comment ID 11: "@gentlementlegen you want to add it into the spec/title here and I'll handle both?"
Author: @Keyrxng

Comment ID 12: "@Keyrxng Done. Also, I guess the font issue should be carried out in a different issue most likely, if there is not one created already."
Author: @gentlementlegen

IMPORTANT CONTEXT:
You have now seen all the comments made by other users. Keeping them in mind, think about the ways the comments to be evaluated may be connected to them. The comments related to a comment under evaluation might come after or before it in the list of all comments, but they will be there in ALL COMMENTS. COULD BE BEFORE OR AFTER, so you have to diligently search through all the comments in ALL COMMENTS.

START EVALUATING:
Comments for Evaluation: {"specification":"Do not show a warning message on tasks that were not created a long time ago","comments":[{"id":1,"comment":"Ok cool thank you, I will close this one after we got your fix merged.","author":"@gentlementlegen"},{"id":2,"comment":"So the font is wrong be sure to match the styles of the old assign message exactly.","author":"@0x4007"},{"id":3,"comment":"I think this should because of a hardware bug or something in the OS.","author":"@random"}]}

POST EVALUATION:
THE RESULT FROM THIS SHOULD BE ONLY THE SCORES BASED ON THE FLOATING POINT VALUE CONNECTING HOW CLOSE THE COMMENT IS FROM ALL COMMENTS AND TITLE AND BODY.

Now assign them scores, a float value ranging from 0 to 1, where 0 is spam (lowest value) and 1 is something that's very relevant (highest value). Here relevance can mean a variety of things: it could be a fix to the issue, it could be a bug in the solution, it could be a SPAM message, or it could be a comment that on its own does not carry weight but, when CHECKED IN ALL COMMENTS, may be a crucial piece of information for debugging and solving the ticket. If YOU THINK IT'S NOT RELATED to ALL COMMENTS or TITLE OR ISSUE SPEC, then give it a 0 SCORE.

OUTPUT:
RETURN ONLY A JSON with the ID and the connection score (FLOATING POINT VALUE) with ALL COMMENTS TITLE AND BODY for each comment under evaluation.  RETURN ONLY ONE CONNECTION SCORE VALUE for each comment

It returns:

{
  "1": 0.8,
  "2": 0.6,
  "3": 0.0
}

ubiquity-os[bot] commented 1 week ago

@sshivaditya2019, this task has been idle for a while. Please provide an update.

sshivaditya2019 commented 5 days ago

@0x4007 I am trying to set this up, but process.issue.test.ts is failing for me. Is there something particular I need to set up apart from the .env?

gentlementlegen commented 5 days ago

@sshivaditya2019 Could you elaborate on your issue? Also, it would help if you open a PR as a draft so we can see what's going on.

sshivaditya2019 commented 5 days ago

@sshivaditya2019 Could you elaborate on your issue? Also, it would help if you open a PR as a draft so we can see what's going on.

In process.issue.test.ts, the tests for Should generate permits and Should generate GitHub comment are failing, even before I made any changes. Is this expected?

For the Should generate permits test, Jest is reporting an error indicating that permitUrl is missing from the processor.dump.

gentlementlegen commented 5 days ago

@sshivaditya2019 You can see in the pull request you linked that tests are passing successfully; my best guess is some problem in your local setup. Maybe you have some environment variables missing, or some configuration problems?

If the permit URL is missing it is probably because permits failed to generate. I don't know if you have more detailed errors, but as long as it works on your linked pull-request you shouldn't need to worry much about your local tests.

sshivaditya2019 commented 5 days ago

@gentlementlegen, is there any way to check relevance scores locally? Most of the tests have a mocked implementation of _evaluateComments.

gentlementlegen commented 5 days ago

To check them locally, you will need to use your own OpenAI key to get ChatGPT to run. What I do on my side is have a local Jest test that allows me to run it on any issue. Please check out this Gist; it might help you.
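Not the linked Gist, but a minimal sketch of that kind of local-only, key-gated Jest test; the model, prompt, and sample data below are placeholders, and the test is skipped when no OpenAI key is configured:

// Local-only relevance check: skipped unless OPENAI_API_KEY is set.
import OpenAI from "openai";

const maybeIt = process.env.OPENAI_API_KEY ? it : it.skip;

describe("relevance scoring (live, local only)", () => {
  maybeIt("returns a score between 0 and 1 per comment", async () => {
    const openai = new OpenAI();
    const payload = {
      specification: "Do not show warning message on tasks that were not created a long time ago",
      comments: [{ id: 1, comment: "Ok cool thank you, I will close this one after we got your fix merged." }],
    };
    const res = await openai.chat.completions.create({
      model: "gpt-4o",
      response_format: { type: "json_object" },
      messages: [
        { role: "user", content: `Score each comment's relevance to the spec from 0 to 1. Reply with ONLY JSON {id: score}. ${JSON.stringify(payload)}` },
      ],
    });
    const scores: Record<string, number> = JSON.parse(res.choices[0].message.content ?? "{}");
    for (const value of Object.values(scores)) {
      expect(value).toBeGreaterThanOrEqual(0);
      expect(value).toBeLessThanOrEqual(1);
    }
  }, 30_000);
});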

sshivaditya2019 commented 4 days ago

To check them locally, you will need to use your own OpenAI key to get ChatGPT to run. What I do on my side is have a local Jest test that allows me to run it on any issue. Please check out this Gist; it might help you.

This test was very helpful and should definitely be included in the testing suite.

gentlementlegen commented 4 days ago

The problem is that this test runs on real data, which is subject to change. What could be done is including this test but excluding it from the test run. However, there is this issue that should eventually get that covered; I didn't have time to look into it.

sshivaditya2019 commented 4 days ago

The problem is that this test runs on real data, which is subject to change. What could be done is including this test but excluding it from the test run. However, there is this issue that should eventually get that covered; I didn't have time to look into it.

Let me know if you’re not working on it right now; I can take a look at it then.

sshivaditya2019 commented 4 days ago

@0x4007 I think the issue spec for this one is a bit vague. I think the relevance issue should be fixed with the PR. The img credit is a configuration issue. But I am not sure about the outcomes for this ticket.

Is the goal for the ticket to rewrite the entire comment-evaluator-module with the embeddings and vector search? Or is it something else?

0x4007 commented 4 days ago

@0x4007 I think the issue spec for this one is a bit vague.

For vague specs, which happen occasionally, we are to share research and concerns on here. We all get credited for it.

I think the relevance issue should be fixed with the PR. The img credit is a configuration issue. But I am not sure about the outcomes for this ticket.

Is the goal for the ticket to rewrite the entire comment-evaluator-module with the embeddings and vector search? Or is it something else?

Some recent observations:

  1. We are very likely going to entirely remove the word counter feature and instead generate embeddings and understand how much value a comment adds to solving the problem.
  2. We need to change the code to emphasize crediting per HTML tag and then the meaning of the comment with embeddings instead of word count.
  3. We still need to test how far we can get regarding vector embeddings and how well it solves our problem.

sshivaditya2019 commented 4 days ago

@0x4007 I think the issue spec for this one is a bit vague.

For vague specs, which happen occasionally, we are to share research and concerns on here. We all get credited for it.

I think the relevance issue should be fixed with the PR. The img credit is a configuration issue. But I am not sure about the outcomes for this ticket. Is the goal for the ticket to rewrite the entire comment-evaluator-module with the embeddings and vector search? Or is it something else?

Some recent observations:

  1. We are very likely going to entirely remove the word counter feature and instead generate embeddings and understand how much value a comment adds to solving the problem.

@0x4007 Embeddings will not solve the problem; I think the present relevance scoring is the best technique. In my view, a better approach would be to use something like a Bag of Words model with hierarchical labeling, and assign scores according to the depth of the concept. Let me know; I can put together a small write-up on this.

  2. We need to change the code to emphasize crediting per HTML tag and then the meaning of the comment with embeddings instead of word count.

I think the present implementation focuses only on formatting. Effectively this would mean the entire formatting-evaluator-module.ts being rewritten.

  3. We still need to test how far we can get regarding vector embeddings and how well it solves our problem.

I am very skeptical about embeddings in this use case. As I mentioned before, embeddings provide local context and references, and on their own would not mean anything. I created my own script to plot and visualize the embeddings and perform PCA to extract cluster centers.

Original Comments:

image

Embeddings Plot with Comments:

image

Here, you can see three distinctive cluster centers. In the embeddings plot with comments I have added a new comment, "Something random blah blah"; as you can see, it is near a cluster center and would have high similarity in a vector search. This should not happen. My suggestion would be to use an NLP method instead of embedding-based vector search. Let me know if you want to set this up on your end; I can help you with that.

Python Script

0x4007 commented 4 days ago

My peer suggested some search-engine-results-related algorithm. I'm asking him now to clarify which one. This should help us see how on topic a comment is for the specification. We could consider adding this as one of several dimensions we evaluate the comments by.

Starting to wonder if sub-plugins are realistic, or if we should just make npm modules (or use something like git submodules).

ChatGPT is recommending this to me:

We are exploring methods to evaluate comment readability and conciseness using Flesch-Kincaid readability metrics. These formulas assess the complexity and clarity of text based on sentence length and word syllables:

  1. Flesch Reading Ease: Rates text on a scale from 0 (difficult) to 100 (easy). Scores can help determine how easily a comment can be understood.
  2. Flesch-Kincaid Grade Level: Converts readability into an estimated school grade level required to comprehend the text.

These metrics can help us identify verbose comments vs. concise, high-value inputs.

This seems a lot more interesting compared to word count but we should test.
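As a rough sketch of what a Flesch-based module could compute, consider the following; the syllable counter here is a crude vowel-group heuristic, for experimentation only, not a production metric:

// Rough Flesch Reading Ease sketch. The syllable estimate is naive,
// so treat the output as an experiment rather than a final score.
function estimateSyllables(word: string): number {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return Math.max(1, groups?.length ?? 0);
}

function fleschReadingEase(text: string): number {
  const sentences = Math.max(1, (text.match(/[.!?]+/g) ?? []).length);
  const words = text.split(/\s+/).filter(Boolean);
  const wordCount = Math.max(1, words.length);
  const syllables = words.reduce((sum, word) => sum + estimateSyllables(word), 0);
  // Standard Flesch Reading Ease formula: higher scores read more easily.
  return 206.835 - 1.015 * (wordCount / sentences) - 84.6 * (syllables / wordCount);
}

// Example usage: fleschReadingEase(commentBody) -> readability score for a comment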

The idea is that we can develop a proprietary algorithm that combines several strategies. Ideally we should make a playground where we can plug in these different modules and run tests against live GitHub issues to tweak it.

Strategy ideas:

  1. HTML elements counter
  2. Word counter ❌
  3. Flesch
  4. Something search engine algo related
  5. LLM judge (we do this now for relevance scoring)
  6. Generate concise summary and calculate compression ratio

@gentlementlegen

gentlementlegen commented 4 days ago

conversation-rewards already supports modules within itself that you can enable / disable to change the final output. You can do as many transforming modules as you want and enable / disable them through the configuration.

0x4007 commented 3 days ago

My peer got back to me regarding the search engine recommendation


TF-IDF (Term Frequency-Inverse Document Frequency) is a classic algorithm used in search and information retrieval to evaluate how important a word is to a document relative to a collection of documents (often referred to as a "corpus"). It helps identify which terms are most relevant to the context of a specific document.

In the Context of Your Goals: Evaluating GitHub Comments

Given your objective to measure the value of GitHub comments in relation to problem-solving, TF-IDF could be a useful tool to assess the relevance and informational density of individual comments with respect to the overall issue or conversation.

Here's how TF-IDF might be applied in your scenario:

1. How TF-IDF Works

$$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

$$\text{IDF}(t) = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

2. Applying TF-IDF to Evaluate Comment Relevance:

3. Enhancing Your Continuum-Based Scoring System:

Practical Steps for Implementation:

  1. Preprocess the Data: Tokenize the comments and issue descriptions, remove stop words, and normalize the text (e.g., lowercase conversion).
  2. Calculate TF-IDF: Apply TF-IDF to generate relevance scores for each comment.
  3. Score Aggregation: Aggregate the TF-IDF scores to quantify each comment’s overall contribution to solving the issue.

Benefits for Your Goals:

Using TF-IDF will give you an effective way to measure the informational value and relevance of comments, aligning well with your goal of continuum-based scoring. Let me know if you’d like to dive deeper into any specific aspect of this approach!
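For concreteness, a minimal TF-IDF sketch over the spec plus comments, each treated as a "document"; tokenization is naive and this is illustrative only, not the proposed implementation:

// Compute per-document TF-IDF weights for the spec and each comment.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function tfidfScores(documents: string[]): Map<string, number>[] {
  const tokenized = documents.map(tokenize);
  const docCount = tokenized.length;
  const docFrequency = new Map<string, number>();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      docFrequency.set(term, (docFrequency.get(term) ?? 0) + 1);
    }
  }
  return tokenized.map((tokens) => {
    const scores = new Map<string, number>();
    for (const term of new Set(tokens)) {
      const tf = tokens.filter((t) => t === term).length / tokens.length;
      const idf = Math.log(docCount / (docFrequency.get(term) ?? 1));
      scores.set(term, tf * idf);
    }
    return scores;
  });
}

// Example usage: tfidfScores([issueSpec, ...comments]) -> one term-weight map per document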

sshivaditya2019 commented 2 days ago

My peer got back to me regarding the search engine recommendation

TF-IDF (Term Frequency-Inverse Document Frequency) is a classic algorithm used in search and information retrieval to evaluate how important a word is to a document relative to a collection of documents (often referred to as a "corpus"). It helps identify which terms are most relevant to the context of a specific document.

In this case, I am not sure how this is relevant. Here, we are assigning scores within a comment-thread context, and the comments are mutually exclusive from other comment threads in terms of assigning relevance.

In the Context of Your Goals: Evaluating GitHub Comments

Given your objective to measure the value of GitHub comments in relation to problem-solving, TF-IDF could be a useful tool to assess the relevance and informational density of individual comments with respect to the overall issue or conversation.

Here's how TF-IDF might be applied in your scenario:

1. How TF-IDF Works

  • Term Frequency (TF): Measures how frequently a term appears in a comment. Higher frequencies suggest that the term is more important within that comment.

2. Applying TF-IDF to Evaluate Comment Relevance:

  • Identifying Key Terms in Comments:

    • TF-IDF will highlight terms in each comment that are not just common but are also distinctive in the context of the overall issue. This helps identify comments with unique and relevant insights.
  • Assessing Relevance to the Issue Description:

    • By comparing the TF-IDF scores of words in a comment to those in the issue description, you can measure how closely a comment aligns with the core problem. Comments with terms that have high TF-IDF relevance scores relative to the issue description are more likely to be valuable.

Just for context, TF-IDF is a transformation technique; it would give out a real-valued vector, to which we then apply some distance metric like cosine similarity. This is almost the same as embedding-based vector search.

  • Filtering Out Low-Value Contributions:

    • Comments that consist primarily of high-TF but low-IDF terms (e.g., generic phrases or filler words) can be identified as less valuable. This is particularly useful for identifying verbose comments from junior developers that lack unique insights.

These are not fixed. In the linked issue spec, comments were relevant to the topic but were flagged irrelevant. This will not be an issue, as we would either way implement stemming or lemmatization of the input phrases and tag for POS (parts of speech).

3. Enhancing Your Continuum-Based Scoring System:

  • Weighted Relevance Scores: Use TF-IDF scores to assign a relevance weight to each comment, allowing you to rank comments on a continuum of importance rather than using binary relevance.
  • Combining with Other Metrics: Integrate TF-IDF scores with other continuous metrics (e.g., semantic similarity, readability) to create a comprehensive scoring system that reflects both the specificity and value of a comment.

Practical Steps for Implementation:

  1. Preprocess the Data: Tokenize the comments and issue descriptions, remove stop words, and normalize the text (e.g., lowercase conversion).
  2. Calculate TF-IDF: Apply TF-IDF to generate relevance scores for each comment.
  3. Score Aggregation: Aggregate the TF-IDF scores to quantify each comment’s overall contribution to solving the issue.

Benefits for Your Goals:

  • Objective Measurement of Relevance: TF-IDF provides a quantitative way to gauge how closely comments relate to the problem at hand.
  • Filtering Out Noise: Helps distinguish between high-value contributions and generic or off-topic comments.

I don't think this is possible. We would need some dictionary or something like WordNet to assign values to words. This would not cater to specific words in comments, e.g. "bug" or "fix"; these words on their own will not have any value and may appear off topic.

  • Complementary to Other Techniques: Can be combined with PageRank, readability scores, or semantic similarity measures for a more holistic evaluation.

Using TF-IDF will give you an effective way to measure the informational value and relevance of comments, aligning well with your goal of continuum-based scoring. Let me know if you’d like to dive deeper into any specific aspect of this approach!

TF-IDF is a good starting point, but I don't believe it suits this problem well. We need to assign scores or relevances to comments, and since no two comment threads will have the same set of high TF-IDF words, this could penalize terms that are highly relevant to the context individually but not as a whole across multiple comment threads.

sshivaditya2019 commented 2 days ago

I came up with a new approach: categorize comments into topic bins (topics would be assigned using an LLM/ML model). We can then perform a similarity search using the topic, issue_title, and issue_body to generate a Topic-Comment-Alignment score for each comment.

Next, we can assess user engagement for each comment based on signals such as reactions and replies. The weight assigned to different types of engagement can vary depending on the role (e.g., author, collaborator, etc.).

Additionally, we’ll incorporate a credibility score to evaluate whether a comment was made by a verified member of the organization, a regular collaborator, or an unknown user.

The overall score could be calculated using the following formula:

$$ \text{Final Score} = \frac{(TCA \times E + C)}{W} $$

Where:

This will allow us to effectively evaluate the quality and relevance of comments.
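As a quick worked sketch of that formula: TCA, E, and C follow the descriptions above (topic-comment alignment, engagement, credibility), while treating W as a normalizing weight is an assumption, since it is not spelled out in the comment.

// Sketch of the proposed Final Score = (TCA * E + C) / W.
// TCA: topic-comment alignment, E: engagement, C: credibility;
// W is assumed here to be a normalizing weight.
function finalScore(tca: number, engagement: number, credibility: number, weight: number): number {
  return (tca * engagement + credibility) / weight;
}

// Example: finalScore(0.8, 1.5, 0.5, 2) === (0.8 * 1.5 + 0.5) / 2 === 0.85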

0x4007 commented 2 days ago

Credibility score we can adjust per author, no matter their position/relationship with the organization.

The spec author generally has the clearest vision on the task, so if what is commented aligns with them (i.e. they agree) then more credit should be offered (this of course is only in the context of funded tasks).

We usually have a very limited number of reactions, but I think reactions from the author and core team could be a positive indicator.

If we can attribute block quotes, that could be interesting. The problem there is that I generally comment from mobile and block quotes can be inconvenient, but sometimes I make sure to use them in order to enhance clarity. I would be more curious to experiment with attributing block quote crediting.

sshivaditya2019 commented 1 day ago

Credibility score we can adjust per author, no matter their position/relationship with the organization.

The spec author generally has the clearest vision on the task, so if what is commented aligns with them (i.e. they agree) then more credit should be offered (this of course is only in the context of funded tasks).

We usually have a very limited number of reactions, but I think reactions from the author and core team could be a positive indicator.

If we can attribute block quotes, that could be interesting. The problem there is that I generally comment from mobile and block quotes can be inconvenient, but sometimes I make sure to use them in order to enhance clarity. I would be more curious to experiment with attributing block quote crediting.

Otherwise, are the method and scoring criteria fine? @Keyrxng rfc, I think this should be good enough.

ubiquity-os[bot] commented 1 day ago

@sshivaditya2019, this task has been idle for a while. Please provide an update.