ubiquibot / conversation-rewards


Refinements #97

Open · 0x4007 opened 3 weeks ago

0x4007 commented 3 weeks ago

Qualitative and quantitative analysis are producing unexpected results compared to how I implemented them in v1. Research and refine.

I think we need to tweak the qualitative analysis. Somehow I got 0 relevance on my comments, which didn't seem to be the case before with GPT-3.5 and 10x samples.

Also, I should be getting image credit.

Seems like there are problems with the quantitative analysis as well.

Originally posted by @0x4007 in https://github.com/ubiquibot/command-start-stop/issues/14#issuecomment-2308672581

0x4007 commented 3 weeks ago

Made it a week to guarantee a good job. This is a core feature that needs to work at least as well as before.

0x4007 commented 2 weeks ago

Seems like a test with known samples is a good next step here.

sshivaditya2019 commented 1 week ago

/start

ubiquity-os[bot] commented 1 week ago

Deadline: Fri, Sep 13, 11:02 AM UTC
Registered Wallet: 0xDAba6e01D15Db560b88C8F426b016801f79e1F69

Tips:

  • Use /wallet 0x0000...0000 if you want to update your registered payment wallet address.
  • Be sure to open a draft pull request as soon as possible to communicate updates on your progress.
  • Be sure to provide timely updates to us when requested, or you will be automatically unassigned from the task.
sshivaditya2019 commented 1 week ago

@0x4007

0x4007 commented 1 week ago

@0x4007

  • Could you share some sample comments which previously had high relevance scores?

Unfortunately you or I would just have to manually check old completed tasks and see their rewards. None in particular come to mind, but I would pay attention to those posted by "ubiquibot" instead of "ubiquity-os" as those used an older version of conversation rewards that seemed more accurate.

  • Also, could you point me to the section of the code that gives credit for images, if it exists?

It is under the "formatting score" or "quantitative scoring" section. You might be able to search for these keywords in the codebase. I am on mobile, so pointing to code is not feasible. @gentlementlegen perhaps you can help with this point.

gentlementlegen commented 1 week ago

We don't give credit for the image itself, but apply a different value and multiplier based on the HTML elements, like <img/>. The configuration object is located here, and the multiplier would be applied here.
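
Purely as an illustration (not the plugin's actual schema; all names below are hypothetical), a minimal sketch of how per-element values and a role multiplier could combine into a formatting score:

interface HtmlElementScore {
  score: number;       // base value credited per occurrence of the element
  countWords: boolean; // whether the element's inner text also counts toward word credit
}

// Hypothetical per-element configuration: an <img/> contributes formatting credit
// even though the image content itself is never analyzed.
const htmlElementScores: Record<string, HtmlElementScore> = {
  img: { score: 5, countWords: false },
  code: { score: 1, countWords: true },
  a: { score: 1, countWords: true },
};

// The formatting score of a comment would then be the sum of element scores
// multiplied by a role-based multiplier (issuer, assignee, collaborator, ...).
function formattingScore(elementCounts: Record<string, number>, roleMultiplier: number): number {
  let total = 0;
  for (const [tag, count] of Object.entries(elementCounts)) {
    total += (htmlElementScores[tag]?.score ?? 0) * count;
  }
  return total * roleMultiplier;
}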

ubiquity-os[bot] commented 5 days ago

@sshivaditya2019, this task has been idle for a while. Please provide an update.

0x4007 commented 5 days ago

> @sshivaditya2019, this task has been idle for a while. Please provide an update.

@gentlementlegen Really nice to see this finally working as expected. Except the revision hash in the metadata is undefined. This should be fixed!

sshivaditya2019 commented 4 days ago

@0x4007

I need to evaluate the relevance of GitHub contributors' comments to a specific issue specification. Specifically, I'm interested in how much each comment helps to further define the issue specification or contributes new information or research relevant to the issue. Please provide a float between 0 and 1 to represent the degree of relevance. A score of 1 indicates that the comment is entirely relevant and adds significant value to the issue, whereas a score of 0 indicates no relevance or added value. A stringified JSON is given below that contains the specification and contributors' comments. Each comment in the JSON has a unique ID and comment content.

{ "specification": "Do not show warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." }, { "id": 3, "comment": "Updating the password recovery process could help in user management." } ] }

To what degree are each of the comments in the conversation relevant and valuable to further defining the issue specification? Please reply with ONLY a JSON where each key is the comment ID given in JSON above, and the value is a float number between 0 and 1 corresponding to the comment. The float number should represent the degree of relevance and added value of the comment to the issue. The total number of properties in your JSON response should equal exactly 3.

I tried this prompt with the models GPT-4o, GPT-3.5 Turbo, and ChatGPT; almost all of the models give the same relevance values. I think the problem is that there isn't enough context. On its own, a comment might not seem relevant to the issue description and details.

I would suggest a better approach would be to reduce the temperature and top_p values; perhaps a better way would be to evaluate all the comments together in a single block instead of testing them in isolation. Following are the results I got from GPT-4o, which are the same values as GPT-3.5 Turbo:

{
  "1": 0.1,
  "2": 0.2,
  "3": 0.0
}

Explanation:

0x4007 commented 4 days ago

> I would suggest a better approach would be to reduce the temperature and top_p values,

Great idea, except if the temperature is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives).

> perhaps a better way would be to evaluate all the comments together in a single block instead of testing them in isolation.

I'm pretty sure it's implemented this way. I know for a fact that in my original implementation I had them all evaluated in one shot.

sshivaditya2019 commented 4 days ago

> I would suggest a better approach would be to reduce the temperature and top_p values,
>
> Great idea, except if the temperature is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives).
>
> perhaps a better way would be to evaluate all the comments together in a single block instead of testing them in isolation.
>
> I'm pretty sure it's implemented this way. I know for a fact that in my original implementation I had them all evaluated in one shot.

So I tested a few examples; a temperature value of 0.2 works fine for now with the GPT-4o model. I don't think the current implementation does that: the prompt expects a { specification: issue, comments: comments } object, with the comments being of type { id: number; comment: string }[]. I can probably rewrite that part, if that's fine.
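
A minimal sketch of that rewrite, assuming the OpenAI Node SDK and the payload shape described above; the prompt wording, function name, and model choice are placeholders, not the plugin's actual code:

import OpenAI from "openai";

const openai = new OpenAI();

interface CommentForEvaluation {
  id: number;
  comment: string;
}

async function evaluateRelevance(specification: string, comments: CommentForEvaluation[]): Promise<Record<string, number>> {
  // One request containing the spec plus all comments, instead of scoring each comment in isolation.
  const payload = { specification, comments };
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0.2, // low temperature for more deterministic scoring
    messages: [
      {
        role: "user",
        content:
          "Evaluate how relevant each comment is to the issue specification. " +
          "Reply with ONLY a JSON object mapping each comment ID to a float between 0 and 1.\n" +
          JSON.stringify(payload),
      },
    ],
  });
  return JSON.parse(response.choices[0].message.content ?? "{}");
}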

0x4007 commented 4 days ago

I would suggest a better approach would be reduce the temperature and top_p values,

Great idea except if temp is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives)

perhaps a better way would be evaluate all the comments together in a single block instead of testing them in isolation.

I'm pretty sure it's implemented this way. I know for a fact in my original implementation I had them all evaluate in one shot.

@gentlementlegen

gentlementlegen commented 3 days ago

Depends what is meant by "all together". Right now it is all together, but per user, not all comments from the whole issue / PR in one block. The original implementation was the same, except that there was a batch of 10 attempts averaged.

0x4007 commented 3 days ago

I see. Given the extended context windows of the latest models, perhaps we should do it all in one shot?

gentlementlegen commented 3 days ago

If that enhances precision and gives more context for better results, it is nice; however, I wonder if we would easily burst through the max tokens doing so for long reviews.

0x4007 commented 3 days ago

Context windows are long enough these days; I am pretty sure it will be fine.
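
As a quick sanity check on that, a rough token estimate before batching the whole conversation into one request might look like this; the ~4 characters per token heuristic and the 128k window are assumptions, and an exact count would need a real tokenizer such as tiktoken:

// Approximate token count using the common ~4 characters per token heuristic.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns true if the specification plus every comment should fit in the model's
// context window, leaving some room for the JSON response.
function fitsInContext(specification: string, comments: string[], contextWindow = 128_000, reservedForOutput = 2_000): boolean {
  const promptTokens = estimateTokens(specification) + comments.reduce((sum, c) => sum + estimateTokens(c), 0);
  return promptTokens + reservedForOutput <= contextWindow;
}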

sshivaditya2019 commented 3 days ago

@0x4007 I can rewrite the code from the original implementation, but a better way would be to provide the entire conversation dump and the user's conversations as well. I tried this:

I need to evaluate the relevance of GitHub contributors' comments to a specific issue specification. Specifically, I'm interested in how much each comment helps to further define the issue specification or contributes new information or research relevant to the issue. The following is the stringified dump of the entire issue and the conversation chain, with each conversation delimited by <----->

The bot shows a warning message on tasks that were opened recently, as seen here. It should only display this warning above a certain threshold, which comes from the configuration. A possible cause would be that value missing in the current configuration. If that is the case, the default threshold should be set, and definitely above 0 days. Tasks to be carried out: display a warning message when the task is above the configuration threshold; do not display the warning if it is under that threshold; change the configuration to accept a string representing a duration instead of a time stamp; related tests
<----->
@Keyrxng would this be the fix for it? ubiquity/work.ubq.fi#74 (comment)
<----->
yeah @gentlementlegen aae3cba resolved here in #10
<----->
Ok cool thank you, I will close this one after we got your fix merged.
<----->
So the font is wrong be sure to match the styles of the old assign message exactly.
<----->
I triple checked I cannot see any font difference. Maybe there is a difference on mobile devices if you are using one?
<----->
registered wallet seems to wrap only when the warning is visible for some reason, nothing springs to mind on how to fix that
<----->
Use <code> not <samp>
<----->
@0x4007 Thanks for the screenshot, it seems to be more specific to mobile for the font. Maybe a font fallback happening there.
<----->
Been reading through the GFM and CommonMark specs, GH docs etc but haven't been able to find any other workarounds other than <samp>. We'll just have to live with the whitespace if using <code> blocks so which of the below options is best, included <samp> for easy comparison. I can't find any other workarounds with these tags or others such as <kbd> etc.
<----->
If it's only a problem on mobile then perhaps <samp> is the best decision!
<----->
Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. If it's only a problem on mobile then perhaps <samp> is the best decision! Yeah it's only mobile that <samp> seems to have slightly different rendering. Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. I agree it looks best but no it is somewhat different, although personally I think <samp> is the way to go because it'll most commonly be seen via desktop vs mobile and I'd rather no whitespace than to accommodate the lesser used platform. Sounds like option 3 is preferred and that's how the code is right now so no changes needed in this regard
<----->
@gentlementlegen you want to add it into the spec/title here and I'll handle both?
<----->
@Keyrxng Done. Also, I guess the font issue should be carried out in a different issue most likely, if there is not one created already.

Please provide a float between 0 and 1 to represent the degree of relevance. A score of 1 indicates that the comment is entirely relevant and adds significant value to the issue, whereas a score of 0 indicates no relevance or added value. A stringified JSON is given below that contains the specification and contributors' comments. Each comment in the JSON has a unique ID and comment content.

{ "specification": "Do not show warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." }, { "id": 3, "comment": "Updating the password recovery process could help in user management." } ] }

To what degree are each of the comments in the conversation relevant and valuable to further defining the issue specification? Please reply with ONLY a JSON where each key is the comment ID given in JSON above, and the value is a float number between 0 and 1 corresponding to the comment. The float number should represent the degree of relevance and added value of the comment to the issue. The total number of properties in your JSON response should equal exactly 3.

For GPT-4o:

{
  "1": 0.5,
  "2": 0.0,
  "3": 0.0
}

For GPT-3.5 Turbo:

{
  "1": 0.5,
  "2": 0.3,
  "3": 0.0
}

For Claude Sonnet 3.5:

{
  "1": 0.2,
  "2": 0.4,
  "3": 0.0
}

I think GPT-4o is not a good model choice for this task.

0x4007 commented 2 days ago

Claude says the second comment is the most on-topic, which the GPT models disagree with. That's a red flag. Also, there is a way to instruct GPT to return JSON using headers, or some property name in the SDK. Don't specify it in the prompt. I generally think that GPT is smarter, but for massive context I've seen Claude do better on some tasks, specifically related to coding.

sshivaditya2019 commented 2 days ago

> Claude says the second comment is the most on-topic, which the GPT models disagree with. That's a red flag. Also, there is a way to instruct GPT to return JSON using headers, or some property name in the SDK. Don't specify it in the prompt. I generally think that GPT is smarter, but for massive context I've seen Claude do better on some tasks, specifically related to coding.

"response_format" : {
                "type": "json_object" 
},

This will ensure it always returns JSON. I think type also supports json_schema, so it would return JSON objects matching a particular schema as well. I think this is supported only by GPT-4o for now.
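
For illustration, a minimal sketch of that option wired into a request with the OpenAI Node SDK (model, function name, and prompt wording are placeholders); note that JSON mode still expects the word "JSON" to appear somewhere in the messages:

import OpenAI from "openai";

const openai = new OpenAI();

async function scoreComments(specification: string, comments: { id: number; comment: string }[]): Promise<Record<string, number>> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0.2,
    // Constrain the output to valid JSON at the API level instead of relying on the prompt alone.
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: "Reply with a JSON object mapping each comment ID to a relevance float between 0 and 1." },
      { role: "user", content: JSON.stringify({ specification, comments }) },
    ],
  });
  return JSON.parse(response.choices[0].message.content ?? "{}");
}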

For general tasks, OpenAI models work fine, but for this task I think if we are not using GPT-3.5 Turbo, we should be using Claude's Sonnet or Opus models.

0x4007 commented 2 days ago

> we should be using Claude's Sonnet or Opus models.

Unfortunately we need @gentlementlegen's playground feature to fine-tune the conversation-rewards config and the LLM details.

They have been cooking it up for a few days now.

sshivaditya2019 commented 2 days ago

> we should be using Claude's Sonnet or Opus models.
>
> Unfortunately we need @gentlementlegen's playground feature to fine-tune the conversation-rewards config and the LLM details.
>
> They have been cooking it up for a few days now.

There are some non-LLM approaches as well, but I think they would return the same relevance scores; intrinsically, the comments are not very relevant to the issue topic or body. For example, "If it's only a problem on mobile then perhaps <samp> is the best decision!" against an issue titled "Do not show warning message on tasks that were not created a long time ago".
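
As an illustration of that point, one classical non-LLM baseline (hypothetical, not part of the plugin) is bag-of-words cosine similarity between a comment and the specification; the <samp>/<code> comments do indeed score very low against the example spec:

// Count word occurrences in a text (lowercased, alphanumeric tokens only).
function termFrequencies(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between the bag-of-words vectors of two texts, in [0, 1].
function bowCosine(a: string, b: string): number {
  const ta = termFrequencies(a);
  const tb = termFrequencies(b);
  let dot = 0, normA = 0, normB = 0;
  for (const [token, count] of ta) {
    dot += count * (tb.get(token) ?? 0);
    normA += count * count;
  }
  for (const count of tb.values()) normB += count * count;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// bowCosine("If it's only a problem on mobile then perhaps <samp> is the best decision!",
//           "Do not show warning message on tasks that were not created a long time ago") // ≈ 0.13, low relevance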

0x4007 commented 2 days ago

Then we need to adjust our prompt

sshivaditya2019 commented 2 days ago

Possible new prompt; I have to distill it further. It works fine with GPT-4o:


Instruction: 
Go through all the comments first and keep them in memory, then say "I KNOW" and read the following prompt.

OUTPUT FORMAT:
{ID: CONNECTION SCORE} For Each record, based on the average value from the CONNECTION SCORE from ALL COMMENTS, TITLE and BODY, one for each comment under evaluation
Global Context:
Specification: "Do not show a warning message on tasks that were not created a long time ago."
ALL COMMENTS:
Comment ID 1: "Ok cool thank you, I will close this one after we got your fix merged."
Comment ID 2: "So the font is wrong be sure to match the styles of the old assign message exactly."
Comment ID 3: "I triple checked I cannot see any font difference. Maybe there is a difference on mobile devices if you are using one?"
Comment ID 4: "Registered wallet seems to wrap only when the warning is visible for some reason, nothing springs to mind on how to fix that."
Comment ID 5: "Use <code> not <samp>."
Comment ID 6: "@0x4007 Thanks for the screenshot, it seems to be more specific to mobile for the font. Maybe a font fallback happening there."
Comment ID 7: "Been reading through the GFM and CommonMark specs, GH docs etc but haven't been able to find any other workarounds other than <samp>. We'll just have to live with the whitespace if using <code> blocks so which of the below options is best, included <samp> for easy comparison. I can't find any other workarounds with these tags or others such as <kbd> etc."
Comment ID 8: "If it's only a problem on mobile then perhaps <samp> is the best decision!"
Comment ID 9: "Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it."
Comment ID 10: "Yeah it's only mobile that <samp> seems to have slightly different rendering Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. I agree it looks best but no it is somewhat different although personally I think <samp> is the way to go because it'll most commonly be seen via desktop vs mobile and I'd rather no whitespace than to accommodate the lesser used platform. Sounds like option 3 is preferred and that's how the code is right now so no changes needed in this regard."
Comment ID 11: "@gentlementlegen you want to add it into the spec/title here and I'll handle both?"
Comment ID 12: "@Keyrxng Done. Also, I guess the font issue should be carried out in a different issue most likely, if there is not one created already."

IMPORTANT CONTEXT:
You have now seen all the comments made by other users. Keeping these comments in mind, think about the ways the comments to be evaluated might be connected to them. The comments related to the comment under evaluation might come before or after it in the list, but they will be there in ALL COMMENTS. THEY COULD BE BEFORE OR AFTER, so you have to diligently search through all the comments in ALL COMMENTS.

START EVALUATING:
Comments for Evaluation: { "specification": "Do not show a warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." }, { "id": 3, "comment": "I think this should because of a hardware bug or something in the OS." } ] }

POST EVALUATION:
THE RESULT FROM THIS SHOULD BE ONLY THE SCORES BASED ON THE FLOATING POINT VALUE CONNECTING HOW CLOSE THE COMMENT IS FROM ALL COMMENTS AND TITLE AND BODY.

From ALL COMMENTS, find the comments most relevant to each comment under evaluation and rank the top 3 in descending order, for each comment. You need to fill in details in conjunction with the title, the spec info, and ALL COMMENTS; the comment should be somewhat relevant to the overall topic. It should be closely related to something mentioned and discussed in ALL COMMENTS and should be relevant to the central topic (TITLE AND ISSUE SPEC).

Now assign them scores: a float value ranging from 0 to 1, where 0 is spam (lowest value) and 1 is something that is very relevant (highest value). Here relevance can mean a variety of things: it could be a fix to the issue, it could be a bug in the solution, it could be a SPAM message, or it could be a comment that on its own does not carry weight but, when CHECKED AGAINST ALL COMMENTS, may be a crucial piece of information for debugging and solving the ticket. If YOU THINK IT IS NOT RELATED to ALL COMMENTS or TITLE OR ISSUE SPEC, then give it a 0 SCORE.

OUTPUT:
RETURN ONLY A JSON with the ID and the connection score (FLOATING POINT VALUE) against ALL COMMENTS, TITLE AND BODY for each comment under evaluation. RETURN ONLY ONE CONNECTION SCORE VALUE for each comment.

0x4007 commented 2 days ago

Regarding your prompt, I feel like it could be useful context for the LLM to know who is posting which comment. The issue author is most likely to remain on topic and to guide the conversation I think. We can eventually iterate the prompt to accommodate this?

Another idea to enhance the accuracy is to generate embeddings, sort the comments by how on-topic each one is, and then finally assign a descending floating-point score to each by asking the LLM to do so.

Basically the embeddings can sort the order and then the LLM can finely score each, and it knows that it must be in descending order.

Using the embedding criteria I posted about, we could also use anomaly detection to set outliers aside as off-topic and not score them.

I also feel like we could replace word count with the summarized points of comments. The idea is that it can potentially cut through the noise (verbose explanations) and give credit for the salient points/ideas only. This needs more research, though.
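
A minimal sketch of the embedding idea, assuming the OpenAI embeddings endpoint (the model name is a placeholder); the sorted similarities could then be handed to the LLM, or to an anomaly-detection step, for final scoring:

import OpenAI from "openai";

const openai = new OpenAI();

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed the specification and every comment, then sort comments by how on-topic they are.
async function rankCommentsBySpec(specification: string, comments: { id: number; comment: string }[]) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [specification, ...comments.map((c) => c.comment)],
  });
  const [specEmbedding, ...commentEmbeddings] = response.data.map((d) => d.embedding);
  return comments
    .map((c, i) => ({ id: c.id, similarity: cosineSimilarity(specEmbedding, commentEmbeddings[i]) }))
    .sort((a, b) => b.similarity - a.similarity); // most on-topic first; outliers could be dropped here
}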

sshivaditya2019 commented 2 days ago

> Regarding your prompt, I feel like it could be useful context for the LLM to know who is posting which comment. The issue author is most likely to remain on topic and to guide the conversation I think. We can eventually iterate the prompt to accommodate this?
>
> Another idea to enhance the accuracy is to generate embeddings, sort the comments by how on-topic each one is, and then finally assign a descending floating-point score to each by asking the LLM to do so.
>
> Basically the embeddings can sort the order and then the LLM can finely score each, and it knows that it must be in descending order.
>
> Using the embedding criteria I posted about, we could also use anomaly detection to set outliers aside as off-topic and not score them.
>
> I also feel like we could replace word count with the summarized points of comments. The idea is that it can potentially cut through the noise (verbose explanations) and give credit for the salient points/ideas only. This needs more research, though.

Voyage AI has rerankers; we could try ranking each comment against (ALL COMMENTS + title + body) and assign percentile-based relevance scores. I think this is one possible solution, but it does not account for verbose vs. concise explanations.
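
A sketch of just the percentile part, independent of any particular reranker: given raw relevance scores (the values in the example are made up), map each comment to its percentile rank within the batch:

interface ScoredComment {
  id: number;
  relevance: number; // raw score returned by the reranker
}

// Convert raw reranker scores into percentile-based relevance scores in [0, 1].
function toPercentileScores(scored: ScoredComment[]): Record<number, number> {
  const sorted = [...scored].sort((a, b) => a.relevance - b.relevance);
  const result: Record<number, number> = {};
  sorted.forEach((item, index) => {
    result[item.id] = sorted.length > 1 ? index / (sorted.length - 1) : 1;
  });
  return result;
}

// toPercentileScores([{ id: 1, relevance: 0.42 }, { id: 2, relevance: 0.91 }, { id: 3, relevance: 0.05 }])
// => { "1": 0.5, "2": 1, "3": 0 }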

ubiquity-os[bot] commented 1 day ago

@sshivaditya2019, this task has been idle for a while. Please provide an update.

gentlementlegen commented 1 day ago

I feel like some strange stuff is going on with the reminders; there was activity on that issue within the last 3.5 days. @Keyrxng rfc