Open 0x4007 opened 1 month ago
Made it a week to guarantee a good job. This is a core feature that needs to work at least as well as before.
Seems like a test with known samples is a good next step here.
/start
| Deadline | Fri, Sep 13, 11:02 AM UTC |
| Registered Wallet | 0xDAba6e01D15Db560b88C8F426b016801f79e1F69 |
<ul>
<li>Use <code>/wallet 0x0000...0000</code> if you want to update your registered payment wallet address.</li>
<li>Be sure to open a draft pull request as soon as possible to communicate updates on your progress.</li>
<li>Be sure to provide timely updates to us when requested, or you will be automatically unassigned from the task.</li>
</ul>
@0x4007
- Could you share some sample comments which previously had high relevance scores?
Unfortunately you or I would just have to manually check old completed tasks and see their rewards. None in particular come to mind, but I would pay attention to those posted by "ubiquibot" instead of "ubiquity-os" as those used an older version of conversation rewards that seemed more accurate.
- Also, could you point me to the section of the code that gave credit for images, if it exists?
It is under the "formatting score" or "quantitative scoring" section. You might be able to search for these keywords in the codebase. I am mobile so pointing to code is not feasible. @gentlementlegen perhaps you can help with this point.
@sshivaditya2019, this task has been idle for a while. Please provide an update.
@gentlementlegen Really nice to see this finally working as expected. Except the revision hash in the metadata is undefined. This should be fixed!
@0x4007
I need to evaluate the relevance of GitHub contributors' comments to a specific issue specification. Specifically, I'm interested in how much each comment helps to further define the issue specification or contributes new information or research relevant to the issue. Please provide a float between 0 and 1 to represent the degree of relevance. A score of 1 indicates that the comment is entirely relevant and adds significant value to the issue, whereas a score of 0 indicates no relevance or added value. A stringified JSON is given below that contains the specification and contributors' comments. Each comment in the JSON has a unique ID and comment content.{ "specification": "Do not show warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." }, { "id": 3, "comment": "Updating the password recovery process could help in user management." } ] } To what degree are each of the comments in the conversation relevant and valuable to further defining the issue specification? Please reply with ONLY a JSON where each key is the comment ID given in JSON above, and the value is a float number between 0 and 1 corresponding to the comment. The float number should represent the degree of relevance and added value of the comment to the issue. The total number of properties in your JSON response should equal exactly 3.
I tried this prompt with the models `gpt-4o`, `gpt-3.5-turbo`, and ChatGPT; almost all of the models give the same relevance values. I think the problem is that there isn't enough context: on its own, a comment might not seem relevant to the issue description and details.
I would suggest reducing the `temperature` and `top_p` values; an even better approach might be to evaluate all the comments together in a single block instead of testing them in isolation. Following are the results I got from GPT-4o, which are the same as for GPT-3.5-Turbo:
{
"1": 0.1,
"2": 0.2,
"3": 0.0
}
Comment 1: "Ok cool thank you, I will close this one after we got your fix merged."
This comment is mostly administrative and does not directly contribute to defining the issue or adding new information. It is somewhat relevant because it acknowledges the fix, but it doesn't provide any new insights or suggestions. Relevance: 0.1
Comment 2: "So the font is wrong be sure to match the styles of the old assign message exactly."
This comment addresses a style issue (font) but does not directly relate to the core issue of warning messages or task creation timing. However, it could be tangentially related if the warning message involves UI elements, so it has some minor relevance. Relevance: 0.2
Comment 3: "Updating the password recovery process could help in user management."
This comment is entirely unrelated to the issue specification, which focuses on warning messages and task creation timing. It introduces a different topic (password recovery), so it has no relevance to the current issue. Relevance: 0.0
I would suggest a better approach would be reduce the temperature and top_p values,
Great idea except if temp is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives)
perhaps a better way would be evaluate all the comments together in a single block instead of testing them in isolation.
I'm pretty sure it's implemented this way. I know for a fact in my original implementation I had them all evaluate in one shot.
So I tested a few examples; a `temperature` value of 0.2 works fine for now with the GPT-4o model. I don't think the current implementation does that; the prompt expects a `{ specification: issue, comments: comments }` object, with the comments being of type `{ id: number; comment: string }[]`. I can probably rewrite that part, if that's fine.
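That expected input shape could be sketched like this in TypeScript (a sketch based on the types quoted above; `buildPayload` is a hypothetical helper, not code from the repository):

```typescript
// Shape the relevance prompt reportedly expects (from the discussion above).
interface CommentEntry {
  id: number;
  comment: string;
}

interface RelevancePayload {
  specification: string;
  comments: CommentEntry[];
}

// Hypothetical helper: build the single-block payload from an issue spec and its comments.
function buildPayload(specification: string, comments: string[]): RelevancePayload {
  return {
    specification,
    comments: comments.map((comment, index) => ({ id: index + 1, comment })),
  };
}

const payload = buildPayload(
  "Do not show warning message on tasks that were not created a long time ago",
  ["Ok cool thank you, I will close this one after we got your fix merged."]
);
console.log(JSON.stringify(payload));
```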
I would suggest a better approach would be reduce the temperature and top_p values,
Great idea except if temp is set too low I know it repeats and crashes. I'm pretty sure I played with these settings in my original implementation (see the repo called comment-incentives)
perhaps a better way would be evaluate all the comments together in a single block instead of testing them in isolation.
I'm pretty sure it's implemented this way. I know for a fact in my original implementation I had them all evaluate in one shot.
@gentlementlegen
Depends what is meant by "all together". Now it is all together, but by user, not all comments from the whole issue / PR in one block. The original implementation was the same except that there was a batch of 10 attempts averaged.
I see. Given the extended context windows of the latest models, perhaps we should do it all in one shot?
If that enhances precision and gives more context for better results it is nice, however I wonder if we would easily burst through the max tokens doing so for long reviews.
Context windows are so long these days, I am pretty sure it will be fine.
@0x4007 I can rewrite the code for the original implementation, but a better way would be to provide an entire conversation dump along with the user's conversations as well. I tried this:
I need to evaluate the relevance of GitHub contributors' comments to a specific issue specification. Specifically, I'm interested in how much each comment helps to further define the issue specification or contributes new information or research relevant to the issue. The following is the stringified dump of the entire issue and the conversation chain with each conversation delimited by <-----> The bot shows a warning message on tasks that were opened recently, as seen here. It should only display this warning above a certain threshold, which comes from the configuration.A possible cause would be that value missing in the current configuration. If that is the case, the default threshold should be set, and definitely above 0 days.Tasks to be carried out: display a warning message when the task is above the configuration threshold do not display the warning if it is under that threshold change the configuration to accept a string representing a duration instead of a time stamp related tests <----->@Keyrxng would this be the fix for it? ubiquity/work.ubq.fi#74 (comment)<----->yeah @gentlementlegen aae3cba resolved here in #10<---->Ok cool thank you, I will close this one after we got your fix merged.<----->So the font is wrong be sure to match the styles of the old assign message exactly.<----->I triple checked I cannot see any font difference.Maybe there is a difference on mobile devices if you are using one?<---->registered wallet seems to wrap only when the warning is visible for some reason, nothing springs to mind on how to fix that<---->Use <code> not <samp><---->@0x4007 Thanks for the screenshot, it seems to be more specific to mobile for the font. Maybe a font fallback happening there.<---->een reading through the GFM and CommonMark specs, GH docs etc but haven't been able to find any other workarounds other than <samp>. 
We'll just have to live with the whitespace if using <code> blocks so which of the below options is best, included <samp> for easy comparison. I can't find any other workarounds with these tags or others such as <kbd> etc.<---->f its only a problem on mobile then perhaps <samp> is the best decision!<---->Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it.If its only a problem on mobile then perhaps is the best decision! Yeah it's only mobile that <samp> seems to have slightly different rendering Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. I agree it looks best but no it is somewhat different although personally I think <samp> is the way to go because it'll most commonly be seen via desktop vs mobile and I'd rather no whitespace than to accommodate the lesser used platform. Sounds like option 3 is preferred and that's how the code is right now so no changes needed in this regard<---->@gentlementlegen you want to add it into the spec/title here and I'll handle both?<---->@Keyrxng Done. Also, I guess the font issue should be carried out in a different issue most likely, if there is not once created already. Please provide a float between 0 and 1 to represent the degree of relevance. A score of 1 indicates that the comment is entirely relevant and adds significant value to the issue, whereas a score of 0 indicates no relevance or added value. A stringified JSON is given below that contains the specification and contributors' comments. Each comment in the JSON has a unique ID and comment content. { "specification": "Do not show warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." 
}, { "id": 3, "comment": "Updating the password recovery process could help in user management." } ] } To what degree are each of the comments in the conversation relevant and valuable to further defining the issue specification? Please reply with ONLY a JSON where each key is the comment ID given in JSON above, and the value is a float number between 0 and 1 corresponding to the comment. The float number should represent the degree of relevance and added value of the comment to the issue. The total number of properties in your JSON response should equal exactly 3.
For GPT-4o:
{
"1": 0.5,
"2": 0.0,
"3": 0.0
}
For GPT-3.5 Turbo:
{
"1": 0.5,
"2": 0.3,
"3": 0.0
}
For Claude Sonnet 3.5:
{
"1": 0.2,
"2": 0.4,
"3": 0.0
}
I think GPT-4o is not a good model choice for this task.
Claude says the second comment is the most on topic, which the GPTs disagree with. That's a red flag. Also, there is a way to instruct GPT to return JSON using headers, or some property name in the SDK; don't specify it in the prompt. I generally think that GPT is smarter, but for massive context I've seen Claude do better on some tasks specifically related to coding.
"response_format" : {
"type": "json_object"
},
This will ensure it always returns a JSON. I think `type` also supports `json_schema`, so it can return JSON objects matching a particular schema as well. I think this is supported only by GPT-4o for now.
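Put together with the sampling settings discussed earlier, the request body for OpenAI's chat completions endpoint might look like the sketch below. The model name and exact values are illustrative, and this only builds the payload rather than calling the API:

```typescript
// Sketch of a chat completions request body with constrained JSON output.
// response_format: { type: "json_object" } tells the model to emit valid JSON.
interface ChatRequest {
  model: string;
  temperature: number;
  top_p: number;
  response_format: { type: "json_object" | "json_schema" };
  messages: { role: "system" | "user"; content: string }[];
}

function buildRelevanceRequest(prompt: string): ChatRequest {
  return {
    model: "gpt-4o", // illustrative choice
    temperature: 0.2, // low temperature, per the discussion above
    top_p: 1,
    response_format: { type: "json_object" },
    messages: [
      // OpenAI's docs note the word "JSON" should appear in the messages
      // when json_object mode is used.
      { role: "system", content: "Reply with ONLY a JSON object of comment scores." },
      { role: "user", content: prompt },
    ],
  };
}
```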
For general tasks, OpenAI models work fine, but I think for this task, if we are not using GPT-3.5 Turbo, we should be using Claude's Sonnet or Opus models.
we should be using Claude's Sonnet or Opus models.
Unfortunately we need @gentlementlegen's playground feature to fine-tune the conversation reward config and the LLM details.
They have been cooking it up for a few days now.
There are some non-LLM approaches as well, but I think they would return the same relevance scores; intrinsically the comments are not very relevant to the issue topic or body. For example, "If it's only a problem on mobile then perhaps <samp> is the best decision!" when the topic of the issue was "Do not show warning message on tasks that were not created a long time ago".
Then we need to adjust our prompt.
Here is a possible new prompt; we have to distill it further. It works fine with GPT-4o:
Instruction:
Go through all the comments first keep them in memory, then say "I KNOW" and read the following prompt
OUTPUT FORMAT:
{ID: CONNECTION SCORE} For Each record, based on the average value from the CONNECTION SCORE from ALL COMMENTS, TITLE and BODY, one for each comment under evaluation
Global Context:
Specification: "Do not show a warning message on tasks that were not created a long time ago."
ALL COMMENTS:
Comment ID 1: "Ok cool thank you, I will close this one after we got your fix merged."
Comment ID 2: "So the font is wrong be sure to match the styles of the old assign message exactly."
Comment ID 3: "I triple checked I cannot see any font difference. Maybe there is a difference on mobile devices if you are using one?"
Comment ID 4: "Registered wallet seems to wrap only when the warning is visible for some reason, nothing springs to mind on how to fix that."
Comment ID 5: "Use <code> not <samp>."
Comment ID 6: "@0x4007 Thanks for the screenshot, it seems to be more specific to mobile for the font. Maybe a font fallback happening there."
Comment ID 7: "Been reading through the GFM and CommonMark specs, GH docs etc but haven't been able to find any other workarounds other than <samp>. We'll just have to live with the whitespace if using <code> blocks so which of the below options is best, included <samp> for easy comparison. I can't find any other workarounds with these tags or others such as <kbd> etc."
Comment ID 8: "If it's only a problem on mobile then perhaps <samp> is the best decision!"
Comment ID 9: "Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it."
Comment ID 10: "Yeah it's only mobile that <samp> seems to have slightly different rendering Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. I agree it looks best but no it is somewhat different although personally I think <samp> is the way to go because it'll most commonly be seen via desktop vs mobile and I'd rather no whitespace than to accommodate the lesser used platform. Sounds like option 3 is preferred and that's how the code is right now so no changes needed in this regard."
Comment ID 11: "@gentlementlegen you want to add it into the spec/title here and I'll handle both?"
Comment ID 12: "@Keyrxng Done. Also, I guess the font issue should be carried out in a different issue most likely, if there is not one created already."
IMPORTANT CONTEXT:
You have now seen all the comments made by other users. Keeping the comments in mind, think in what ways the comments to be evaluated may be connected. The comments related to the comment under evaluation might come after or before it in the list of all comments, but they will be there in ALL COMMENTS. COULD BE BEFORE OR AFTER; you have to diligently search through all the comments in ALL COMMENTS.
START EVALUATING:
Comments for Evaluation: { "specification": "Do not show a warning message on tasks that were not created a long time ago", "comments": [ { "id": 1, "comment": "Ok cool thank you, I will close this one after we got your fix merged." }, { "id": 2, "comment": "So the font is wrong be sure to match the styles of the old assign message exactly." }, { "id": 3, "comment": "I think this should because of a hardware bug or something in the OS." } ] }
POST EVALUATION:
THE RESULT FROM THIS SHOULD BE ONLY THE SCORES BASED ON THE FLOATING POINT VALUE CONNECTING HOW CLOSE THE COMMENT IS FROM ALL COMMENTS AND TITLE AND BODY.
From ALL COMMENTS, find the comments most relevant to each comment under evaluation and rank them in descending order for the top 3, for each comment. Though, you need to fill in details, in conjunctions with the title and spec info and ALL COMMENTS, it should be somewhat relevant to the overall topic. The comment should be closely related to something mentioned and discussed in the ALL COMMENTS and should be relevant to the central topic (TITLE AND ISSUE SPEC)
Now assign them scores: a float value ranging from 0 to 1, where 0 is spam (lowest value) and 1 is something that's very relevant (highest value). Here relevance can mean a variety of things: it could be a fix to the issue, it could be a bug in a solution, it could be a SPAM message, or it could be a comment that on its own does not carry weight but, when CHECKED IN ALL COMMENTS, may be a crucial piece of information for debugging and solving the ticket. If YOU THINK IT IS NOT RELATED to ALL COMMENTS or TITLE OR ISSUE SPEC, then give it a 0 SCORE.
OUTPUT:
RETURN ONLY A JSON with the ID and the connection score (FLOATING POINT VALUE) with ALL COMMENTS TITLE AND BODY for each comment under evaluation. RETURN ONLY ONE CONNECTION SCORE VALUE for each comment
Regarding your prompt, I feel like it could be useful context for the LLM to know who is posting which comment. The issue author is most likely to remain on topic and to guide the conversation I think. We can eventually iterate the prompt to accommodate this?
Another idea to enhance the accuracy is to generate embeddings and then sort how on topic each comment is, and then finally assigning a descending floating point score to each by asking the LLM to do so.
Basically the embeddings can sort the order and then the LLM can finely score each, and it knows that it must be in descending order.
According to the embedding criteria I posted about we could also use anomaly detection and basically set them aside as off topic, and not score them.
I also feel like we could replace word count with the summarized points of comments. The idea is that it can potentially cut through the noise (verbose explanations) and credit for the salient points/ideas only. This needs more research though.
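The embeddings idea above (sort comments by how on-topic they are, then let the LLM assign the fine-grained descending scores) could be sketched like this; `cosineSimilarity` and `rankByTopic` are illustrative names, and producing the embeddings themselves (e.g. via an embeddings API) is out of scope:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Sort comment ids by how on-topic their embedding is, most similar to the spec first.
function rankByTopic(spec: number[], comments: { id: number; embedding: number[] }[]): number[] {
  return [...comments]
    .sort((a, b) => cosineSimilarity(spec, b.embedding) - cosineSimilarity(spec, a.embedding))
    .map((c) => c.id);
}
```

The LLM would then only need to score the already-ordered list, which constrains it to a descending assignment.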
Voyage AI has rerankers; we could try ranking each comment against (all comments + title + body) and assigning percentile-based relevance scores. I think this is one possible solution, but it does not account for verbose vs. concise explanations.
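However the reranker is invoked, converting its raw scores into percentile-based relevance is straightforward; this sketch shows only that conversion step (the reranker call itself is omitted, and the function name is illustrative):

```typescript
// Convert raw reranker scores into percentile-based relevance in [0, 1].
function toPercentileScores(raw: { id: number; score: number }[]): Record<number, number> {
  const sorted = [...raw].sort((a, b) => a.score - b.score);
  const out: Record<number, number> = {};
  sorted.forEach((entry, rank) => {
    // rank / (n - 1) maps the lowest raw score to 0 and the highest to 1.
    out[entry.id] = sorted.length > 1 ? rank / (sorted.length - 1) : 1;
  });
  return out;
}
```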
@sshivaditya2019, this task has been idle for a while. Please provide an update.
I feel like some strange stuff is going on with the reminders; there was activity within the last 3.5 days on that issue. @Keyrxng rfc
This prompt works fine, I have tried with multiple examples
Instruction:
Go through all the comments first keep them in memory, then start with the following prompt
OUTPUT FORMAT:
{ID: CONNECTION SCORE} For Each record, based on the average value from the CONNECTION SCORE from ALL COMMENTS, TITLE and BODY, one for each comment under evaluation
Global Context:
Specification: "Do not show a warning message on tasks that were not created a long time ago."
ALL COMMENTS:
Comment ID 1: "Ok cool thank you, I will close this one after we got your fix merged."
Author: @gentlementlegen
Comment ID 2: "So the font is wrong be sure to match the styles of the old assign message exactly."
Author: @0x4007
Comment ID 3: "I triple checked I cannot see any font difference. Maybe there is a difference on mobile devices if you are using one?"
Author: @gentlementlegen
Comment ID 4: "Registered wallet seems to wrap only when the warning is visible for some reason, nothing springs to mind on how to fix that."
Author: @Keyrxng
Comment ID 5: "Use <code> not <samp>."
Author: @0x4007
Comment ID 6: "@0x4007 Thanks for the screenshot, it seems to be more specific to mobile for the font. Maybe a font fallback happening there."
Author: @gentlementlegen
Comment ID 7: "Been reading through the GFM and CommonMark specs, GH docs etc but haven't been able to find any other workarounds other than <samp>. We'll just have to live with the whitespace if using <code> blocks so which of the below options is best, included <samp> for easy comparison. I can't find any other workarounds with these tags or others such as <kbd> etc."
Author: @Keyrxng
Comment ID 8: "If it's only a problem on mobile then perhaps <samp> is the best decision!"
Author: @0x4007
Comment ID 9: "Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it."
Author: @gentlementlegen
Comment ID 10: "Yeah it's only mobile that <samp> seems to have slightly different rendering Number 3 seems to be the best display in your propositions, if that works on mobile then let's go with it. I agree it looks best but no it is somewhat different although personally I think <samp> is the way to go because it'll most commonly be seen via desktop vs mobile and I'd rather no whitespace than to accommodate the lesser used platform. Sounds like option 3 is preferred and that's how the code is right now so no changes needed in this regard."
Author: @Keyrxng
Comment ID 11: "@gentlementlegen you want to add it into the spec/title here and I'll handle both?"
Author: @Keyrxng
Comment ID 12: "@Keyrxng Done. Also, I guess the font issue should be carried out in a different issue most likely, if there is not one created already."
Author: @gentlementlegen
IMPORTANT CONTEXT:
You have now seen all the comments made by other users. Keeping the comments in mind, think in what ways the comments to be evaluated may be connected. The comments related to the comment under evaluation might come after or before it in the list of all comments, but they will be there in ALL COMMENTS. COULD BE BEFORE OR AFTER; you have to diligently search through all the comments in ALL COMMENTS.
START EVALUATING:
Comments for Evaluation: {"specification":"Do not show a warning message on tasks that were not created a long time ago","comments":[{"id":1,"comment":"Ok cool thank you, I will close this one after we got your fix merged.","author":"@gentlementlegen"},{"id":2,"comment":"So the font is wrong be sure to match the styles of the old assign message exactly.","author":"@0x4007"},{"id":3,"comment":"I think this should because of a hardware bug or something in the OS.","author":"@random"}]}
POST EVALUATION:
THE RESULT FROM THIS SHOULD BE ONLY THE SCORES BASED ON THE FLOATING POINT VALUE CONNECTING HOW CLOSE THE COMMENT IS FROM ALL COMMENTS AND TITLE AND BODY.
Now assign them scores: a float value ranging from 0 to 1, where 0 is spam (lowest value) and 1 is something that's very relevant (highest value). Here relevance can mean a variety of things: it could be a fix to the issue, it could be a bug in a solution, it could be a SPAM message, or it could be a comment that on its own does not carry weight but, when CHECKED IN ALL COMMENTS, may be a crucial piece of information for debugging and solving the ticket. If YOU THINK IT IS NOT RELATED to ALL COMMENTS or TITLE OR ISSUE SPEC, then give it a 0 SCORE.
OUTPUT:
RETURN ONLY A JSON with the ID and the connection score (FLOATING POINT VALUE) with ALL COMMENTS TITLE AND BODY for each comment under evaluation. RETURN ONLY ONE CONNECTION SCORE VALUE for each comment
It returns:
{
"1": 0.8,
"2": 0.6,
"3": 0.0
}
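Whichever prompt variant is used, the model's JSON reply should be validated before the scores are trusted. A sketch, where `parseRelevanceReply` and the clamping behavior are illustrative choices rather than existing code:

```typescript
// Validate the model's JSON reply: keys must match the evaluated comment ids
// and values must be numbers, clamped to [0, 1]. expectedIds comes from the payload sent.
function parseRelevanceReply(reply: string, expectedIds: number[]): Record<number, number> {
  const parsed = JSON.parse(reply) as Record<string, unknown>;
  const out: Record<number, number> = {};
  for (const id of expectedIds) {
    const value = parsed[String(id)];
    if (typeof value !== "number") {
      throw new Error(`Missing or non-numeric score for comment ${id}`);
    }
    out[id] = Math.min(1, Math.max(0, value)); // clamp to [0, 1]
  }
  if (Object.keys(parsed).length !== expectedIds.length) {
    throw new Error("Reply contains unexpected keys");
  }
  return out;
}
```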
@sshivaditya2019, this task has been idle for a while. Please provide an update.
@0x4007
I am trying to set this up, but `process.issue.test.ts` is failing for me. Is there something particular I need to set up apart from the `.env`?
@sshivaditya2019 Could you elaborate on your issue? also it would help if you open a PR as a draft so we can see what's going on.
In `process.issue.test.ts`, the tests for `Should generate permits` and `Should generate GitHub comment` are failing, even before I made any changes. Is this expected?
For the `Should generate permits` test, Jest is reporting an error indicating that `permitUrl` is missing from the `processor.dump`.
@sshivaditya2019 You can see in the pull request you linked that tests are passing successfully, so my best guess is some problem in your local setup. Maybe you have some environment variables missing, or some configuration problems?
If the permit URL is missing it is probably because permits failed to generate. I don't know if you have more detailed errors, but as long as it works on your linked pull-request you shouldn't need to worry much about your local tests.
@gentlementlegen, is there any way to check `relevance` scores locally? Most of the tests have a mocked implementation of `_evaluateComments`.
To check them locally, you will need to use your own OpenAI key to get ChatGPT to run. What I do on my side is have a local Jest test that allows me to run it on any issue. Please check out this Gist, it might help you.
This test was very helpful and should definitely be included in the testing suite.
The problem is that this test runs on real data, which is subject to change. What could be done is including this test but excluding it from the test run. However, there is this issue that should eventually get that covered; I didn't have time to look into it.
Let me know if you’re not working on it right now, I can take a look at it then.
@0x4007 I think the issue spec for this one is a bit vague. I think the relevance issue should be fixed with the PR. The image credit is a configuration issue. But I am not sure about the outcomes for this ticket.
Is the goal of the ticket to rewrite the entire `comment-evaluator-module` with `embeddings` and `vector search`? Or is it something else?
@0x4007. I think the issue spec for this one is a bit vague.
For vague specs which happen occasionally, we are to share research and concerns on here. We all get credited for it.
I think relevance issue should be fixed with the PR. The img credit is a configuration issue. But, I am not sure about the outcomes for this ticket.
Is the goal for the ticket to rewrite the entire `comment-evaluator-module` with the `embeddings` and `vector search`? Or is it something else?
Some recent observations:
- We are very likely going to entirely remove the word counter feature and instead generate embeddings and understand how much value a comment adds to solving the problem.
@0x4007 Embeddings will not solve the problem; I think the present relevance scoring is the best technique. In my opinion, a better approach would be to use something like a Bag of Words model with hierarchical labeling, and assign scores according to the depth of the concept. Let me know; I can put together a small write-up on this.
- We need to change the code to emphasize crediting per HTML tag and then the meaning of the comment with embeddings instead of word count.
I think the present implementation focuses only on formatting. Effectively this would mean the entire `formatting-evaluator-module.ts` being rewritten.
- We still need to test how far we can get regarding vector embeddings and how well it solves our problem.
I am very skeptical about embeddings in this use case. As I mentioned before, embeddings provide local context and references, and on their own would not mean anything. I created my own script to plot and visualize the embeddings and perform PCA to extract the cluster centers.
Original Comments:
Embeddings Plot with Comments:
Here, you can see three distinctive cluster centers. In the embeddings plot with comments, I have added a new comment, "Something random blah blah"; as you can see, it is near a cluster center and would have high similarity in a vector search. This should not happen. My suggestion would be to use an NLP method instead of embedding-based vector search. Let me know if you want to set this up on your end; I can help you with that.
My peer suggested some search engine results related algorithm. I'm asking him now to clarify which. This should help us see how on topic it is for the specification. We could consider adding this as one of several dimensions we evaluate the comments by.
Starting to wonder if sub-plugins are realistic, or if we should just make npm modules (or use something like git submodules).
ChatGPT is recommending the following:
We are exploring methods to evaluate comment readability and conciseness using Flesch-Kincaid readability metrics. These formulas assess the complexity and clarity of text based on sentence length and word syllables:
- Flesch Reading Ease: Rates text on a scale from 0 (difficult) to 100 (easy). Scores can help determine how easily a comment can be understood.
- Flesch-Kincaid Grade Level: Converts readability into an estimated school grade level required to comprehend the text.
These metrics can help us identify verbose comments vs. concise, high-value inputs.
This seems a lot more interesting than word count, but we should test it.
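As a rough sketch of how these metrics could be computed (the syllable counter below is a crude heuristic, and the function names are hypothetical, not from the codebase):

```typescript
// Rough syllable estimate: count contiguous vowel groups, minus a trailing
// silent "e". Real readability libraries use more careful heuristics.
function countSyllables(word: string): number {
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (w.length === 0) return 0;
  const groups = w.match(/[aeiouy]+/g)?.length ?? 0;
  const silentE = w.endsWith("e") && groups > 1 ? 1 : 0;
  return Math.max(1, groups - silentE);
}

// Flesch Reading Ease: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
// Higher scores mean easier text.
function fleschReadingEase(text: string): number {
  const sentences = Math.max(1, (text.match(/[.!?]+/g) ?? []).length);
  const words = text.split(/\s+/).filter(Boolean);
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  return 206.835 - 1.015 * (words.length / sentences) - 84.6 * (syllables / words.length);
}
```

A terse comment like "The cat sat on the mat." scores well above 90, while jargon-heavy text scores far lower (even negative), which is the kind of signal a verbosity check could use.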
The idea is that we can develop a proprietary algorithm that combines several strategies. Ideally, we should build a playground where we can plug in these different modules and run tests against live GitHub issues to tweak the algorithm.
Strategy ideas:
@gentlementlegen
`conversation-rewards` already supports modules within itself that you can enable/disable to change the final output. You can add as many transforming modules as you want and enable/disable them through the configuration.
My peer got back to me regarding the search engine recommendation
TF-IDF (Term Frequency-Inverse Document Frequency) is a classic algorithm used in search and information retrieval to evaluate how important a word is to a document relative to a collection of documents (often referred to as a "corpus"). It helps identify which terms are most relevant to the context of a specific document.
Given your objective to measure the value of GitHub comments in relation to problem-solving, TF-IDF could be a useful tool to assess the relevance and informational density of individual comments with respect to the overall issue or conversation.
Here's how TF-IDF might be applied in your scenario:
$$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

$$\text{IDF}(t) = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$
- Identifying Key Terms in Comments
- Assessing Relevance to the Issue Description
- Filtering Out Low-Value Contributions
- Weighted Relevance Scores: Use TF-IDF scores to assign a relevance weight to each comment, allowing you to rank comments on a continuum of importance rather than using binary relevance.
- Combining with Other Metrics: Integrate TF-IDF scores with other continuous metrics (e.g., semantic similarity, readability) to create a comprehensive scoring system that reflects both the specificity and value of a comment.
Using TF-IDF will give you an effective way to measure the informational value and relevance of comments, aligning well with your goal of continuum-based scoring. Let me know if you’d like to dive deeper into any specific aspect of this approach!
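To make the idea concrete, here is a minimal TypeScript sketch of TF-IDF scoring plus cosine similarity against the issue description (an illustration with simplified assumptions, no stop-word removal or stemming; not the project's implementation):

```typescript
// Lowercase word tokenizer (no stop-word removal or stemming).
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Build a TF-IDF vector for each document over a shared corpus.
// TF = term count / doc length; IDF = log(totalDocs / docsContainingTerm).
function tfIdfVectors(docs: string[][]): Map<string, number>[] {
  const df = new Map<string, number>();
  for (const doc of docs) {
    for (const term of new Set(doc)) df.set(term, (df.get(term) ?? 0) + 1);
  }
  return docs.map((doc) => {
    const tf = new Map<string, number>();
    for (const term of doc) tf.set(term, (tf.get(term) ?? 0) + 1);
    const vec = new Map<string, number>();
    for (const [term, count] of tf) {
      vec.set(term, (count / doc.length) * Math.log(docs.length / (df.get(term) ?? 1)));
    }
    return vec;
  });
}

// Cosine similarity between two sparse vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, value] of a) {
    dot += value * (b.get(term) ?? 0);
    normA += value * value;
  }
  for (const value of b.values()) normB += value * value;
  return normA && normB ? dot / (Math.sqrt(normA) * Math.sqrt(normB)) : 0;
}
```

Scoring each comment is then `cosine(issueVector, commentVector)`: a comment sharing distinctive terms with the issue scores higher than an off-topic one.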
> My peer got back to me regarding the search engine recommendation
>
> TF-IDF (Term Frequency-Inverse Document Frequency) is a classic algorithm used in search and information retrieval to evaluate how important a word is to a document relative to a collection of documents (often referred to as a "corpus"). It helps identify which terms are most relevant to the context of a specific document.
In this case, I am not sure how this is relevant. Here we are assigning scores within a single comment thread's context, and, in terms of assigning relevance, the comments are mutually exclusive from those in other comment threads.
> In the Context of Your Goals: Evaluating GitHub Comments
>
> Given your objective to measure the value of GitHub comments in relation to problem-solving, TF-IDF could be a useful tool to assess the relevance and informational density of individual comments with respect to the overall issue or conversation.
>
> Here's how TF-IDF might be applied in your scenario:
>
> 1. How TF-IDF Works
> - Term Frequency (TF): Measures how frequently a term appears in a comment. Higher frequencies suggest that the term is more important within that comment.
>
> 2. Applying TF-IDF to Evaluate Comment Relevance:
>
> Identifying Key Terms in Comments:
> - TF-IDF will highlight terms in each comment that are not just common but are also distinctive in the context of the overall issue. This helps identify comments with unique and relevant insights.
>
> Assessing Relevance to the Issue Description:
> - By comparing the TF-IDF scores of words in a comment to those in the issue description, you can measure how closely a comment aligns with the core problem. Comments with terms that have high TF-IDF relevance scores relative to the issue description are more likely to be valuable.
Just for context: TF-IDF is a transformation technique that gives out a real-valued vector, to which we then apply some distance metric like cosine similarity. This is very similar to embedding-based vector search.
> Filtering Out Low-Value Contributions:
> - Comments that consist primarily of high-TF but low-IDF terms (e.g., generic phrases or filler words) can be identified as less valuable. This is particularly useful for identifying verbose comments from junior developers that lack unique insights.
These are not fixed. In the linked issue spec, comments were relevant to the topic but were flagged irrelevant. This will not be an issue here, as we would in any case stem or lemmatize the input phrases and tag for POS (parts of speech).
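As a toy illustration of that normalization step (a real pipeline would use a proper stemmer such as Porter's, or a lemmatizer guided by POS tags; the naive suffix-stripper below is only a sketch):

```typescript
// Naive suffix-stripping "stemmer", for illustration only. A real pipeline
// would use a Porter stemmer or a lemmatizer guided by POS tags.
function toyStem(word: string): string {
  for (const suffix of ["ing", "ed", "es", "s"]) {
    // Only strip when enough of the word remains to stay recognizable.
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Normalize a phrase so variants like "fixing bugs" and "fix bug"
// map to the same terms before relevance scoring.
function normalize(phrase: string): string[] {
  return (phrase.toLowerCase().match(/[a-z]+/g) ?? []).map(toyStem);
}
```

With normalization in place, "fixing bugs" and "fix bug" count as the same terms, so on-topic comments are less likely to be flagged irrelevant over surface word forms.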
> 3. Enhancing Your Continuum-Based Scoring System:
> - Weighted Relevance Scores: Use TF-IDF scores to assign a relevance weight to each comment, allowing you to rank comments on a continuum of importance rather than using binary relevance.
> - Combining with Other Metrics: Integrate TF-IDF scores with other continuous metrics (e.g., semantic similarity, readability) to create a comprehensive scoring system that reflects both the specificity and value of a comment.
>
> Practical Steps for Implementation:
> - Preprocess the Data: Tokenize the comments and issue descriptions, remove stop words, and normalize the text (e.g., lowercase conversion).
> - Calculate TF-IDF: Apply TF-IDF to generate relevance scores for each comment.
> - Score Aggregation: Aggregate the TF-IDF scores to quantify each comment's overall contribution to solving the issue.
>
> Benefits for Your Goals:
> - Objective Measurement of Relevance: TF-IDF provides a quantitative way to gauge how closely comments relate to the problem at hand.
> - Filtering Out Noise: Helps distinguish between high-value contributions and generic or off-topic comments.
I don't think this is possible. We would need some dictionary (e.g., WordNet) to assign values to words. This would not cater to specific words in comments such as `bug` or `fix`; these words on their own will not have any value and may appear off topic.
> - Complementary to Other Techniques: Can be combined with PageRank, readability scores, or semantic similarity measures for a more holistic evaluation.
>
> Using TF-IDF will give you an effective way to measure the informational value and relevance of comments, aligning well with your goal of continuum-based scoring. Let me know if you'd like to dive deeper into any specific aspect of this approach!
TF-IDF is a good starting point, but I don't believe it suits this problem well. We need to assign scores or relevances to comments, and since no two comment threads will have the same set of high-TF-IDF words, this could penalize terms that are highly relevant to an individual thread's context but not to multiple comment threads as a whole.
I came up with a new approach: categorize comments into topic bins (the topic would be assigned using an LLM/ML model). We can then perform a similarity search using the `topic`, `issue_title`, and `issue_body` to generate a Topic-Comment-Alignment (TCA) score for each comment.
Next, we can assess user engagement for each comment based on signals such as reactions and replies. The weight assigned to different types of engagement can vary depending on the commenter's role (e.g., `Author`, `Collaborator`, etc.).
Additionally, we’ll incorporate a credibility score to evaluate whether a comment was made by a verified member of the organization, a regular collaborator, or an unknown user.
The overall score could be calculated using the following formula:
$$ \text{Final Score} = \frac{(TCA \times E + C)}{W} $$
Where:
- $TCA$ is the Topic-Comment-Alignment score,
- $E$ is the engagement score,
- $C$ is the credibility score,
- $W$ is a normalization weight.

This will allow us to effectively evaluate the quality and relevance of comments.
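A sketch of how this formula could look in code; the role table and engagement weights below are illustrative assumptions, not values agreed in this thread:

```typescript
// Illustrative role-based credibility weights (assumed values).
type Role = "author" | "collaborator" | "member" | "unknown";
const CREDIBILITY: Record<Role, number> = {
  author: 1.0,
  collaborator: 0.8,
  member: 0.6,
  unknown: 0.3,
};

interface CommentSignals {
  topicAlignment: number; // TCA, e.g. a similarity score in [0, 1]
  reactions: number;
  replies: number;
  authorRole: Role;
}

// E: engagement, weighting reactions above replies (weights are assumptions).
function engagement(s: CommentSignals): number {
  return 1 + 0.5 * s.reactions + 0.25 * s.replies;
}

// Final Score = (TCA * E + C) / W
function finalScore(s: CommentSignals, w = 2): number {
  return (s.topicAlignment * engagement(s) + CREDIBILITY[s.authorRole]) / w;
}
```

The score grows with topic alignment and engagement, while credibility acts as a role-dependent baseline and W rescales the result.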
We can adjust the credibility score to the author regardless of their position/relationship with the organization.
The spec author generally has the clearest vision for the task, so if a comment aligns with their view (i.e., they agree), then more credit should be offered (this, of course, only in the context of funded tasks).
We usually have a very limited number of reactions, but I think reactions from the author and the core team could be a positive indicator.
If we can attribute block quotes, that could be interesting. The problem is that I generally comment from mobile, where block quotes can be inconvenient, but sometimes I make sure to use them to enhance clarity. I would be curious to experiment with crediting attributed block quotes.
Otherwise, are the method and scoring criteria fine? @Keyrxng rfc; I think this should be good enough.
@sshivaditya2019, this task has been idle for a while. Please provide an update.
Qualitative and quantitative analysis have unexpected results relative to how I implemented them in v1. Research and refine.
Originally posted by @0x4007 in https://github.com/ubiquibot/command-start-stop/issues/14#issuecomment-2308672581