ubiquibot/conversation-rewards


feat: rewards are evaluated from the user's comments and implications for a given issue #3

Closed: gentlementlegen closed this 3 months ago

gentlementlegen commented 3 months ago

Resolves #2

0x4007 commented 3 months ago

I'll make a coffee and check on my computer. This deserves a proper review

gentlementlegen commented 3 months ago

@pavlovcik I'm opening this pull request early so you can get an idea of what's going on and where we are at. The idea behind the new architecture of this module is a chain of Transformers that you can enable/disable and configure individually. This makes it easier to add new ones later and to test each of them in isolation. Configuration is basic at the moment, but I thought we could later add a decorator system or something similar to easily configure every setting.
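A minimal sketch of that chain-of-transformers idea, with hypothetical names (the actual classes and interfaces in this PR may be structured differently):

```typescript
// Hypothetical sketch: each transformer can be toggled and configured on its own,
// and the chain runs the enabled ones in order over a shared result object.
interface Transformer {
  enabled: boolean;
  transform(result: Record<string, unknown>): Promise<Record<string, unknown>>;
}

class ProcessorChain {
  constructor(private readonly _transformers: Transformer[]) {}

  // Each enabled transformer receives the previous result and enriches it.
  async run(initial: Record<string, unknown>): Promise<Record<string, unknown>> {
    let result = initial;
    for (const transformer of this._transformers) {
      if (transformer.enabled) {
        result = await transformer.transform(result);
      }
    }
    return result;
  }
}
```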

The output is a JSON-formatted string that can be saved to a file or printed directly to stdout, which makes it simple for any other program/action to retrieve the final output.

Also, the comments are evaluated one by one instead of sending the whole list at once, in order to reduce payload size and stay within GPT-3.5's limits. I don't know how much this changes the results. Results are not batched at the moment either; I do not know how much difference that makes, so let me know if you find large divergences in the results.

What is left to do

0x4007 commented 3 months ago

Consider resolving conversations as you complete those.

gentlementlegen commented 3 months ago

@pavlovcik I added details in the README showing the current structure of the result and the calculation that is applied. I'm not sure it is 100% identical to the current bot; most results are close but not quite the same. Feel free to try it, as the whole workflow should be functional.

0x4007 commented 3 months ago

What's the best way to test? Maybe making a local development server is most efficient?

gentlementlegen commented 3 months ago

> What's the best way to test? Maybe making a local development server is most efficient?

Just running the tests is sufficient to exercise the logic itself; you can set whichever issue link you want in the env. Also, ChatGPT is currently disabled during tests to keep them quick, but you can re-enable it.

Maybe the best way to get real test cases is to mock the GitHub API as we did before, so we can have expected results. I will add this as well.
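For reference, a minimal Jest sketch of that mocking approach; the module path, helper name, and fixture shape below are hypothetical, not the ones actually used in this repository:

```typescript
// Hypothetical helper that wraps the GitHub API call we want to mock.
import { getIssueComments } from "../src/github";

// Jest hoists this above the import, so the test never hits the real API.
jest.mock("../src/github", () => ({
  getIssueComments: jest.fn(),
}));

describe("conversation rewards", () => {
  it("evaluates rewards from mocked comments", async () => {
    (getIssueComments as jest.Mock).mockResolvedValue([
      { user: { login: "alice" }, body: "I think the root cause is the parser." },
      { user: { login: "bob" }, body: "Agreed, opening a fix." },
    ]);

    // With the GitHub API mocked, the output is deterministic and can be
    // asserted against an expected-results fixture.
    const comments = await getIssueComments("https://github.com/owner/repo/issues/1");
    expect(comments).toHaveLength(2);
  });
});
```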

0x4007 commented 3 months ago

It would probably make sense to make an action to test.

Then I can see it from mobile.

gentlementlegen commented 3 months ago

@pavlovcik Because the Jest file is read from the develop branch, it doesn't offer the workflow dispatch before we merge it. Here is a run on my own Actions repo: https://github.com/FernandVEYRIER/conversation-rewards/actions/runs/8370670552/job/22918418831

0x4007 commented 3 months ago

https://github.com/FernandVEYRIER/conversation-rewards/actions/runs/8370670552/job/22918418831#step:5:16

This trailing 0001 signals to me that you are not using BigNumber for your math operations. This is necessary in JavaScript for exactly this reason.

Otherwise it looks great! Good work.

gentlementlegen commented 3 months ago

@pavlovcik Thank you. I fixed the floating-point precision issues with Decimal.js and re-ran the Jest Action with a real OpenAI key, so you can see real results here: https://github.com/gentlementlegen/conversation-rewards/actions/runs/8384768773/job/22962616529
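For context, a small sketch of the precision artifact and how decimal.js avoids it; the reward numbers here are made up:

```typescript
import Decimal from "decimal.js";

const base = 0.1;
const multiplier = 3;

// Native JS floats accumulate binary rounding error:
console.log(base * multiplier); // 0.30000000000000004

// decimal.js keeps the arithmetic exact in decimal terms:
const reward = new Decimal(base).times(multiplier);
console.log(reward.toString()); // "0.3"
console.log(reward.toFixed(2)); // "0.30"
```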

0x4007 commented 3 months ago

Reviewing the action run data again: I noticed that the relevance scores all seem to have only one significant digit. This suggests to me that you are not running multiple samples and averaging them. Please make sure to do that, because ChatGPT returns outliers.

"relevance": 0.2
"relevance": 0.1
"relevance": 0.2
"relevance": 0.8
"relevance": 0.2
"relevance": 0.1
"relevance": 0.2
"relevance": 0.2
"relevance": 0.6
"relevance": 0.1
"relevance": 0.2
"relevance": 0.2
"relevance": 0.9
"relevance": 0.4
"relevance": 0.8

In my implementation I run 10x samples per comment then return an average.

This approach is a bit janky but it seems the simplest to implement and fairly accurate.
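A hedged sketch of that sampling-and-averaging approach; fetchRelevance is a hypothetical stand-in for whatever function actually queries the model:

```typescript
// Run the same relevance query several times and average the scores,
// so a single outlier response does not skew the reward.
async function averageRelevance(
  comment: string,
  fetchRelevance: (comment: string) => Promise<number>,
  samples = 10
): Promise<number> {
  const results: number[] = [];
  for (let i = 0; i < samples; i++) {
    results.push(await fetchRelevance(comment));
  }
  return results.reduce((sum, value) => sum + value, 0) / results.length;
}
```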

gentlementlegen commented 3 months ago

@pavlovcik I can add the 10x sampling, no problem. I just wanted to mention that since I split the comments to handle the length limit with GPT, this would run 10 x comments.length x users requests. Would that be OK in terms of budget?

Here is the latest run with the sampling https://github.com/gentlementlegen/conversation-rewards/actions/runs/8387580202/job/22970034091

0x4007 commented 3 months ago

> would that be ok in terms of budget

I have no idea. For reference, we were previously averaging $5 a month with the old setup. This new setup implies 10x, so that's $50 a month, which is quite steep for raw costs.

I wonder if there is a more efficient approach to this. What about the old way of batching the entire conversation in one go?

gentlementlegen commented 3 months ago

@pavlovcik It is billed per token, not per API call, so this assumption is not true. But since we send a full prompt every time (so plenty of tokens), the cost increase would be:

Old: (prompt.length + content.length * n) * 10
New: (prompt.length + content.length) * n * 10

So around a 300% increase in tokens. However, there would be less chance of needing to switch to a bigger model like the 16k variant or GPT-4. That said, GPT-4 pricing is also different, so we cannot say exactly; we can only experiment. Yes, I could also re-implement it the way it was before if you think we will get better pricing!
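To make the comparison concrete, here is the same rough cost model with made-up token counts; the real prompt and comment sizes will differ:

```typescript
// Illustrative numbers only; actual prompt and comment lengths (in tokens) vary.
const promptTokens = 800;   // instructions sent with every request
const commentTokens = 200;  // average tokens per comment
const comments = 10;        // n
const samples = 10;         // 10x sampling

// Batched ("old"): one prompt plus all comments, sampled 10 times.
const oldCost = (promptTokens + commentTokens * comments) * samples;

// Per-comment ("new"): the prompt is re-sent for every comment, 10 samples each.
const newCost = (promptTokens + commentTokens) * comments * samples;

console.log(oldCost, newCost, (newCost / oldCost).toFixed(2));
// 28000 100000 "3.57" -> roughly a 3-4x token increase under these assumptions
```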

0x4007 commented 3 months ago

Let's aim to cap the increase at $10 a month for now and go from there. Take a best guess; it doesn't matter if it's not totally accurate.

gentlementlegen commented 3 months ago

@pavlovcik I rewrote the evaluation logic to use the same system as before, i.e. sending the full batch of comments and averaging the whole result, so the costs should remain in the same range. Let's keep an eye on this just in case!

https://github.com/gentlementlegen/conversation-rewards/actions/runs/8399447370/job/23005555201
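For illustration, a sketch of that batched evaluation, assuming a hypothetical askModel helper in place of the actual OpenAI call:

```typescript
// All comments go into a single prompt, and the model returns one relevance
// score per comment, so token usage stays close to the previous implementation.
async function evaluateBatch(
  comments: { id: number; body: string }[],
  askModel: (prompt: string) => Promise<string>
): Promise<Map<number, number>> {
  const prompt = [
    "Rate the relevance of each comment to the issue on a scale from 0 to 1.",
    'Reply with a JSON object mapping comment id to score, e.g. {"1": 0.8}.',
    ...comments.map((comment) => `#${comment.id}: ${comment.body}`),
  ].join("\n");

  const raw = await askModel(prompt);
  const parsed = JSON.parse(raw) as Record<string, number>;

  const scores = new Map<number, number>();
  for (const [id, score] of Object.entries(parsed)) {
    scores.set(Number(id), score);
  }
  return scores;
}
```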

0x4007 commented 3 months ago

@whilefoo can you confirm that this is good to go and then we can merge

0x4007 commented 3 months ago

> @whilefoo can you confirm that this is good to go and then we can merge

Thanks for the thorough review! However, I was hoping you would be able to provide guidance on how the kernel will interface with this plugin, @whilefoo. Or do you think this is pretty much ready for testing?

0x4007 commented 3 months ago

@gentlementlegen just a heads up, but your local git client seems to be out of sync with your GitHub profile somehow. There are no more avatars next to your commits.