kyegomez / tree-of-thoughts

Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%
https://discord.gg/qUtxnK2NMf
Apache License 2.0
4.29k stars 361 forks

Evaluations from Paper #9

Closed kyegomez closed 1 year ago

kyegomez commented 1 year ago

Action Item: Create a list of example evaluations and measure performance on the tasks from the paper: https://arxiv.org/pdf/2305.10601.pdf

Some potential eval metrics:

Accuracy - Measure how often the model correctly predicts the output.

F1 Score - Harmonic mean of precision and recall, balancing the two in a single number.

Precision - Fraction of positive predictions that are correct (true positives / all predicted positives).

Recall - Fraction of actual positive instances that are correctly predicted (true positives / all actual positives).

Mean Average Precision (mAP) - Mean of the average precision computed per class (or per query), summarizing performance across multiple classes.

Receiver Operating Characteristic (ROC) - Curve of the true positive rate against the false positive rate as the decision threshold varies.

Area Under Curve (AUC) - Area under the ROC curve, summarizing it as a single number (1.0 is perfect, 0.5 is chance level).

Mean Squared Error (MSE) - Measure the average squared difference between the predicted and actual values.

Mean Absolute Error (MAE) - Measure the average absolute difference between the predicted and actual values.
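For reference, here is a minimal pure-Python sketch of the simpler metrics in the list above (accuracy, precision, recall, F1, MSE, MAE). The function names and the `positive` parameter are illustrative and not part of this repo. It assumes binary labels for the classification metrics and returns 0.0 when a denominator would be zero, which is one common convention rather than the only one.

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1):
    """Return (tp, fp, fn, tn) for a binary task; `positive` marks the positive label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """True positives divided by all predicted positives (0.0 if none were predicted)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """True positives divided by all actual positives (0.0 if there are none)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall."""
    p = precision(y_true, y_pred, positive)
    r = recall(y_true, y_pred, positive)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between predicted and actual values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between predicted and actual values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```

For solved/unsolved tasks such as the ones in the paper, the reported "success" rate corresponds to accuracy over the test puzzles. The ranking and threshold metrics (mAP, ROC, AUC) need per-example scores and are probably better taken from an existing library than reimplemented.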

kyegomez commented 1 year ago

1. Game of 24: Game of 24 is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations (+, -, *, /) to obtain 24. For example, given input “4 9 10 13”, a solution output could be “(10 - 4) * (13 - 9) = 24”.
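To score model outputs on this task automatically, something like the checker below could work. It is a sketch, not code from this repo: `is_valid_solution` is a hypothetical name, and it assumes the answer is a single arithmetic expression, optionally followed by "= 24". It parses the expression with `ast` instead of calling `eval`, uses `Fraction` so division is exact, and requires each input number to be used exactly once.

```python
import ast
from collections import Counter
from fractions import Fraction

# Only the four basic binary operators are allowed.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate a restricted expression tree, recording every integer literal in `used`."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return Fraction(node.value)
    raise ValueError("unsupported expression")


def is_valid_solution(puzzle: str, answer: str, target: int = 24) -> bool:
    """Check that `answer` uses exactly the numbers in `puzzle` and evaluates to `target`."""
    numbers = [int(tok) for tok in puzzle.split()]
    expression = answer.split("=")[0].strip()  # drop a trailing "= 24" if present
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target


print(is_valid_solution("4 9 10 13", "(10 - 4) * (13 - 9) = 24"))  # True
print(is_valid_solution("4 9 10 13", "4 + 9 + 10 + 13"))           # False (sums to 36)
```

Exact fractions matter for puzzles whose solution passes through a non-integer, such as "1 5 5 5" with 5 * (5 - 1 / 5), where floating-point arithmetic may not compare equal to 24.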

kyegomez commented 1 year ago

Set up parallel experiments with other reasoning techniques such as CoT.
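One possible shape for such a comparison is a small harness that runs every method over the same puzzles and reports a success rate per method. This is only a sketch: `success_rates`, the stub solvers, and the exact-match checker are hypothetical stand-ins, and real IO, CoT, or ToT runs would replace the stubs with functions that call a model.

```python
from typing import Callable, Dict, Iterable


def success_rates(
    methods: Dict[str, Callable[[str], str]],
    puzzles: Iterable[str],
    check: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run each method on the same puzzles and return the fraction it solves."""
    puzzle_list = list(puzzles)
    rates = {}
    for name, solve in methods.items():
        solved = sum(1 for puzzle in puzzle_list if check(puzzle, solve(puzzle)))
        rates[name] = solved / len(puzzle_list)
    return rates


# Hypothetical reference answers, used only so the example runs on its own.
reference = {
    "4 9 10 13": "(10 - 4) * (13 - 9)",
    "1 2 3 4": "1 * 2 * 3 * 4",
}


def exact_match(puzzle: str, answer: str) -> bool:
    """Toy checker: a real run would verify the arithmetic instead of matching strings."""
    return reference.get(puzzle) == answer


# Stub "methods" standing in for model-backed solvers.
methods = {
    "stub_empty": lambda puzzle: "",
    "stub_first_only": lambda puzzle: reference[puzzle] if puzzle == "4 9 10 13" else "",
    "stub_lookup": lambda puzzle: reference[puzzle],
}

for name, rate in success_rates(methods, reference.keys(), exact_match).items():
    print(f"{name}: {rate:.1%}")
```

Keeping the puzzle set and the checker fixed across methods is what makes the resulting percentages comparable to each other.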
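
Game of 24 success rates quoted from the paper:

| Method | Success |
| --- | --- |
| IO prompt | 7.3% |
| CoT prompt | 4.0% |
| CoT-SC (k=100) | 9.0% |
| ToT (ours) (b=1) | 45% |
| ToT (ours) (b=5) | 74% |
| IO + Refine (k=10) | 27% |
| IO (best of 100) | 33% |
| CoT (best of 100) | 49% |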