Our current evaluation dataset is too large to inspect manually at every iteration. As suggested in the Feedback from 2023-12-05 (#49), we should create about 10 examples for the following three levels of difficulty:
[x] Complete comments, and complete simple operations described in comments (e.g. math operators or built-in functions like sum())
[x] Complete code that requires recent packages ($\rightarrow$ test dynamic knowledge)
[x] Sparse context (i.e. no guidance, missing context / intent; but consider freedom of choice $\rightarrow$ do bias analysis, BERTopic)
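
To make the first level concrete, here is a minimal sketch of how such a small, manually inspectable example set could be stored. Everything in it is an assumption for illustration: the `EvalExample` class, the level labels in `LEVELS`, and the sample prompt are hypothetical and not taken from the existing dataset.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for the three difficulty levels listed above.
LEVELS = ("comments", "recent_packages", "sparse_context")


@dataclass(frozen=True)
class EvalExample:
    level: str      # one of LEVELS
    prompt: str     # code context shown to the model
    reference: str  # one acceptable completion (illustrative only)


# A single made-up level-1 example: a simple operation described by a comment.
examples = [
    EvalExample(
        level="comments",
        prompt="values = [1, 2, 3]\n# add up all values\ntotal = ",
        reference="sum(values)",
    ),
]


def count_per_level(items):
    """Count examples per level to track progress towards ~10 per level."""
    counts = Counter(item.level for item in items)
    return {level: counts.get(level, 0) for level in LEVELS}


def run_reference(example):
    """Execute prompt + reference completion and return the resulting names."""
    namespace = {}
    exec(example.prompt + example.reference, namespace)
    return namespace
```

Calling `count_per_level(examples)` shows which levels still need examples, and `run_reference` gives a quick sanity check that a reference completion actually runs.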