Recently, while reproducing the performance of KB-BINDER on GrailQA, I used gpt_3.5_turbo_0613_16k with the temperature set to 0.7. The EM results were 38.4 for compositional types, 71.2 for i.i.d. types, and 46.0 for zero-shot types.
The particularly low performance on compositional types raises questions. Does anyone have any thoughts on this? Perhaps such gaps are normal when reproducing the results.
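For reference, EM (exact match) on GrailQA-style evaluation typically counts a prediction as correct only when it matches the gold answer exactly. The sketch below is illustrative only; the function name and inputs are my own, not from the KB-BINDER codebase, which may compute EM over logical forms rather than raw answers.

```python
def exact_match(predictions, golds):
    """Return the EM score (in %) over paired predictions and gold answers.

    A prediction scores only if it is exactly equal to its gold answer.
    """
    assert len(predictions) == len(golds), "lists must be aligned"
    if not golds:
        return 0.0
    hits = sum(1 for p, g in zip(predictions, golds) if p == g)
    return 100.0 * hits / len(golds)

# Example usage: 2 of 3 predictions match the gold answers.
score = exact_match(["a", "b", "c"], ["a", "b", "x"])
```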
Maybe it is because the original paper uses code-davinci, which is no longer available. Using gpt-3.5 to reproduce the experiments will lead to different results.