hkust-zhiyao / RTL-Coder

A new LLM solution for RTL code generation, achieving state-of-the-art performance among non-commercial solutions and outperforming GPT-3.5.

Reproducing Results #10

Open j40903272 opened 2 months ago

j40903272 commented 2 months ago

Hi,

Thanks for the nice work! I have two questions about reproducing the results.

First of all, is there a script for generating the GPT-4 results? I got 47.7 pass@5 for gpt-4o and slightly worse for gpt-4-turbo. I understand that GPT-4 is constantly being updated, but I would like the results to be aligned as closely as possible.

Second, the paper mentions that the results are selected using three different temperatures {0.2, 0.5, 0.8}. Is the best performance selected per problem, or is the best of the three overall scores reported?

Thank you and looking forward to your reply.

DevinShang commented 1 month ago

Hi, thanks for your interest!

Regarding your first question, I guess you are referring to the functionality score of GPT-4 on RTLLM1.1. During our experiments, we found that GPT-4's generated results for the prompts in rtllm1.1 often contain unclean code, with uncommented extra content interspersed within it, which can lower the code's pass rate. Therefore, we manually removed the irrelevant content from its generated results.

As for your second question, we generated code for each score (pass@1, 5, 10) using three different temperatures {0.2, 0.5, 0.8}, and then selected the best pass rate among the three temperature configurations for the corresponding score.

Hope this can help you.
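For concreteness, the selection procedure described above could be sketched roughly as follows: compute pass@k per temperature with the standard unbiased estimator (Chen et al., 2021), average over problems, and report the maximum over the three temperatures. This is a minimal sketch of my reading of the reply, not the repo's actual evaluation script; the function names and the sample counts in the toy example are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: n samples generated,
    # c of them pass, k is the evaluation budget.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def best_over_temperatures(results: dict, k: int):
    # results maps temperature -> list of (n, c) pairs, one per problem.
    # Average pass@k over problems for each temperature, then take the
    # best temperature, mirroring the per-score selection described above.
    scores = {
        temp: sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
        for temp, per_problem in results.items()
    }
    best_temp = max(scores, key=scores.get)
    return best_temp, scores

# Toy example with made-up pass counts (10 samples per problem).
results = {
    0.2: [(10, 3), (10, 0)],
    0.5: [(10, 5), (10, 2)],
    0.8: [(10, 4), (10, 1)],
}
best_t, scores = best_over_temperatures(results, k=5)
```

In this sketch the reported pass@5 would be `scores[best_t]`, i.e. the best of the three temperature-level averages, rather than a per-problem best, matching the reply's description.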