microsoft / CodeT


Issue with token length of input prompt in codegen-inference.py #40

Open Ling-JM opened 1 month ago

Ling-JM commented 1 month ago

Thank you very much for your contributions. While attempting to implement the RepoCoder method, we encountered an issue in the codegen_inference.py file. When we tried to use the prompts/rg-one-gram-ws-20-ss-2.jsonl file with the codegen-350M-mono model for code generation, we got an error about the token length of the input prompt. [error screenshot]

I also carefully reviewed issue #32 and understood the proposed solution. However, when I reduce the length of the retrieved context or remove context from the starting lines, I cannot reproduce the results you report in Table 2 (a, b). Could you provide a more detailed solution, or the code you used to achieve those results?
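
For anyone hitting the same length error, here is a minimal sketch (not the authors' code) of how the prompt could be trimmed to fit the model's context window. The 2048-token limit matches codegen-350M-mono, while the `max_new_tokens` budget and the left-truncation strategy are assumptions for illustration only.

```python
# Sketch only: trim a prompt so prompt + generation fits in the 2048-token
# context of Salesforce/codegen-350M-mono. The generation budget
# (max_new_tokens) is an assumed value, not taken from the repository.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

def truncate_prompt(prompt: str, max_context: int = 2048, max_new_tokens: int = 100) -> str:
    budget = max_context - max_new_tokens  # tokens left for the prompt itself
    ids = tokenizer.encode(prompt)
    if len(ids) > budget:
        ids = ids[-budget:]  # keep the tail, i.e. the code closest to the completion point
    return tokenizer.decode(ids)
```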

pppyb commented 1 month ago

Hi, I was the one who raised issue #32, but the solutions I tried at the time didn't actually fix my problem. I eventually found that I had forgotten to switch the tokenizer to the correct one in two places, i.e. change codex to codegen: https://github.com/microsoft/CodeT/blob/35f54d60b152cc31d134b788e702878ad613d9f7/RepoCoder/run_pipeline.py#L30 and https://github.com/microsoft/CodeT/blob/35f54d60b152cc31d134b788e702878ad613d9f7/RepoCoder/run_pipeline.py#L46. I don't know whether you are in the same situation, but I think this may solve your problem.
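
To make the suggestion concrete, here is a rough illustration (not the repository's code) of why the tokenizer choice matters: a prompt whose length was checked with a Codex-style tokenizer can still overflow the CodeGen model's context, because the two tokenizers count tokens differently. `gpt2` is used below only as a stand-in for a Codex-style BPE tokenizer, and `example_prompt.txt` is a hypothetical file holding one prompt from the .jsonl.

```python
# Rough illustration: the same prompt has different token counts under different
# tokenizers, so a prompt trimmed against a Codex-style tokenizer may still
# exceed codegen-350M-mono's 2048-token window. "gpt2" is only a stand-in.
from transformers import AutoTokenizer

codegen_tok = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

prompt = open("example_prompt.txt").read()  # hypothetical: one prompt from the .jsonl file
print("CodeGen tokens:    ", len(codegen_tok.encode(prompt)))
print("GPT-2-style tokens:", len(gpt2_tok.encode(prompt)))
```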