Sometimes the tokenizer won't return the same number of tokens, so you may need to lower the token limit a bit in https://github.com/microsoft/CodeT/blob/35f54d60b152cc31d134b788e702878ad613d9f7/RepoCoder/build_prompt.py#L15 and L18. For example, only allow 1900/900 tokens of retrieved content for Codex/CodeGen, respectively.
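A minimal sketch of that headroom idea, assuming the retrieved context is trimmed while the prompt is assembled (the helper name, variable names, and budget below are illustrative, not RepoCoder's actual code):

```python
from transformers import AutoTokenizer

# Sketch only: cap the retrieved context well below the 2048-token window so
# re-tokenization jitter cannot push the final prompt over the limit.
# The budget and names here are illustrative, not RepoCoder's actual code.
tokenizer = AutoTokenizer.from_pretrained('Salesforce/codegen-350M-mono')
RETRIEVED_BUDGET = 900  # e.g. 900 tokens of retrieved content for CodeGen

def trim_retrieved_context(retrieved_context: str) -> str:
    token_ids = tokenizer.encode(retrieved_context)
    if len(token_ids) > RETRIEVED_BUDGET:
        # Keep only the first RETRIEVED_BUDGET tokens of the retrieved snippet.
        token_ids = token_ids[:RETRIEVED_BUDGET]
    return tokenizer.decode(token_ids)
```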
Thank you, Fengji. I followed your suggestions and made the modifications, but I am still encountering the same error. The change only reduced the retrieval context length, and the generated prompt still exceeds 2048 tokens.
Additionally, in codegen_inference.py, the truncation=True in the line `prompts = self.tokenizer(prompt_batch, return_tensors='pt', padding=True, truncation=True)` does not seem to take effect for some reason. If I manually force truncation, the quality of the generated code is very poor.
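One common cause of this in transformers: without an explicit max_length, truncation=True falls back to the tokenizer's model_max_length, which can be unset or effectively unbounded, and the default right-side truncation would cut exactly the code nearest the completion point anyway. A sketch of how that line could be made explicit, truncating from the left so the end of the prompt survives (recent transformers versions expose truncation_side; the 2048 budget is an assumption):

```python
# Sketch only, not the repo's actual fix: truncate from the left so the
# code immediately before the completion point is preserved.
self.tokenizer.truncation_side = 'left'  # available in recent transformers versions
prompts = self.tokenizer(
    prompt_batch,
    return_tensors='pt',
    padding=True,
    truncation=True,
    max_length=2048,  # in practice, subtract the generation budget, e.g. 2048 - 100
)
```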
Since the RuntimeError occurs within the internals of the transformers model, it seems there is no direct way to access the tensor causing the error:
```
Traceback (most recent call last):
  File "codegen_inference.py", line 78, in
```
Do you have any better solutions for the Salesforce/codegen-350M-mono model?
If you force truncation with the tokenizer, it will change the last line of code and thus affect the target hole of the code completion. A better way is to reduce the length of the retrieved context or to remove in-file context from the beginning lines. You should also check carefully whether 'rg-one-gram-ws-20-ss-24.jsonl' is generated correctly, since the code here https://github.com/microsoft/CodeT/blob/35f54d60b152cc31d134b788e702878ad613d9f7/RepoCoder/build_prompt.py#L77 explicitly controls the length of the prompt.
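A sketch of the second option, dropping lines from the top of the in-file context until the assembled prompt fits (reusing the tokenizer from the earlier sketch; the variable names and the reserved generation budget are assumptions, not the repo's code):

```python
MAX_PROMPT_TOKENS = 2048 - 100  # reserve ~100 tokens for generation (assumed budget)

def shrink_prompt(retrieved_context: str, infile_context: str) -> str:
    # Sketch only: drop leading in-file lines, never the lines adjacent to
    # the target hole at the end, until the whole prompt fits the window.
    infile_lines = infile_context.split('\n')
    while infile_lines:
        prompt = retrieved_context + '\n' + '\n'.join(infile_lines)
        if len(tokenizer.encode(prompt)) <= MAX_PROMPT_TOKENS:
            return prompt
        infile_lines.pop(0)
    return retrieved_context  # fall back to the retrieved context alone
```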
Thank you very much for your prompt response! I appreciate your suggestions and will attempt both solutions. However, I have some concerns regarding the process, as I am trying to replicate the results you presented in your paper on the codegen-350M-mono model, specifically those in Table 2 (a, b).
I followed the instructions in the README meticulously and made no changes to the code logic other than modifying the hardcoded input paths. The steps I followed are outlined below:

1. Ran the run_RG1_and_oracle_method function in run_pipeline.py to generate prompts/rg-one-gram-ws-20-ss-2.jsonl.
2. Ran codegen_inference.py to produce the prediction file prompts/rg-one-gram-ws-20-ss-2_codegen-350M-mono.jsonl.
3. Pointed prediction_path in run_pipeline.py to the newly generated prediction file and reran run_RepoCoder_method in run_pipeline.py to obtain prompts/repocoder-one-gram-ws-20-ss-2.jsonl.
4. Used prompts/repocoder-one-gram-ws-20-ss-2.jsonl as input to run codegen_inference.py again and get the results for the RepoCoder algorithm.
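As a quick sanity check on the prompt files produced in steps 1 and 3 (following the suggestion above to verify the .jsonl is generated correctly), something like the following sketch can flag over-long prompts before inference; the 'prompt' field name is an assumption about the file's schema:

```python
import json
from transformers import AutoTokenizer

# Sketch: flag prompts that exceed the CodeGen context window.
# The 'prompt' field name is an assumption about the .jsonl schema.
tokenizer = AutoTokenizer.from_pretrained('Salesforce/codegen-350M-mono')
with open('prompts/rg-one-gram-ws-20-ss-2.jsonl') as f:
    for i, line in enumerate(f):
        n_tokens = len(tokenizer.encode(json.loads(line)['prompt']))
        if n_tokens > 2048:
            print(f'line {i}: {n_tokens} tokens (> 2048)')
```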
Could you please let me know if you encountered any issues when producing the results documented in your paper? I am concerned there may be an issue with my process, since I strictly adhered to the steps above without altering any fundamental code logic.
Your guidance on this matter would be greatly appreciated.
The pipeline looks great. However, if you want the results for the 3rd and 4th iterations, you may need to change the mode here to 'r-g-r-g-r-g' or 'r-g-r-g-r-g-r-g' and then call run_RepoCoder_method again to obtain the prompt files for the continued rounds.
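For instance (illustrative, not a verbatim excerpt from run_pipeline.py; presumably each 'r-g' pair in the mode string is one retrieve-then-generate round):

```python
# Illustrative edit in run_pipeline.py:
mode = 'r-g-r-g-r-g'        # prompts for the 3rd iteration
# mode = 'r-g-r-g-r-g-r-g'  # prompts for the 4th iteration
# ...then call run_RepoCoder_method again to build the continued-round prompts.
```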
@pppyb - thanks for posting this detailed issue; it has helped me better understand how to use this repo. I still have one remaining doubt, though: did you implement codegen_inference.py yourself to query CodeGen, or is it part of the repository?
Thanks a ton!
Hey @kechenliuuu3469 -- apologies for the lack of response earlier on this. I think inference.py is part of the repository, and I hope issue #28 will solve your problem.
Thank you very much to the authors for their contributions. While attempting to run the RepoCoder method, we encountered an issue in the codegen_inference.py file. After modifying the file to:
we encountered the following error:
It seems that this file can only generate results for the in-file method.