microsoft / CodeT


Issue with Handling Input Prompt Files in codegen-inference.py #32

Closed pppyb closed 4 months ago

pppyb commented 5 months ago

Thank you very much to the authors for their contributions. While attempting to implement the RepoCoder method, we encountered an issue in the codegen_inference.py file. After modifying the file as follows:

[screenshot: the modified code]

we encountered the following error:

[screenshot: the resulting error]

It seems that this file can only generate results for the in-file method.

zfj1998 commented 5 months ago

Sometimes the tokenizer won't return the same number of tokens, so you may want to leave some slack in the token limits at https://github.com/microsoft/CodeT/blob/35f54d60b152cc31d134b788e702878ad613d9f7/RepoCoder/build_prompt.py#L15 and L18. For example, only allow 1900/900 tokens of retrieved content for Codex/CodeGen, respectively.
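A minimal sketch of that idea (the names below are illustrative, not the actual constants in build_prompt.py): keep adding retrieved snippets only while they fit a conservative token budget.

```python
from transformers import AutoTokenizer

# Illustrative only: cap the retrieved context at a conservative budget
# (e.g. 900 tokens for codegen-350M-mono, 1900 for Codex) so the assembled
# prompt stays safely under the 2048-token context window.
tokenizer = AutoTokenizer.from_pretrained('Salesforce/codegen-350M-mono')
MAX_RETRIEVED_TOKENS = 900

def truncate_retrieved_context(snippets, budget=MAX_RETRIEVED_TOKENS):
    kept, used = [], 0
    for snippet in snippets:
        n = len(tokenizer.encode(snippet))
        if used + n > budget:
            break  # drop the remaining, lower-ranked snippets
        kept.append(snippet)
        used += n
    return '\n'.join(kept)
```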

pppyb commented 5 months ago

Thank you, Fengji. I followed your suggestions and made the modifications, but I am still encountering the same error. The change only affects the retrieved context length, and the generated prompt still exceeds 2048 tokens.

[screenshot: the same error after the change]

Additionally, in codegen_inference.py, the truncation=True in the line prompts = self.tokenizer(prompt_batch, return_tensors='pt', padding=True, truncation=True) does not seem to take effect for some reason. If I manually force truncation, the quality of the generated code is very poor.

[screenshot: generation output after forced truncation]
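One possibility worth checking (a sketch under assumptions, not the repo's code): if the checkpoint's tokenizer config carries no model_max_length, truncation=True alone has no target length and silently does nothing; passing max_length makes the limit explicit, and left-side truncation keeps the end of the prompt where the completion hole is, at the cost of dropping earlier context (which would also explain the quality drop).

```python
from transformers import AutoTokenizer

# Sketch only; max_new_tokens and the example batch are assumptions.
tokenizer = AutoTokenizer.from_pretrained('Salesforce/codegen-350M-mono')
tokenizer.truncation_side = 'left'             # keep the lines nearest the hole
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # CodeGen defines no pad token

prompt_batch = ["def add(a, b):\n    return"]  # stand-in for the real batch
max_new_tokens = 100
prompts = tokenizer(prompt_batch, return_tensors='pt', padding=True,
                    truncation=True, max_length=2048 - max_new_tokens)
```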

Since the RuntimeError occurs within the internals of the transformers model, it seems there is no direct way to access the tensor causing the error:

Traceback (most recent call last):
  File "codegen_inference.py", line 78, in <module>
    cg.batch_generate(file_path)
  File "codegen_inference.py", line 59, in batch_generate
    gen_text.extend(self._generate_batch(batch))
  File "codegen_inference.py", line 40, in _generate_batch
    gen_tokens = self.model.generate(
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/generation_utils.py", line 1490, in generate
    return self.greedy_search(
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/generation_utils.py", line 2233, in greedy_search
    outputs = self(
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 693, in forward
    transformer_outputs = self.transformer(
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 578, in forward
    outputs = block(
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 304, in forward
    attn_outputs = self.attn(
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 251, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/home/miniconda3/envs/repocoder/lib/python3.8/site-packages/transformers/models/codegen/modeling_codegen.py", line 167, in _attn
    attn_weights = torch.where(causal_mask, attn_weights, mask_value)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (2049) at non-singleton dimension 3
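For reference, here is a quick way to spot the offending prompts before generate() is ever called (a sketch; the 'prompt' field name and the 100-token generation budget are assumptions about the .jsonl layout and settings):

```python
import json
from transformers import AutoTokenizer

# Sketch: report prompt-file entries that cannot fit the 2048-token context
# window of codegen-350M-mono once room for generation is reserved.
tokenizer = AutoTokenizer.from_pretrained('Salesforce/codegen-350M-mono')

with open('prompts/rg-one-gram-ws-20-ss-2.jsonl') as f:
    for i, line in enumerate(f):
        entry = json.loads(line)
        n_tokens = len(tokenizer.encode(entry['prompt']))  # assumed field name
        if n_tokens + 100 > 2048:
            print(f'line {i}: prompt is {n_tokens} tokens')
```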

Do you have any better solutions for the Salesforce/codegen-350M-mono model?

zfj1998 commented 5 months ago

If you force truncation with the tokenizer, it will cut off the last lines of the prompt and thus affect the target hole of the code completion. A better way is to reduce the length of the retrieved context or to remove the in-file context from the beginning lines. You should also check more carefully whether the 'rg-one-gram-ws-20-ss-24.jsonl' is generated correctly, since the code at https://github.com/microsoft/CodeT/blob/35f54d60b152cc31d134b788e702878ad613d9f7/RepoCoder/build_prompt.py#L77 explicitly controls the length of the prompt.
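A minimal sketch of the second option (how the prompt splits into retrieved and in-file parts depends on how build_prompt.py assembles it, so the two arguments here are assumptions): drop the earliest in-file lines until the assembled prompt fits.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Salesforce/codegen-350M-mono')

def trim_infile_context(retrieved_part, infile_part, max_prompt_tokens=1900):
    """Drop the earliest in-file lines until the assembled prompt fits.

    retrieved_part is assumed to end with a newline; infile_part ends at the
    completion hole, so lines are removed from its top, not its bottom.
    """
    lines = infile_part.splitlines()
    while lines:
        prompt = retrieved_part + '\n'.join(lines)
        if len(tokenizer.encode(prompt)) <= max_prompt_tokens:
            return prompt
        lines.pop(0)  # remove the line farthest from the hole
    return retrieved_part
```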

pppyb commented 5 months ago

Thank you very much for your prompt response! I appreciate your suggestions and will attempt both solutions. However, I have some concerns regarding the process, as I am trying to replicate the results you presented in your paper on the codegen-350M-mono model, specifically those in Table 2 (a, b).

I followed the instructions provided in the README file meticulously and made no changes to the code logic other than modifying the hardcoded input paths. The steps I followed are outlined below (a rough code sketch of the sequence follows the list):

  1. I ran the run_RG1_and_oracle_method function in run_pipeline.py to generate prompts/rg-one-gram-ws-20-ss-2.jsonl.
  2. I then executed codegen_inference.py to produce the prediction file: prompts/rg-one-gram-ws-20-ss-2_codegen-350M-mono.jsonl.
  3. I updated the prediction_path in run_pipeline.py to the newly generated prediction file and reran the run_RepoCoder_method in run_pipeline.py to obtain prompts/repocoder-one-gram-ws-20-ss-2.jsonl.
  4. Lastly, I used the prompts/repocoder-one-gram-ws-20-ss-2.jsonl as an input to run codegen_inference.py again to get the results for the repocoder algorithm.
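
Here is a rough sketch of that sequence (the argument lists of the two functions are assumptions, and the real run_pipeline.py hard-codes its paths, so treat this purely as an outline of the order of operations):

```python
# Outline only; signatures and paths are assumptions, not the repo's API.
from run_pipeline import run_RG1_and_oracle_method, run_RepoCoder_method

# 1. build the first-round retrieval-augmented prompts
run_RG1_and_oracle_method()   # -> prompts/rg-one-gram-ws-20-ss-2.jsonl

# 2. run codegen_inference.py on that prompt file to obtain
#    prompts/rg-one-gram-ws-20-ss-2_codegen-350M-mono.jsonl

# 3. point prediction_path in run_pipeline.py at the new prediction file, then
run_RepoCoder_method()        # -> prompts/repocoder-one-gram-ws-20-ss-2.jsonl

# 4. run codegen_inference.py again on the RepoCoder prompt file
```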

Could you please let me know if you encountered any issues when producing the results documented in your paper? I am concerned there may be an issue with my process, even though I am strictly adhering to the steps mentioned without altering any fundamental code logic.

Your guidance on this matter would be greatly appreciated.

zfj1998 commented 5 months ago

The pipeline looks great. However, if you want to get the results for the 3rd and 4th iterations, you may need to change the mode here to 'r-g-r-g-r-g' or 'r-g-r-g-r-g-r-g' and then call run_RepoCoder_method again to obtain the prompt files for the continued rounds.
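In other words (a hypothetical sketch; the actual variable name and call site in run_pipeline.py may differ):

```python
# Each 'r-g' pair is one retrieve-then-generate round used to build prompts.
mode = 'r-g-r-g-r-g'        # prompts for the 3rd iteration
# mode = 'r-g-r-g-r-g-r-g'  # prompts for the 4th iteration
```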

kechenliuuu3469 commented 4 months ago

@pppyb - thanks for posting this detailed issue. It has helped me understand the process of using this repo better. Although, I still do have one doubt remaining - did you implement codegen-inference.py yourself to query codegen? Or is this a part of the repository?

Thanks a ton!

pppyb commented 4 months ago


Hey @kechenliuuu3469 -- apologies for the lack of an earlier response on this. I think inference.py is part of the repository, and I hope issue #28 will solve your problem.