microsoft / CodeT

MIT License

Retrieved Content #33

Open yiqingxyq opened 1 month ago

yiqingxyq commented 1 month ago

Hi, can you also provide the retrieved content for Table 3, or the code retrieved by UniXCoder? Thanks!

zfj1998 commented 1 month ago

We are truly sorry, but the generated content cannot be restored for now. We would love to help you reproduce the results, though.

yiqingxyq commented 1 month ago

Thanks for being willing to help! Just want to ask about the retrieval setting. When you do retrieval, do you filter out the file containing the code to complete?

If not, the model may retrieve the target of code generation as context, which does not make much sense to me -- if you want to use a model to help you complete the code, the target code does not exist in the repo yet at the time you call the model.

If you do, is there an efficient way to do that?

zfj1998 commented 1 month ago

Of course, we need to filter out the target file to avoid leakage. However, we did not filter out all the content in the target file. We keep the content at the front of the target file that is not covered by the context provided to the LM. For example, if file A has 100 lines, with lines 20-80 as the unfinished code and line 81 as the completion hole, then during retrieval we also retrieve lines 1-19 as useful supplementary information for the completion.

The code related to this matter is https://github.com/microsoft/CodeT/blob/35f54d60b152cc31d134b788e702878ad613d9f7/RepoCoder/search_code.py#L44

The context_start_lineno is metadata we stored for each completion case.
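The filtering rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in search_code.py: the `Window` class, the `keep_candidate` function, the inclusive 1-indexed line convention, and the strict `<` comparison are all assumptions made for the example. Only `context_start_lineno` is a name taken from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical retrieval candidate: a span of lines from one file."""
    fpath: str
    start_line: int  # 1-indexed, inclusive (assumed convention)
    end_line: int    # 1-indexed, inclusive (assumed convention)


def keep_candidate(window: Window, target_fpath: str, context_start_lineno: int) -> bool:
    """Keep windows from other files, and windows from the target file
    that end before the in-file context starts."""
    if window.fpath != target_fpath:
        return True
    return window.end_line < context_start_lineno


# File A: lines 20-80 are the unfinished code, line 81 is the hole.
candidates = [
    Window("A.py", 1, 19),    # before the context: kept
    Window("A.py", 15, 25),   # overlaps the context: dropped
    Window("A.py", 81, 90),   # at or after the hole: dropped
    Window("B.py", 30, 80),   # different file: kept
]
kept = [w for w in candidates if keep_candidate(w, "A.py", context_start_lineno=20)]
```

With these inputs, `kept` contains only the `A.py` window covering lines 1-19 and the `B.py` window.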

yiqingxyq commented 1 month ago

Thanks. The setting makes sense to me!

I retrieved the GT context for the "function" split by adapting your code (window_size=50, slice_size=5). Then I filtered out the unfinished part using your logic (line 44) and ran code generation using ChatGPT (2k tokens for GT context, 2k for in-file context). I only got Pass@1=0.2895, while you reported 0.4263.
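For anyone comparing settings, here is a rough sketch of how a repository file might be cut into candidate windows with window_size=50 and slice_size=5. It assumes that slice_size sets the stride as `window_size // slice_size` and that windows start at the stride positions. Both are assumptions for illustration; the repository code may place windows differently (for example, centered on each stride position), so treat `make_windows` as hypothetical.

```python
def make_windows(lines, window_size=50, slice_size=5):
    """Cut a list of source lines into overlapping windows.

    Assumption: the stride is window_size // slice_size (at least 1).
    Returns (start_line, end_line, text) tuples, 1-indexed and inclusive.
    """
    step = max(1, window_size // slice_size)
    windows = []
    for start in range(0, len(lines), step):
        end = min(len(lines), start + window_size)
        windows.append((start + 1, end, "\n".join(lines[start:end])))
    return windows


# A 100-line file yields a window every 10 lines under these assumptions.
file_lines = [f"line {i}" for i in range(1, 101)]
windows = make_windows(file_lines)
```

A different stride or window placement changes which lines land in the top-k retrieved windows, so it is worth checking this detail when results differ.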

The full evaluation results are:

{
    "EM": 0.10723860589812333,
    "ES": 0.48067297674081083,
    "Pass@1": 0.289544235924933
}
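For reference, EM and ES are commonly read as exact match and edit similarity. The sketch below shows one common way to compute them per sample, assuming ES is `1 - levenshtein / max(len)` over characters and EM compares whitespace-stripped strings. The exact normalization used by the evaluation script may differ, and the function names here are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def edit_similarity(pred: str, target: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, target) / longest


def exact_match(pred: str, target: str) -> float:
    """Assumed definition: equality after stripping outer whitespace."""
    return float(pred.strip() == target.strip())
```

Averaging these per-sample scores over the dataset gives numbers on the same scale as the EM and ES values above. Pass@1 is different in kind, since it requires executing the generated code against unit tests.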

Here's the GT context I got: repoeval-function-4k-gt-top5-filter.jsonl.txt

Can you run your code to produce the GT context file for RepoEval-function, so I can compare the difference? Thanks!!!

binwensun commented 2 weeks ago

I am also trying to run this project successfully. Cheers, and thanks for your issue, guys. It gives me hope that I can do it! Thanks!

yiqingxyq commented 2 weeks ago

If you're interested, here's our implementation of gt retrieval (without filtering the unfinished part): https://github.com/code-rag-bench/code-rag-bench/tree/main?tab=readme-ov-file#retrieval.

binwensun commented 2 weeks ago

Wow! Thanks for your support! I will try it again!
