mufeedvh / code2prompt

A CLI tool to convert your codebase into a single LLM prompt with source tree, prompt templating, and token counting.
MIT License

Using code2prompt to open large git repositories and generate prompts for LLMs #27

Open LZING opened 2 months ago

LZING commented 2 months ago

Hi, mufeedvh. Thank you for a very nice application.

I'm running into a problem when dealing with large code repositories. With small repositories, code2prompt works great, but with large ones the generated prompt overflows the LLM's token limit.

So how should we deal with large code repositories? Sending only part of the source code loses context. Right now it seems that only Gemini 1.5 Pro can handle about 2M tokens, which is the upper limit.
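For reference, a quick way to confirm the overflow before sending anything is code2prompt's token count option (the --tokens flag described in the README; exact flags may differ across versions):

```sh
# Generate the prompt and print its token count instead of guessing.
code2prompt /path/to/large-repo --tokens
```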

Could code2prompt be tuned for large code repositories? Or do you have any good suggestions?

bhanub2406 commented 3 weeks ago

Hi @LZING, I don't really have a solution for your problem, but I have a couple of observations from my experience:

  1. Including a large code repository means the resulting prompt is very large, which many LLMs don't support. Even when they do, the quality of the output may not meet your expectations.
  2. The full context of the code may not be needed for every use case. For example, the find-security-vulnerabilities, Git commit, and GitHub pull request templates only need part of your code.
  3. You can use arguments like --exclude and --include to reduce the amount of code you send to code2prompt (see the first sketch below). If your requirements are more complex than that, you can write a pre-processing script that fetches the required files/folders into a temp folder, which in turn can be passed to code2prompt (see the second sketch below).
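A minimal sketch of point 3 (the --include and --exclude glob flags are documented in the README, but check code2prompt --help for the exact syntax in your version):

```sh
# Keep only Rust sources and the manifest; drop docs and text files.
code2prompt /path/to/repo \
  --include "*.rs,Cargo.toml" \
  --exclude "*.md,*.txt" \
  --tokens
```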
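And a sketch of the pre-processing idea: stage just the folders you care about in a temp directory and point code2prompt at that (the paths here are placeholders):

```sh
# Copy only the relevant parts of the repo into a temp folder.
tmp=$(mktemp -d)
cp -r /path/to/repo/src /path/to/repo/docs "$tmp/"

# Generate the prompt from the reduced tree and report its token count.
code2prompt "$tmp" --tokens

# Clean up when done.
rm -rf "$tmp"
```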