mufeedvh / code2prompt

A CLI tool to convert your codebase into a single LLM prompt with source tree, prompt templating, and token counting.

Any plans to introduce code indexing? #16

Open RastislavKish opened 4 months ago

RastislavKish commented 4 months ago

Hello,

First of all, a cool project!

Larger codebases often significantly exceed even the largest context windows available these days, while offline LLMs are even more troublesome in this regard.

It could be useful to implement an indexing feature that, instead of generating a single prompt from the codebase, would output multiple smaller prompts of at most N tokens each, with the purpose of creating some kind of code abstraction. This abstraction could then be used together with just the single file of code where modifications should be made.
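Roughly, something like this is what I have in mind (just a sketch: Python and tiktoken are purely illustrative, and a real implementation would also have to split individual files that exceed the budget):

```python
# Sketch only: pack whole files into prompts of at most MAX_TOKENS tokens.
# tiktoken is an arbitrary choice here; any tokenizer would do.
from pathlib import Path

import tiktoken

MAX_TOKENS = 4000
enc = tiktoken.get_encoding("cl100k_base")

def chunked_prompts(root):
    """Yield prompts of at most MAX_TOKENS tokens, keeping whole files together."""
    chunk, used = [], 0
    for path in sorted(Path(root).rglob("*.py")):
        text = f"### {path}\n{path.read_text()}\n"
        n = len(enc.encode(text))
        if used + n > MAX_TOKENS and chunk:
            yield "".join(chunk)
            chunk, used = [], 0
        chunk.append(text)
        used += n  # caveat: a single file larger than MAX_TOKENS still overflows
    if chunk:
        yield "".join(chunk)
```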

I don't use LLMs for coding very frequently, but this seems like the only plausible approach for fitting large codebases into LLMs. Have you considered or experimented with this approach and a possible implementation in code2prompt?

mufeedvh commented 3 months ago

Hi @RastislavKish, this is an interesting feature request. Splitting the codebase into multiple chunks would lose important context that matters when working with the entire codebase. Also, keep in mind that the context window applies to the entire conversation with an LLM: it acts as a sliding window, so earlier context gets dropped as more tokens are consumed.

Could you please describe how you'd be using such a feature? I'll see if I can come up with a feature tailored to your needs.

swiftugandan commented 2 months ago

Aider gives some clues on how you could compress the context by looking at just the symbols in the code: https://aider.chat/2023/10/22/repomap.html. Perhaps you could consider something similar.
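For illustration, here's a toy Python-only version of that idea using the stdlib `ast` module (the linked post builds its map with ctags, so this only shows the shape of the output, not aider's implementation):

```python
# Toy "repo map": one line per file, indented lines for its top-level symbols.
import ast
from pathlib import Path

def repo_map(root):
    """Emit a compact map of files and their top-level classes/functions."""
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue
        symbols = []
        for node in tree.body:  # top-level only, like a terse outline
            if isinstance(node, ast.ClassDef):
                symbols.append(f"  class {node.name}")
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                symbols.append(f"  def {node.name}({args})")
        if symbols:
            lines.append(f"{path}:")
            lines.extend(symbols)
    return "\n".join(lines)
```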

dbenn8 commented 1 month ago

I once built a really simple Python script that traversed the code files in a project, pulled out each function (and maybe each symbol?), and created a single markdown file with the parameters, return types, and comments, all grouped by file.

It was probably a hacky solution to this problem (context windows were smaller then), but it does help the LLM get broad overall context if you also feed it the full details of the sections of code most relevant to the specific problem you want help with.
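A minimal reconstruction of that kind of script (stdlib `ast` only, Python 3.9+ for `ast.unparse`; the names here are illustrative rather than my original code):

```python
# Write one markdown file listing every function's signature, return
# annotation, and first docstring line, grouped by source file.
import ast
from pathlib import Path

def summarize(root, out="codebase_summary.md"):
    parts = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue
        entries = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                ret = f" -> {ast.unparse(node.returns)}" if node.returns else ""
                doc = (ast.get_docstring(node) or "").splitlines()
                summary = f" : {doc[0]}" if doc else ""
                entries.append(f"- `{node.name}({args}){ret}`{summary}")
        if entries:
            parts.append(f"## {path}")
            parts.extend(entries)
    Path(out).write_text("\n".join(parts))
```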