Closed dominiccooney closed 4 months ago
Could you design language for the case where "each file you @-tagged was ok, but you tagged too many in aggregate"?
@dominiccooney @chillatom would it suffice to prevent people from @'ing files if they hit too many?
One gotcha is that you can @ mention things that aren't files too. Are symbols, URLs and line ranges included in the budget?
Alternatively, you let people @ whatever they want but then change the color of the tokens to show which ones won't be included, similar to this past design concept around context limits:
Design the product behavior when you @-tag files in follow-up chats: what's the policy for dropping old files? Also elaborate this task list.
Straw person suggestion:
@toolmantim
would it suffice to prevent people from @'ing files if they hit too many?
My $0.02: We thought about making it a token count budget, but I like this better. It is way easier to understand, and the error message is nice and crisp: no file is larger than A, you can have N of them, so there is a limit of A × N. LLM attention might benefit from fewer inputs regardless of their length anyway. And we can always increase or change it later.
Are symbols, URLs and line ranges included in the budget?
My $0.02:
Definitely for line ranges, because in the limit they're isomorphic to files. If we're doing the N-file limit, later we could take a bunch of line ranges in the same file and say they count as 1, harmonizing it that way?
URLs: I could go either way. We didn't ship URLs (IIRC), so as long as we loop back before shipping it... let's add a launch-blocking task to a mini-PRD URL issue.
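The "multiple line ranges in one file count as 1" harmonization above could be sketched roughly as follows. The `Mention` shape and `countTowardFileLimit` name are illustrative, not actual Cody APIs:

```typescript
// Hypothetical sketch: count @-mentioned line ranges in the same file as a
// single item toward the N-item limit. Symbols and URLs each count as one.
interface Mention {
  kind: 'file' | 'range' | 'symbol' | 'url'
  uri: string // file path or URL
  tokens: number
}

function countTowardFileLimit(mentions: Mention[]): number {
  const seenFiles = new Set<string>()
  let count = 0
  for (const m of mentions) {
    if (m.kind === 'file' || m.kind === 'range') {
      // Multiple ranges in the same file harmonize to one item.
      if (!seenFiles.has(m.uri)) {
        seenFiles.add(m.uri)
        count++
      }
    } else {
      count++
    }
  }
  return count
}
```

With this, two ranges in `a.ts` plus the whole of `b.ts` would count as 2 items, not 3.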
Is the context limit counter limited to @-mention context only?
How does the user know if they are also hitting their input token limit? E.g., when they just copy and paste a file into the chat box.
I ran a little analysis over some common repos.
A budget of 30k tokens would give us 10 files at the 90th percentile.
Total files: 20,571 · Mean: 1,073.6 tokens · Median: 431 · Max: 9,120 · 90th percentile: 3,081
Repos checked
https://github.com/sourcegraph/cody.git, https://github.com/facebook/react.git, https://github.com/django/django.git, https://github.com/rust-lang/rust.git, https://github.com/golang/go.git, https://github.com/apache/kafka.git, https://github.com/google/leveldb.git
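For reference, the per-file token analysis above could be reproduced with something like the sketch below. It assumes a crude ~4 characters/token heuristic; the real analysis presumably used a proper tokenizer, so exact numbers will differ:

```typescript
// Rough sketch: walk a checked-out repo and report per-file token stats.
import * as fs from 'node:fs'
import * as path from 'node:path'

// Crude heuristic (assumption): ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Nearest-rank percentile over a sorted ascending array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))
  return sorted[idx]
}

function analyze(root: string): void {
  const counts: number[] = []
  const walk = (dir: string): void => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name)
      if (entry.isDirectory()) walk(full)
      else counts.push(estimateTokens(fs.readFileSync(full, 'utf8')))
    }
  }
  walk(root)
  counts.sort((a, b) => a - b)
  const mean = counts.reduce((s, n) => s + n, 0) / counts.length
  console.log(
    `Total files: ${counts.length}, Mean: ${mean.toFixed(1)}, ` +
      `Median: ${percentile(counts, 50)}, 90th percentile: ${percentile(counts, 90)}`,
  )
}
```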
We let people @ things using the same token limits for follow-up messages. Any past @'d things will be excluded from context if they can't fit into the budget, evicting the oldest first.
@toolmantim I like this. So we would have a 30k token budget (per analysis above):
First message: User mentions 3 files of 3k tokens each, using 9k of the 30k token budget.
Follow-up 1: User mentions 4 more files of 5k each, using 20k more tokens. We keep the 9k tokens of previously mentioned files, for a total of 29k/30k used.
Follow-up 2: User mentions 2 more files of 4k each, needing 8k tokens. We now evict the 3 oldest files (freeing 9k tokens) to make space for the 2 new files and the 8k tokens they consume, leaving 28k/30k used.
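The oldest-first eviction policy above could be sketched as follows. The `TrackedMention` shape, function name, and 30k constant are assumptions from this thread, not shipped Cody code:

```typescript
// Minimal sketch of oldest-first eviction against a fixed token budget.
interface TrackedMention {
  uri: string
  tokens: number
}

const BUDGET = 30_000 // per the analysis in this thread

// Returns the mentions kept in context after adding `incoming`, evicting
// the oldest existing mentions until everything fits the budget.
function addWithEviction(
  existing: TrackedMention[],
  incoming: TrackedMention[],
): TrackedMention[] {
  const all = [...existing, ...incoming]
  let used = all.reduce((sum, m) => sum + m.tokens, 0)
  let dropFrom = 0
  // Evict oldest first, but never evict the just-added mentions.
  while (used > BUDGET && dropFrom < existing.length) {
    used -= all[dropFrom].tokens
    dropFrom++
  }
  return all.slice(dropFrom)
}
```

Running the follow-up 2 scenario through this (3 × 3k + 4 × 5k existing, 2 × 4k incoming) evicts the three 3k files and keeps 6 mentions totaling 28k tokens, matching the walkthrough.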
30k token
@chillatom is this 30k tokens equal to 30k × 4 bytes per token = 120k bytes?
Update: confirmed this is referring to token count and not bytes
User @-mention budget: this should be large enough for a user to @-mention 5 files that are 85th percentile in size for the average codebase, e.g. 5 × 1,000 LOC = 5,000 LOC × 7.5 tokens/line ≈ 38k tokens.
It's okay to have a per-file limit, so that we don't allow a single 38k-token file as input: e.g. no file exceeds 7k tokens, but you can add as many files as you want up to 38k tokens total.
This extra budget applies specifically to @-mentioned user context files. For example, for chat, DefaultPrompter needs new logic to apply this special budget.