sourcegraph / cody

Type less, code more: Cody is an AI code assistant that uses advanced search and codebase context to help you write and fix code.
https://cody.dev
Apache License 2.0
2.8k stars 304 forks source link

adding more granular diff format for autoedits model training #6173

Closed hitesh-1997 closed 21 hours ago

hitesh-1997 commented 4 days ago

Context

The PR makes the following high-level changes:

  1. Current auto-edit model have trouble understanding the most recent diffs, where it suggest deleting the recently added line or suggest the change recently deleted. One reason is that it doesn't have a seperate view of short term and long term diff.
  2. Introduces a more granular diff format for training the auto-edits model. Currently we only use a single diff format. The PR computes the line level for the changes made in the editor. In addition, ensures that all the continuous changes are groped together as a single entity. Additionally, it derives some strategies to calculate the diff at different granularity levels. Refer to the class for the entry point.
  3. Introduce a helper function to diff format, to simulate the document changes using markers. Refer to helper function here
  4. Refactors recent edits handling to separate long-term and short-term diffs.
  5. Initially the data is logged to the telemetry, to be used for training and evaluating the model offline.
  6. One final change is to log 10 sec diff data by the user in the analytics to capture the short term diffs.

Test plan

Added Unit tests for various changes

hitesh-1997 commented 21 hours ago

As we discussed last week, let's aim to keep PRs focused on a single key change. This will make them easier to review and help reduce the risk of regressions.

This PR could be split into a series of changes that build on each other. For example, we could start by updating the implementation of the UnifiedDiffStrategy with the relevant interfaces (in DefaultContextStrategyFactory and ContextRetrieverDataCollection). After that, we can introduce one new diff strategy per PR: first TwoStageUnifiedDiffStrategy, then LineLevelDiffStrategy. The changes to prompt utils and RecentViewPortRetriever seem relatively independent and could also be extracted into smaller, separate PRs out of the stack.

I know it's not fun to spend time on this "extra" work, but we need to make it a habit from the get-go. This PR could be a great place to start. Holding each other accountable will make it easier to iterate faster in the long run.

Refactored the PR into several other PRs. Closing this one.

  1. https://github.com/sourcegraph/cody/pull/6188
  2. https://github.com/sourcegraph/cody/pull/6189
  3. https://github.com/sourcegraph/cody/pull/6190
  4. https://github.com/sourcegraph/cody/pull/6191