gptlint / gptlint

A linter with superpowers! 🔥 Use LLMs to enforce best practices across your codebase.
https://gptlint.dev
MIT License

Add support for 2-pass linting #4

Closed · transitive-bullshit closed this 4 months ago

transitive-bullshit commented 4 months ago

This PR adds support for a two-pass linting strategy: a weakModel first generates potential rule violations, and if any are found, a stronger model then validates whether each one is a real violation or a false positive.
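The two-pass flow described above can be sketched roughly as follows. The model calls are stubbed out as plain functions, and names like `findCandidateViolations` and `validateViolations` are illustrative, not gptlint's actual API:

```javascript
// Sketch of the two-pass strategy, with LLM calls stubbed as plain
// functions. Names here are illustrative, not gptlint's real API.

// Pass 1: a cheap weak model scans the file and proposes candidates.
// (In reality this is an LLM call; here we fake two candidates.)
function findCandidateViolations(weakModel, file, rule) {
  return [
    { ruleName: rule, codeSnippet: 'var x = 1' },
    { ruleName: rule, codeSnippet: 'let y = 2' }
  ]
}

// Pass 2: the strong model classifies each candidate as a real
// violation or a false positive. (Faked here with a string check.)
function validateViolations(strongModel, file, candidates) {
  return candidates.filter((c) => c.codeSnippet.startsWith('var'))
}

function lintFileTwoPass({ weakModel, strongModel, file, rule }) {
  // With no weakModel configured, fall back to single-pass linting
  // with the strong model alone.
  if (!weakModel) {
    const all = [{ ruleName: rule, codeSnippet: 'var x = 1' }]
    return validateViolations(strongModel, file, all)
  }

  const candidates = findCandidateViolations(weakModel, file, rule)

  // Skip the expensive second pass entirely when pass 1 is clean.
  if (candidates.length === 0) return []

  return validateViolations(strongModel, file, candidates)
}

const violations = lintFileTwoPass({
  weakModel: 'gpt-3.5-turbo',
  strongModel: 'gpt-4-turbo-preview',
  file: 'example.ts',
  rule: 'prefer-let-over-var'
})
console.log(violations.length) // 1 confirmed violation out of 2 candidates
```

The key cost win is the early return: files with no candidate violations (the common case on a healthy codebase) never hit the expensive model at all.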

Advantages of this approach

Disadvantages of this approach


Next Steps

Overall, I'm very happy with the tradeoffs of this new approach. The cost/benefit tradeoff is clear when running the old version vs the new version side-by-side, and I expect that with some more experimentation, we'll be able to address the most common issues with the weaker models. Users who want the highest likelihood of not missing potential errors can use the following config to opt back into the more expensive single-pass approach:

```js
// gptlint.config.js
export default [
  {
    llmOptions: {
      model: 'gpt-4-turbo-preview',
      weakModel: null // setting this to null disables two-pass linting
    }
  }
]
```
transitive-bullshit commented 4 months ago

Update: I think I've gotten gpt-3.5-turbo to no longer spit out mega codeSnippet blocks, which makes things work much more reliably.
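One way to guard against a weak model returning huge codeSnippet blocks is a simple post-filter on snippet size. This is just a sketch of that idea, not necessarily how the fix was implemented here, and the 10-line threshold is an arbitrary illustrative choice:

```javascript
// Sketch: drop candidate violations whose codeSnippet is suspiciously
// large, since a snippet spanning dozens of lines usually means the
// weak model pasted a whole block instead of pointing at a violation.
const MAX_SNIPPET_LINES = 10 // arbitrary threshold for illustration

function pruneMegaSnippets(candidates) {
  return candidates.filter(
    (c) => c.codeSnippet.split('\n').length <= MAX_SNIPPET_LINES
  )
}

const candidates = [
  { codeSnippet: 'var x = 1' },
  // Simulate a "mega" snippet: 40 lines of pasted code.
  { codeSnippet: Array.from({ length: 40 }, (_, i) => `line ${i}`).join('\n') }
]
console.log(pruneMegaSnippets(candidates).length) // 1
```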

transitive-bullshit commented 4 months ago

Before & After Results

OpenAI

Left (before) is using gpt-4-turbo-preview with single-pass linting.

Right (after) uses gpt-4-turbo-preview as the strong model and gpt-3.5-turbo as the weakModel with two-pass linting.


SUMMARY: with single-pass linting, GPT-4 Turbo found 1 false positive, cost $4.72 USD, and took 1m32s. With two-pass linting, it correctly found no errors, cost $0.78, and took 41s – a 6x reduction in cost with an increase in accuracy and a ~2.2x speedup.

Anthropic Claude

Left (before) is using anthropic/claude-3-opus:beta with single-pass linting.

Right (after) uses anthropic/claude-3-opus:beta as the strong model and anthropic/claude-3-haiku:beta as the weakModel with two-pass linting.


SUMMARY: with single-pass linting, Claude Opus found 25 false positives, cost $9.80 USD, and took 3m16s. With two-pass linting, it found only 2 false positives (which I'm sure we can get rid of), cost $0.67, and took 1m8s – a 15x reduction in cost with a huge increase in accuracy and a ~3x speedup.
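The reductions quoted in both summaries follow directly from the raw figures:

```javascript
// Verify the cost/speed ratios quoted above from the raw numbers.
const round1 = (x) => Math.round(x * 10) / 10

// OpenAI: $4.72 → $0.78, 1m32s (92s) → 41s
console.log(round1(4.72 / 0.78)) // 6.1  (~6x cheaper)
console.log(round1(92 / 41))     // 2.2  (~2.2x faster)

// Claude: $9.80 → $0.67, 3m16s (196s) → 1m8s (68s)
console.log(round1(9.8 / 0.67))  // 14.6 (~15x cheaper)
console.log(round1(196 / 68))    // 2.9  (~3x faster)
```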

mergebandit commented 4 months ago

First of all, 🔥🔥🔥. The results speak for themselves. Substantively, very little to add.

But as somebody who is relatively familiar with a lot of what's going on and how this has evolved, this is definitely making it harder to reason about (whether it's the modelSupportsJsonResponseFormat handling or the two-pass flow) just by reading through lint-file.ts, while the previous version was about 50% of the LoC (and much of that was the prompt itself).

Very reasonable for this to just be the natural evolution from "naive first pass" to "this actually has to work".

transitive-bullshit commented 4 months ago

> this is definitely making it harder to reason about (whether it's the modelSupportsJsonResponseFormat, or the two-pass) just by reading through lint-file.ts, while the previous version was like 50% the LoC (and much of that was the prompt itself).

Agreed. I added a TODO to lint-file.ts to reduce this duplication. It is fundamentally more complicated since it's doing more, but we should be able to eliminate ~90% of the duplicated code and prompts, since this was just a first pass to see whether the approach was actually worth the effort. I may end up addressing the duplication and simplifying things where I can before merging this PR.
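One way to cut that duplication is to factor the shared prompt construction into a single helper parameterized by which pass is running. This is a rough, hypothetical sketch (`buildLintPrompt` is not gptlint's actual internals):

```javascript
// Hypothetical sketch: deduplicate the two lint passes by sharing one
// prompt template, parameterized by pass ('generate' vs 'validate').
function buildLintPrompt({ rule, file, pass }) {
  const task =
    pass === 'generate'
      ? 'List any potential violations of the rule below.'
      : 'For each candidate violation, decide if it is real or a false positive.'

  // Everything below the task line is shared between both passes.
  return `${task}\nRule: ${rule}\nFile: ${file}`
}

const p1 = buildLintPrompt({ rule: 'prefer-const', file: 'a.ts', pass: 'generate' })
const p2 = buildLintPrompt({ rule: 'prefer-const', file: 'a.ts', pass: 'validate' })
console.log(p1 !== p2 && p1.includes('prefer-const')) // true
```

The same shaping could apply to response parsing: one parser for the candidate/validated violation shape, instead of two near-identical copies in lint-file.ts.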

transitive-bullshit commented 4 months ago

> Very reasonable for this to just be the natural evolution from "naive first pass" to "this actually has to work".

Yep. This is where 95% of the hidden depth tends to lie in all LLM-based products. The transition from "cool demo" to "reliable enough to use in production" is where the dragons lurk. Some folks refer to this as the last mile problem of gen AI.