gptlint / gptlint

A linter with superpowers! 🔥 Use LLMs to enforce best practices across your codebase.
https://gptlint.dev
MIT License

Add support for 2-pass linting #4

Closed · transitive-bullshit closed this 4 months ago

transitive-bullshit commented 4 months ago

This PR adds support for a two-pass linting strategy: a weakModel first generates potential rule violations, and if any are found, a stronger model then validates whether each one is a real violation or a false positive.
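The two-pass flow described above can be sketched roughly as follows. The model calls are stubbed out as plain functions, and names like `findCandidateViolations` and `validateViolations` are illustrative, not gptlint's actual API:

```javascript
// Sketch of the two-pass strategy, with LLM calls stubbed as plain
// functions. Names here are illustrative, not gptlint's real API.

// Pass 1: a cheap weak model scans the file and proposes candidates.
// (In reality this is an LLM call; here we fake two candidates.)
function findCandidateViolations(weakModel, file, rule) {
  return [
    { ruleName: rule, codeSnippet: 'var x = 1' },
    { ruleName: rule, codeSnippet: 'let y = 2' }
  ]
}

// Pass 2: the strong model classifies each candidate as a real
// violation or a false positive. (Faked here with a string check.)
function validateViolations(strongModel, file, candidates) {
  return candidates.filter((c) => c.codeSnippet.startsWith('var'))
}

function lintFileTwoPass({ weakModel, strongModel, file, rule }) {
  // With no weakModel configured, fall back to single-pass linting
  // with the strong model alone.
  if (!weakModel) {
    const all = [{ ruleName: rule, codeSnippet: 'var x = 1' }]
    return validateViolations(strongModel, file, all)
  }

  const candidates = findCandidateViolations(weakModel, file, rule)

  // Skip the expensive second pass entirely when pass 1 is clean.
  if (candidates.length === 0) return []

  return validateViolations(strongModel, file, candidates)
}

const violations = lintFileTwoPass({
  weakModel: 'gpt-3.5-turbo',
  strongModel: 'gpt-4-turbo-preview',
  file: 'example.ts',
  rule: 'prefer-let-over-var'
})
console.log(violations.length) // 1 confirmed violation out of 2 candidates
```

The key cost win is the early return: files with no candidate violations (the common case on a healthy codebase) never hit the expensive model at all.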

Advantages of this approach

Disadvantages of this approach


Next Steps

Overall, I'm very happy with the tradeoffs of this new approach. The cost/benefit tradeoff is clear when running the old version vs the new version side-by-side, and I expect that with some more experimentation, we'll be able to address the most common issues with the weaker models. Users who want the highest likelihood of not missing potential errors can use the following config to opt back into the more expensive single-pass approach:

```js
// gptlint.config.js
export default [
  {
    llmOptions: {
      model: 'gpt-4-turbo-preview',
      weakModel: null // setting this to null disables two-pass linting
    }
  }
]
```
transitive-bullshit commented 4 months ago

Update: I think I've gotten gpt-3.5-turbo to no longer spit out mega codeSnippet blocks, which makes things work much more reliably.
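One way to guard against a weak model returning huge codeSnippet blocks is a simple post-filter on snippet size. This is just a sketch of that idea, not necessarily how the fix was implemented here, and the 10-line threshold is an arbitrary illustrative choice:

```javascript
// Sketch: drop candidate violations whose codeSnippet is suspiciously
// large, since a snippet spanning dozens of lines usually means the
// weak model pasted a whole block instead of pointing at a violation.
const MAX_SNIPPET_LINES = 10 // arbitrary threshold for illustration

function pruneMegaSnippets(candidates) {
  return candidates.filter(
    (c) => c.codeSnippet.split('\n').length <= MAX_SNIPPET_LINES
  )
}

const candidates = [
  { codeSnippet: 'var x = 1' },
  // Simulate a "mega" snippet: 40 lines of pasted code.
  { codeSnippet: Array.from({ length: 40 }, (_, i) => `line ${i}`).join('\n') }
]
console.log(pruneMegaSnippets(candidates).length) // 1
```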

transitive-bullshit commented 4 months ago

Before & After Results

OpenAI

Left (before) is using gpt-4-turbo-preview with single-pass linting.

Right (after) uses gpt-4-turbo-preview as the strong model and gpt-3.5-turbo as the weakModel with two-pass linting.


SUMMARY: with single-pass linting, GPT-4 Turbo found 1 false positive, cost $4.72 USD, and took 1m32s. With two-pass linting, it correctly found no errors, cost $0.78, and took 41s – a 6x reduction in cost with an increase in accuracy and a ~2.2x speedup.

Anthropic Claude

Left (before) is using anthropic/claude-3-opus:beta with single-pass linting.

Right (after) uses anthropic/claude-3-opus:beta as the strong model and anthropic/claude-3-haiku:beta as the weakModel with two-pass linting.


SUMMARY: with single-pass linting, Claude Opus found 25 false positives, cost $9.80 USD, and took 3m16s. With two-pass linting, it found only 2 false positives (which I'm sure we can get rid of), cost $0.67, and took 1m8s – a 15x reduction in cost with a huge increase in accuracy and a ~3x speedup.
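The reductions quoted in both summaries follow directly from the raw figures:

```javascript
// Verify the cost/speed ratios quoted above from the raw numbers.
const round1 = (x) => Math.round(x * 10) / 10

// OpenAI: $4.72 → $0.78, 1m32s (92s) → 41s
console.log(round1(4.72 / 0.78)) // 6.1  (~6x cheaper)
console.log(round1(92 / 41))     // 2.2  (~2.2x faster)

// Claude: $9.80 → $0.67, 3m16s (196s) → 1m8s (68s)
console.log(round1(9.8 / 0.67))  // 14.6 (~15x cheaper)
console.log(round1(196 / 68))    // 2.9  (~3x faster)
```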

mergebandit commented 4 months ago

First of all, 🔥🔥🔥. The results speak for themselves. Substantively, very little to add.

But as somebody who is relatively familiar with a lot of what's going on and how this has evolved, this is definitely making it harder to reason about (whether it's the modelSupportsJsonResponseFormat handling or the two-pass flow) just by reading through lint-file.ts, while the previous version was about 50% of the LoC (and much of that was the prompt itself).

Very reasonable for this to just be the natural evolution from "naive first pass" to "this actually has to work".

transitive-bullshit commented 4 months ago

> this is definitely making it harder to reason about (whether it's the modelSupportsJsonResponseFormat, or the two-pass) just by reading through lint-file.ts, while the previous version was like 50% the LoC (and much of that was the prompt itself).

Agreed. I added a TODO to lint-file.ts to reduce this duplication. It is fundamentally more complicated since it's doing more, but we should be able to eliminate ~90% of the duplicated code and prompts, since this was just a first pass to see whether the approach was actually worth the effort. I may end up addressing the duplication and simplifying things where I can before merging this PR.
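One way to cut that duplication is to factor the shared prompt construction into a single helper parameterized by which pass is running. This is a rough, hypothetical sketch (`buildLintPrompt` is not gptlint's actual internals):

```javascript
// Hypothetical sketch: deduplicate the two lint passes by sharing one
// prompt template, parameterized by pass ('generate' vs 'validate').
function buildLintPrompt({ rule, file, pass }) {
  const task =
    pass === 'generate'
      ? 'List any potential violations of the rule below.'
      : 'For each candidate violation, decide if it is real or a false positive.'

  // Everything below the task line is shared between both passes.
  return `${task}\nRule: ${rule}\nFile: ${file}`
}

const p1 = buildLintPrompt({ rule: 'prefer-const', file: 'a.ts', pass: 'generate' })
const p2 = buildLintPrompt({ rule: 'prefer-const', file: 'a.ts', pass: 'validate' })
console.log(p1 !== p2 && p1.includes('prefer-const')) // true
```

The same shaping could apply to response parsing: one parser for the candidate/validated violation shape, instead of two near-identical copies in lint-file.ts.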

transitive-bullshit commented 4 months ago

> Very reasonable for this to just be the natural evolution from "naive first pass" to "this actually has to work".

Yep. This is where 95% of the hidden depth tends to lie in all LLM-based products. The transition from "cool demo" to "reliable enough to use in production" is where the dragons lurk. Some folks refer to this as the last mile problem of gen AI.