lchen001 / LLMDrift

Apache License 2.0
332 stars 32 forks

Deceptive definition of "directly executable" code #3

Open python273 opened 1 year ago

python273 commented 1 year ago

I think it's reasonable to expect that the model's behavior might change a little with a version change, and that e.g. you'd need to add a tiny bit of parsing to extract code from triple backticks.
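
Something like this would do it (a minimal sketch; `extract_code` and the regex are mine, not anything from the repo):

```python
import re

def extract_code(response: str) -> str:
    """Return the body of the first triple-backtick block,
    or the response unchanged if there is no fence."""
    match = re.search(r"```(?:\w*)\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response
```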

georgeslabreche commented 1 year ago

Aha, I came to this repo to create the exact same Issue. Folks, those characters are not single quotes; they are called backticks and they have a purpose. They let the code be formatted and prettified as code when pasted into a text editor or file that supports that syntax, like markdown files, or even this GitHub Issue comment (similar to how ChatGPT formats code samples in its responses):

```python
class Solution(object):
  def isFascinating(self, n):
    # Convert n to string
    n_str = str(n)
    # Concatenate n with 2*n and 3*n
    concatenated_str = n_str + str(2 * n) + str(3 * n)
    # Check if the concatenated string contains
    # all digits from 1 to 9 exactly once
    if set(concatenated_str) == set('123456789'):
        return True
    else:
        return False
```

You've even used those backticks in your README for your LaTeX citation info.

Misinterpreting the intent of the backticks pretty much invalidates your study's conclusion (and you might want to refrain from publication until it is corrected). Don't be discouraged, though! I would love to see a revised study with a correct interpretation of the generated code. Holler when you've addressed this issue, go team! 💪👌

python273 commented 1 year ago

I think the authors just wanted to hype on Twitter 🤡

> The math questions were of the form “Is 17077 prime?” They picked 500 numbers, but all of them were prime!
>
> The June version of GPT-3.5 and the March version of GPT-4 almost always conclude that the number is prime regardless of whether it is prime or composite. The other two models do the opposite. But the paper only tested prime numbers, and hence concluded that GPT-3.5’s performance improved while GPT-4’s degraded.

https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-time
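
To see why this matters: on an all-prime test set, always answering "prime" scores 100%. A quick sketch (mine, using sympy for ground truth, not anything from the paper):

```python
from sympy import isprime

# 500 primes, mirroring the paper's all-prime test set
primes_only = [n for n in range(10_000, 100_000) if isprime(n)][:500]

def always_prime(n):
    return True  # a "model" that answers "prime" no matter what

acc = sum(always_prime(n) == isprime(n) for n in primes_only) / len(primes_only)
print(acc)  # 1.0 -- perfect score without doing any math

# A balanced set of primes and composites would expose the bias
composites = [n for n in range(10_000, 100_000) if not isprime(n)][:500]
balanced = primes_only + composites
acc = sum(always_prime(n) == isprime(n) for n in balanced) / len(balanced)
print(acc)  # 0.5 -- chance level
```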

xadhrit commented 1 year ago

> I think the authors just wanted to hype on Twitter 🤡
>
> The math questions were of the form “Is 17077 prime?” They picked 500 numbers, but all of them were prime!
>
> The June version of GPT-3.5 and the March version of GPT-4 almost always conclude that the number is prime regardless of whether it is prime or composite. The other two models do the opposite. But the paper only tested prime numbers, and hence concluded that GPT-3.5’s performance improved while GPT-4’s degraded.
>
> https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-time

They were trying hard to produce this behaviour [intentionally]

siboehm commented 1 year ago

I tried re-running the experiments and stripping these backticks from their data. It seems the June GPT-4 version actually does better than the March version:

| LeetCode accept | june_fixed | june_orig | march_orig |
| --- | --- | --- | --- |
| True | 35 | 5 | 26 |
| False | 15 | 45 | 24 |

See my fork: https://github.com/siboehm/LLMDrift
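
For anyone re-scoring on their own: after stripping the fences, a compile check is a rough stand-in for "directly executable". This is a sketch only, not the exact code in the fork; the accept counts above come from actual LeetCode submissions:

```python
def compiles_as_python(code: str) -> bool:
    """Rough proxy for 'directly executable': does the
    fence-stripped response parse as valid Python?"""
    try:
        compile(code, "<response>", "exec")
        return True
    except SyntaxError:
        return False
```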

python273 commented 1 year ago

@siboehm March version might get better too with code extraction?

siboehm commented 1 year ago

No, in the authors' dataset there are no responses from the March model that have markdown formatting.

knilink commented 1 year ago

The point isn't just to test whether the model can write code, but also how well the model follows users' instructions: users would expect the response not to include those "quotes" when the prompt says "code only".

Though, it's questionable whether the prompt is optimal.

python273 commented 1 year ago

Then why mix measuring the correctness of the code with whether the code is in a markdown block? It's arguable what "code only" means for a chat model that is supposed to use formatting.

The authors made big claims on Twitter about a large performance decrease, and now pretend it's only about formatting (which can be stripped with a line of code) and CoT not working (for their specific prompts).

PriNova commented 1 year ago

When I, as a company, need to use a regex to extract code from triple backticks, which was not needed with the older model, I will definitely say: "This chatbot has decreased in performance, even though I instructed it to provide 'only code'." Why? I need to add extra code for extraction. Scaled up, this is a performance degradation on my side and in the behavior of the LLM. If it is a semantic difference, then it must be valued differently. But the authors said that the triple backticks were not the only evaluation point. Most people do not consider the overall context.

sovos15 commented 1 year ago

> I tried re-running the experiments and stripping these backticks from their data. It seems the June GPT-4 version actually does better than the March version:
>
> | LeetCode accept | june_fixed | june_orig | march_orig |
> | --- | --- | --- | --- |
> | True | 35 | 5 | 26 |
> | False | 15 | 45 | 24 |
>
> See my fork: https://github.com/siboehm/LLMDrift

I appreciate you actually doing the experiment, though I wonder how you performed time travel to go back and use closed-source models from months ago. I appreciate the astounding effort.

Just so that I understand your premise, though:

"Prompting is not for regular people it's a new coding language that is going to require the same amount of semantical rigor and attention to syntax as modern coding, these large language models are not intended to understand common English they're intended to understand some pig Latin esk modification of English that changes month to month and needs to be relearned"

Do I have that about right? Does anyone seriously think that's a sane position?