sola-st / crystalbleu

Is output value correct? #1

Closed (jackswl closed this 7 months ago)

jackswl commented 7 months ago

Hi all,

Thanks for the wonderful work. However, I am encountering a problem with the output value. For example, when I compare two pieces of code (actual and generated), I get a value like this:

4.939266421177758e-232

Why is the value like this? I could not find any additional information in the paper about evaluating code.

In crystalbleu.py, the docstring example reads:

hyp1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
        'ensures', 'that', 'the', 'military', 'always',
        'obeys', 'the', 'commands', 'of', 'the', 'party']
ref1a = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
         'ensures', 'that', 'the', 'military', 'will', 'forever',
         'heed', 'Party', 'commands']
ref1b = ['It', 'is', 'the', 'guiding', 'principle', 'which',
         'guarantees', 'the', 'military', 'forces', 'always',
         'being', 'under', 'the', 'command', 'of', 'the', 'Party']
ref1c = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
         'army', 'always', 'to', 'heed', 'the', 'directions',
         'of', 'the', 'party']

hyp2 = ['he', 'read', 'the', 'book', 'because', 'he', 'was',
        'interested', 'in', 'world', 'history']
ref2a = ['he', 'was', 'interested', 'in', 'world', 'history',
         'because', 'he', 'read', 'the', 'book']

list_of_references = [[ref1a, ref1b, ref1c], [ref2a]]
hypotheses = [hyp1, hyp2]
corpus_bleu(list_of_references, hypotheses) # doctest: +ELLIPSIS

However, for code, am I supposed to split my code into... individual words? Could you kindly refer me to some examples of how you evaluate code with this metric?

Otherwise, how can I scale or transform this value so that I can compare it with other code metrics like CodeBLEU?

AryazE commented 7 months ago

Why is the value like this?

This is a very small value (almost zero), which means the generated code and the ground truth do not have any matching n-grams. If this is not the case, it would be great if you could share the example so I can look into it.
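
For context, here is a minimal sketch of where these tiny values come from, assuming CrystalBLEU's corpus_bleu inherits NLTK's behaviour here: with no smoothing, NLTK replaces zero n-gram precisions with sys.float_info.min (about 2.2e-308) before taking the geometric mean, so a candidate with some unigram overlap but no shared higher-order n-grams ends up around (2.2e-308) ** 0.75, i.e. on the order of 1e-231.

import sys
from nltk.translate.bleu_score import corpus_bleu

# Some unigram overlap ('a', 'c'), but no shared 2-, 3-, or 4-grams
references = [[['a', 'b', 'c', 'd']]]
hypotheses = [['a', 'x', 'c', 'y']]

score = corpus_bleu(references, hypotheses)  # warns about 0 counts of n-gram overlaps
print(score)                # tiny (~1e-231), not exactly 0
print(sys.float_info.min)   # 2.2250738585072014e-308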

However, for code, am I supposed to split my code into... individual words?

Yes, but what you split the code into might depend on your use case. In most cases, tokenizing the code is what you need.

Could you kindly refer me to some examples on how you evaluate code with this metric?

There are multiple scripts in this repository that use CrystalBLEU to measure similarities. You can find a simple and complete example in this script. In this example, the code is tokenized using a lexer.
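
For illustration, a minimal sketch of lexer-based tokenization (Pygments is used here purely as an example; the repository's script may use a different lexer):

from pygments.lexers import PythonLexer

code = "def add(a, b):\n    return a + b\n"
# Keep token values and drop pure-whitespace tokens
tokens = [value for _, value in PythonLexer().get_tokens(code) if value.strip()]
print(tokens)  # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']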

Please let me know if you have further questions.

jackswl commented 7 months ago

Hi @AryazE, thanks for the response. Allow me to be more specific: I am currently fine-tuning an LLM for code generation. Please correct me if I am wrong at any step.

If I understand you (and your paper) correctly, I first have a 'Code Corpus'. This Code Corpus would be the training dataset that contains only the code that was fed into the LLM for fine-tuning. I want to clarify that this Code Corpus needs to be tokenised, meaning I should pass each string of code through the LLM tokeniser and convert the result back into token strings, so that each piece of code is represented as a sequence of tokens. Is this right? (Of course, a lexer works too, but for this purpose I could simply tokenise with my LLM's tokeniser.)
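
Concretely, I am imagining something along these lines ("gpt2" is only a stand-in for my own model's tokeniser, and k is a cutoff I would have to pick):

from collections import Counter
from nltk.util import ngrams
from transformers import AutoTokenizer

# Stand-in tokenizer; in practice this would be my fine-tuned model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Toy stand-in for the fine-tuning Code Corpus (a list of code strings)
training_corpus = [
    "def add(a, b):\n    return a + b\n",
    "def sub(a, b):\n    return a - b\n",
]

# Tokenize every code sample and flatten everything into one long token list
tokenized_corpus = []
for code in training_corpus:
    tokenized_corpus.extend(tokenizer.tokenize(code))

# Keep the k most frequent 1- to 4-grams as the trivially shared n-grams
k = 500
all_ngrams = []
for n in range(1, 5):
    all_ngrams.extend(ngrams(tokenized_corpus, n))
trivially_shared_ngrams = dict(Counter(all_ngrams).most_common(k))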

I also have a testing dataset, where I have the generated code from the LLM and its corresponding actual code (test set). Similarly, I will break the generated code and the actual code up into their respective tokens, and then calculate the CrystalBLEU score. Let's assume we do not have trivially shared n-grams, and let's assume the code is broken up like this:

from crystalbleu import corpus_bleu

# As stated above, assume there are no trivially shared n-grams
trivially_shared_ngrams = {}

references = ['def', 'sum', '(a, b)', 'return', 'a + b']
candidates = ['def', 'sum', '(a, b)', 'return', 'a + b']

# Calculate CrystalBLEU
crystalBLEU_score = corpus_bleu(
    references, candidates, ignoring=trivially_shared_ngrams)

crystalBLEU_score

As you can see, the references and candidates are exactly the same, but it gives me a score of 1.7808657774417477e-231.

Why does this happen?

AryazE commented 7 months ago

Yes, you can also use other tokenization methods (such as tokenizing using a trained model).

The problem in the script above is that the references argument of corpus_bleu is supposed to be a list of references for each hypothesis, so in this example you should have:

references = [[['def', 'sum', '(a, b)', 'return', 'a + b']]]
candidates = [['def', 'sum', '(a, b)', 'return', 'a + b']]

jackswl commented 7 months ago

Hi @AryazE,

So you mean both references and candidates should each be a list (of references and candidates respectively), right? I am comparing one generated code output (candidate) with the corresponding actual code (reference):

references = [['def', 'sum', '(a, b)', 'return', 'a + b']]
candidates = [['def', 'sum', '(a, b)', 'return', 'a + b']]

Even with this, I get a score of 0, despite the two being identical.

AryazE commented 7 months ago

I was wrong in my previous comment (updated for future reference). Let me explain it this way: the corpus_bleu function calculates the score for a list of hypotheses against a list of references for each hypothesis. So even if you have only one code piece, you need a 2-dimensional array (a list containing one token list) for the hypotheses and a 3-dimensional array for the references.
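
For the snippet above, a sketch with the corrected shapes (reusing the trivially_shared_ngrams you computed from your corpus) would be:

from crystalbleu import corpus_bleu

reference_tokens = ['def', 'sum', '(a, b)', 'return', 'a + b']
candidate_tokens = ['def', 'sum', '(a, b)', 'return', 'a + b']

references = [[reference_tokens]]  # 3-D: hypotheses -> references per hypothesis -> tokens
candidates = [candidate_tokens]    # 2-D: hypotheses -> tokens

score = corpus_bleu(references, candidates, ignoring=trivially_shared_ngrams)
print(score)  # identical token sequences should now score (close to) 1.0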

jackswl commented 7 months ago

Oh, you're right, I totally overlooked this. Thank you so much!