microsoft / CodeXGLUE

CodeXGLUE
MIT License
1.56k stars 366 forks source link

CodeBLEU: the order of normalized variables might affect the data flow match score #104

Closed JiyangZhang closed 2 years ago

JiyangZhang commented 2 years ago

Please take a look at the following example:

ref = "\ if ( db . getCollectionNames () . contains ( collectionName ) ) { \ db . getCollection ( collectionName ) . drop () ; \ mongoDBCollections . remove ( collectionName ) ; \ }"

example1 = " \ if ( ( collectionName != null ) && ( ! ( db . getCollectionNames () . contains ( collectionName ) ) ) ) { \ db . getCollection ( collectionName ) . drop () ; \ mongoDBCollections . remove ( collectionName ) ; \ }" example2 = " \ if ( ( ( db ) != null ) && ( ! ( db . getCollectionNames () . contains ( collectionName ) ) ) ) { \ db . getCollection ( collectionName ) . drop () ; \ mongoDBCollections . remove ( collectionName ) ; \ }"

For example 1 here is the data flow graph and score:

ref dfg: [('db', 2, 'comesFrom', [], []), ('collectionName', 10, 'comesFrom', [], []), ('db', 14, 'comesFrom', ['db'], [2]), ('collectionName', 18, 'comesFrom', ['collectionName'], [10]), ('collectionName', 29, 'comesFrom', ['collectionName'], [10])] cand dfg: [('collectionName', 3, 'comesFrom', [], []), ('db', 11, 'comesFrom', [], []), ('collectionName', 19, 'comesFrom', ['collectionName'], [3]), ('db', 25, 'comesFrom', ['db'], [11]), ('collectionName', 29, 'comesFrom', ['collectionName'], [3]), ('collectionName', 40, 'comesFrom', ['collectionName'], [3])] Normalized ref dfg: [('var_0', 'comesFrom', []), ('var_1', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', ['var_1']), ('var_1', 'comesFrom', ['var_1'])] Normalized cand dfg: [('var_0', 'comesFrom', []), ('var_1', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', ['var_1']), ('var_0', 'comesFrom', ['var_0']), ('var_0', 'comesFrom', ['var_0'])] 0.709 | 0.973 | 0.875 | 0.800 > 0.839 83.91522695531542

For example 2, here is the data flow graph and score

ref dfg: [('db', 2, 'comesFrom', [], []), ('collectionName', 10, 'comesFrom', [], []), ('db', 14, 'comesFrom', ['db'], [2]), ('collectionName', 18, 'comesFrom', ['collectionName'], [10]), ('collectionName', 29, 'comesFrom', ['collectionName'], [10])] cand dfg: [('db', 4, 'comesFrom', [], []), ('db', 13, 'comesFrom', ['db'], [4]), ('collectionName', 21, 'comesFrom', [], []), ('db', 27, 'comesFrom', ['db'], [4]), ('collectionName', 31, 'comesFrom', ['collectionName'], [21]), ('collectionName', 42, 'comesFrom', ['collectionName'], [21])] Normalized ref dfg: [('var_0', 'comesFrom', []), ('var_1', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', ['var_1']), ('var_1', 'comesFrom', ['var_1'])] Normalized cand dfg: [('var_0', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', ['var_1']), ('var_1', 'comesFrom', ['var_1'])] 0.675 | 0.973 | 0.875 | 1.000 > 0.881 88.08105911288919

You can see data flow match for ex1 is 0.8 and 1.0 for ex2. Based on the algorithm, they should have the same score 1.0 because they both cover all the data flows in reference. The difference is because of the order of variables after normalization, i.e. var_0 in ex1 is var_1 in ex2. Is it an unexpected behavior?

Thanks!

Imagist-Shuo commented 2 years ago

Hi, thanks for your question. I wonder if there is some data format error in your examples because I wrote a test file to repeat the behavior, but I got two 1.0 match scores.

The test script is:

import dataflow_match

ref = ["if ( db . getCollectionNames () . contains ( collectionName ) ) { db . getCollection ( collectionName ) . drop () ; mongoDBCollections . remove ( collectionName ) ; }"] example1 = "if ( ( collectionName != null ) && ( ! ( db . getCollectionNames () . contains ( collectionName ) ) ) ) { db . getCollection ( collectionName ) . drop () ; mongoDBCollections . remove ( collectionName ) ; }" example2 = "if ( ( ( db ) != null ) && ( ! ( db . getCollectionNames () . contains ( collectionName ) ) ) ) { db . getCollection ( collectionName ) . drop () ; mongoDBCollections . remove ( collectionName ) ; }"

print('======================= example:1 =======================') score1 = dataflow_match.calc_dataflow_match(ref, example1, 'c_sharp') print('======================= example:2 =======================') score2 = dataflow_match.calc_dataflow_match(ref, example2, 'c_sharp')

print('======================= DF match score =======================') print(score1, score2)

And I print the DFGs and normalized DFGs during the calculation. The final output is: ======================= example:1 ======================= Reference DFG: [('db', 2, 'comesFrom', [], []), ('collectionName', 10, 'comesFrom', [], []), ('db', 14, 'comesFrom', ['db'], [2]), ('collectionName', 18, 'comesFrom', ['collectionName'], [10])] Candidate DFG: [('collectionName', 3, 'comesFrom', [], []), ('db', 11, 'comesFrom', [], []), ('collectionName', 19, 'comesFrom', ['collectionName'], [3]), ('db', 25, 'comesFrom', ['db'], [11]), ('collectionName', 29, 'comesFrom', ['collectionName'], [3]), ('collectionName', 40, 'comesFrom', ['collectionName'], [3])] Normalized reference DFG: [('var_0', 'comesFrom', []), ('var_1', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', ['var_1'])] Normalized candidate DFG: [('var_0', 'comesFrom', []), ('var_1', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', ['var_1']), ('var_0', 'comesFrom', ['var_0']), ('var_0', 'comesFrom', ['var_0'])] ======================= example:2 ======================= Reference DFG: [('db', 2, 'comesFrom', [], []), ('collectionName', 10, 'comesFrom', [], []), ('db', 14, 'comesFrom', ['db'], [2]), ('collectionName', 18, 'comesFrom', ['collectionName'], [10])] Candidate DFG: [('db', 4, 'comesFrom', [], []), ('db', 13, 'comesFrom', ['db'], [4]), ('collectionName', 21, 'comesFrom', [], []), ('db', 27, 'comesFrom', ['db'], [4]), ('collectionName', 31, 'comesFrom', ['collectionName'], [21]), ('collectionName', 42, 'comesFrom', ['collectionName'], [21])] Normalized reference DFG: [('var_0', 'comesFrom', []), ('var_1', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', ['var_1'])] Normalized candidate DFG: [('var_0', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', []), ('var_0', 'comesFrom', ['var_0']), ('var_1', 'comesFrom', ['var_1']), ('var_1', 'comesFrom', ['var_1'])] ======================= DF match score ======================= 1.0 1.0

which shows both of the data-flow match scores are 1.0.

JiyangZhang commented 2 years ago

Hi, thanks for your response!

First, the example is actually java snippet rather than c_sharp. Second, can I ask which version of tree-sitter you are using? It seems the version will affect the results? I got different scores with 0.2.1

Imagist-Shuo commented 2 years ago

Hi, (1) I switched to java, and the problem you mentioned occurred. The normalized method is nontrivial and the current method relies on the order of variables before normalization. The best way is probably to enumerate all normalized numbers and calculate the maximum match score, but this is unrealistic. Actually, it is a hard problem to calculate the semantic match score, which needs a deep understanding of the code snippet. In the future, we may use the scores generated with deep models (such as BERTScore) as a supplement to BLEU-style scores for the semantic match.

(2) The version of tree-sitter we are using is 0.1.1. Different versions may affect the result because of the language change.

JiyangZhang commented 2 years ago

That makes sense.

Thank you!