microsoft / IRNet

An algorithm for cross-domain NL2SQL
MIT License

Is IRNet_pretrained.model supposed to achieve 50%+ dev accuracy? #18

Open oney opened 4 years ago

oney commented 4 years ago

I evaluated IRNet_pretrained.model using eval.sh and then ran the official Spider evaluation script on the output, but I got strange results.

                     easy                 medium               hard                 extra                all
count                249                  438                  171                  170                  1028

====================== EXACT MATCHING ACCURACY =====================
exact match          0.084                0.078                0.088                0.065                0.079

---------------------PARTIAL MATCHING ACCURACY----------------------
select               0.207                0.204                0.340                0.253                0.234
select(no AGG)       0.224                0.209                0.346                0.259                0.243
where                0.175                0.168                0.111                0.176                0.162
where(no OP)         0.200                0.168                0.125                0.284                0.189
group(no Having)     0.091                0.318                0.286                0.333                0.285
group                0.000                0.217                0.122                0.286                0.182
order                0.000                0.105                0.275                0.298                0.164
and/or               1.000                0.912                0.898                0.890                0.927
IUEN                 0.000                0.000                0.105                0.148                0.071
keywords             0.369                0.313                0.246                0.224                0.298
---------------------- PARTIAL MATCHING RECALL ----------------------
select               0.201                0.192                0.304                0.235                0.220
select(no AGG)       0.217                0.196                0.310                0.241                0.228
where                0.194                0.158                0.090                0.133                0.148
where(no OP)         0.222                0.158                0.101                0.214                0.174
group(no Having)     0.150                0.315                0.359                0.177                0.269
group                0.000                0.215                0.154                0.152                0.172
order                0.000                0.133                0.186                0.173                0.148
and/or               0.936                0.960                0.961                0.942                0.951
IUEN                 0.000                0.000                0.051                0.111                0.080
keywords             0.433                0.303                0.205                0.188                0.283
---------------------- PARTIAL MATCHING F1 --------------------------
select               0.204                0.198                0.321                0.244                0.227
select(no AGG)       0.220                0.202                0.327                0.250                0.235
where                0.184                0.163                0.099                0.151                0.155
where(no OP)         0.211                0.163                0.112                0.244                0.181
group(no Having)     0.113                0.317                0.318                0.231                0.276
group                1.000                0.216                0.136                0.198                0.177
order                1.000                0.118                0.222                0.219                0.155
and/or               0.967                0.936                0.928                0.915                0.939
IUEN                 1.000                1.000                0.069                0.127                0.075
keywords             0.399                0.308                0.224                0.204                0.290

Did I do something wrong? Thanks!

BTW, the IRNet prediction output has 1028 entries, while the official dev_gold.sql has 1034.

SivilTaram commented 4 years ago

It is worth noting that if the number of predictions does not match the number of gold queries (1028 != 1034), the evaluation is not meaningful, because predictions get compared against the wrong ground-truth queries.
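A quick sanity check before running the evaluator can catch this. The sketch below is only an illustration and not part of IRNet or the Spider scripts: it assumes both files hold one query per non-empty line and that the evaluator pairs them by position. The function names and the file paths in the usage comment are hypothetical.

```python
def count_queries(lines):
    """Count non-empty lines, assuming one query per line."""
    return sum(1 for line in lines if line.strip())


def check_alignment(pred_lines, gold_lines):
    """Return the shared query count, or raise if the counts differ.

    A positional comparison only makes sense when line i of the
    predictions answers the same question as line i of the gold file.
    """
    n_pred = count_queries(pred_lines)
    n_gold = count_queries(gold_lines)
    if n_pred != n_gold:
        raise ValueError(
            f"{n_pred} predictions vs {n_gold} gold queries: "
            "scores would compare mismatched pairs"
        )
    return n_pred


def check_files(pred_path, gold_path):
    """Same check, reading both files from disk (paths are placeholders)."""
    with open(pred_path, encoding="utf-8") as pred_file:
        pred_lines = pred_file.readlines()
    with open(gold_path, encoding="utf-8") as gold_file:
        gold_lines = gold_file.readlines()
    return check_alignment(pred_lines, gold_lines)


# Example (hypothetical paths):
# check_files("predicted_sql.txt", "dev_gold.sql")
```

With the counts reported above, this would raise a ValueError instead of letting the evaluator produce misleading scores.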

hanrelan commented 4 years ago

Hi, I'm having the same issue. By default, the eval.sh script generates an output file with 1028 samples. Any advice on how to make it output 1034 samples, so that the Spider evaluator can be used to replicate the leaderboard result?