microsoft / IRNet

An algorithm for cross-domain NL2SQL
MIT License
264 stars, 81 forks

Not reaching the ~53.4% on the Dev-Set #15

brunnurs closed 4 years ago

brunnurs commented 4 years ago

First of all, thanks for making this project public! I think it helps the community a lot to tackle the problem of Text2SQL, and it would be nice if other participants of the Spider-Challenge did the same.

If I run this code with the parameters from train.sh, I reach an accuracy close to 50% on the dev set (see exact results below). Am I missing something? Or do I first need to fix the issues you mention in #14 to reach these numbers? If so, could you be a bit more specific about the changes?

Results:

Epoch: 0    Sketch-Accuracy: 0.5593385214007782     Accuracy: 0.2616731517509728
Epoch: 1    Sketch-Accuracy: 0.6196498054474708     Accuracy: 0.321011673151751
Epoch: 2    Sketch-Accuracy: 0.6955252918287937     Accuracy: 0.3949416342412451
Epoch: 3    Sketch-Accuracy: 0.7130350194552529     Accuracy: 0.40077821011673154
Epoch: 4    Sketch-Accuracy: 0.7422178988326849     Accuracy: 0.4377431906614786
Epoch: 5    Sketch-Accuracy: 0.7431906614785992     Accuracy: 0.4591439688715953
Epoch: 6    Sketch-Accuracy: 0.7431906614785992     Accuracy: 0.45525291828793774
Epoch: 7    Sketch-Accuracy: 0.7130350194552529     Accuracy: 0.42509727626459143
Epoch: 8    Sketch-Accuracy: 0.7558365758754864     Accuracy: 0.4581712062256809
Epoch: 9    Sketch-Accuracy: 0.7315175097276264     Accuracy: 0.42996108949416345
Epoch: 10    Sketch-Accuracy: 0.7558365758754864     Accuracy: 0.4698443579766537
Epoch: 11    Sketch-Accuracy: 0.7490272373540856     Accuracy: 0.4581712062256809
Epoch: 12    Sketch-Accuracy: 0.7665369649805448     Accuracy: 0.47373540856031127
Epoch: 13    Sketch-Accuracy: 0.7519455252918288     Accuracy: 0.45233463035019456
Epoch: 14    Sketch-Accuracy: 0.7402723735408561     Accuracy: 0.4766536964980545
Epoch: 15    Sketch-Accuracy: 0.7509727626459144     Accuracy: 0.4494163424124514
Epoch: 16    Sketch-Accuracy: 0.7422178988326849     Accuracy: 0.45622568093385213
Epoch: 17    Sketch-Accuracy: 0.7490272373540856     Accuracy: 0.4406614785992218
Epoch: 18    Sketch-Accuracy: 0.7159533073929961     Accuracy: 0.433852140077821
Epoch: 19    Sketch-Accuracy: 0.7538910505836576     Accuracy: 0.4406614785992218
Epoch: 20    Sketch-Accuracy: 0.7801556420233463     Accuracy: 0.4698443579766537
Epoch: 21    Sketch-Accuracy: 0.7665369649805448     Accuracy: 0.4785992217898833
Epoch: 22    Sketch-Accuracy: 0.7626459143968871     Accuracy: 0.46303501945525294
Epoch: 23    Sketch-Accuracy: 0.7558365758754864     Accuracy: 0.4785992217898833
Epoch: 24    Sketch-Accuracy: 0.7431906614785992     Accuracy: 0.44260700389105057
Epoch: 25    Sketch-Accuracy: 0.7509727626459144     Accuracy: 0.4785992217898833
Epoch: 26    Sketch-Accuracy: 0.7529182879377432     Accuracy: 0.47373540856031127
Epoch: 27    Sketch-Accuracy: 0.7675097276264592     Accuracy: 0.48151750972762647
Epoch: 28    Sketch-Accuracy: 0.7568093385214008     Accuracy: 0.4805447470817121
Epoch: 29    Sketch-Accuracy: 0.7636186770428015     Accuracy: 0.4727626459143969
Epoch: 30    Sketch-Accuracy: 0.7665369649805448     Accuracy: 0.47568093385214005
Epoch: 31    Sketch-Accuracy: 0.7558365758754864     Accuracy: 0.48151750972762647
Epoch: 32    Sketch-Accuracy: 0.7616731517509727     Accuracy: 0.46303501945525294
Epoch: 33    Sketch-Accuracy: 0.7704280155642024     Accuracy: 0.4727626459143969
Epoch: 34    Sketch-Accuracy: 0.7645914396887159     Accuracy: 0.4776264591439689
Epoch: 35    Sketch-Accuracy: 0.7568093385214008     Accuracy: 0.4698443579766537
Epoch: 36    Sketch-Accuracy: 0.7616731517509727     Accuracy: 0.48249027237354086
Epoch: 37    Sketch-Accuracy: 0.7665369649805448     Accuracy: 0.4571984435797665
Epoch: 38    Sketch-Accuracy: 0.7607003891050583     Accuracy: 0.45428015564202334
Epoch: 39    Sketch-Accuracy: 0.7597276264591439     Accuracy: 0.4581712062256809
Epoch: 40    Sketch-Accuracy: 0.7626459143968871     Accuracy: 0.4678988326848249
Epoch: 41    Sketch-Accuracy: 0.7655642023346303     Accuracy: 0.4688715953307393
Epoch: 42    Sketch-Accuracy: 0.7626459143968871     Accuracy: 0.4659533073929961
Epoch: 43    Sketch-Accuracy: 0.7665369649805448     Accuracy: 0.46303501945525294
Epoch: 44    Sketch-Accuracy: 0.7616731517509727     Accuracy: 0.4659533073929961
Epoch: 45    Sketch-Accuracy: 0.7568093385214008     Accuracy: 0.4669260700389105
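
For reference, the best epoch in a log of this shape can be picked out with a short script. This is a minimal sketch, not part of IRNet: the names `LOG`, `parse_log` and `best_epoch` are hypothetical, and only three lines of the log above are embedded here to keep it short.

```python
import re

# Three lines copied from the log above; in practice, read the whole log file.
LOG = """\
Epoch: 27    Sketch-Accuracy: 0.7675097276264592     Accuracy: 0.48151750972762647
Epoch: 36    Sketch-Accuracy: 0.7616731517509727     Accuracy: 0.48249027237354086
Epoch: 45    Sketch-Accuracy: 0.7568093385214008     Accuracy: 0.4669260700389105
"""

# One match per log line: epoch number, sketch accuracy, accuracy.
LINE_RE = re.compile(
    r"Epoch:\s*(\d+)\s+Sketch-Accuracy:\s*([\d.]+)\s+Accuracy:\s*([\d.]+)"
)


def parse_log(text):
    """Return a list of (epoch, sketch_accuracy, accuracy) tuples."""
    return [
        (int(m.group(1)), float(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(text)
    ]


def best_epoch(rows):
    """Return the row with the highest accuracy (third field)."""
    return max(rows, key=lambda row: row[2])


rows = parse_log(LOG)
epoch, sketch_acc, acc = best_epoch(rows)
print(f"best epoch: {epoch}, accuracy: {acc:.3f}")
```

On the three embedded lines this prints `best epoch: 36, accuracy: 0.482`, which is also the highest accuracy in the full log above.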

JasperGuo commented 4 years ago

Thanks for your appreciation. After training the model, use eval.sh to get predictions from the trained model on the dev set. Then feed the output to the official Spider evaluation script. The exact match accuracy reported by the evaluation script is expected to be close to 53%.

Best Regards, Jiaqi Guo

brunnurs commented 4 years ago

Thanks for the hint, that makes sense. Using the Spider evaluation script I now get the 53%, see below.

An unrelated question: did you put any effort into predicting values, so that execution accuracy can also be calculated?

                     easy                 medium               hard                 extra                all                 
count                246                  440                  174                  168                  1028                

====================== EXACT MATCHING ACCURACY =====================
exact match          0.724                0.543                0.437                0.310                0.530               

---------------------PARTIAL MATCHING ACCURACY----------------------
select               0.890                0.761                0.925                0.766                0.821               
select(no AGG)       0.915                0.786                0.943                0.778                0.842               
where                0.699                0.642                0.435                0.422                0.575               
where(no OP)         0.717                0.652                0.511                0.544                0.620               
group(no Having)     0.739                0.807                0.780                0.653                0.753               
group                0.739                0.739                0.780                0.653                0.722               
order                0.679                0.622                0.900                0.787                0.744               
and/or               1.000                0.970                0.964                0.925                0.969               
IUEN                 0.000                0.000                0.351                0.333                0.329               
keywords             0.881                0.866                0.734                0.729                0.816               
---------------------- PARTIAL MATCHING RECALL ----------------------
select               0.890                0.761                0.925                0.762                0.820               
select(no AGG)       0.915                0.786                0.943                0.774                0.841               
where                0.731                0.674                0.435                0.388                0.582               
where(no OP)         0.750                0.685                0.511                0.500                0.628               
group(no Having)     0.850                0.733                0.821                0.610                0.719               
group                0.850                0.672                0.821                0.610                0.689               
order                0.864                0.613                0.763                0.747                0.719               
and/or               0.980                0.995                0.964                0.949                0.979               
IUEN                 0.000                0.000                0.310                0.278                0.295               
keywords             0.933                0.849                0.713                0.720                0.811               
---------------------- PARTIAL MATCHING F1 --------------------------
select               0.890                0.761                0.925                0.764                0.820               
select(no AGG)       0.915                0.786                0.943                0.776                0.842               
where                0.715                0.658                0.435                0.404                0.578               
where(no OP)         0.733                0.668                0.511                0.521                0.624               
group(no Having)     0.791                0.768                0.800                0.631                0.736               
group                0.791                0.704                0.800                0.631                0.705               
order                0.760                0.617                0.826                0.766                0.732               
and/or               0.990                0.983                0.964                0.937                0.974               
IUEN                 1.000                1.000                0.329                0.303                0.311               
keywords             0.906                0.858                0.723                0.725                0.814               

JasperGuo commented 4 years ago

We have tried to predict values by using a pointer network to select a span from the question. In most cases this works. But there exist some cases where the value is not explicitly mentioned in the question. Consider the case below, where the value T is not mentioned in the question.

Q: How many official languages does Afghanistan have?
S: SELECT COUNT(*) FROM country AS T1 JOIN countrylanguage AS T2 ON T1.Code  =  T2.CountryCode WHERE T1.Name  =  "Afghanistan" AND IsOfficial  =  "T"
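
The limitation can be checked mechanically: a span-selecting pointer network can only produce values whose tokens occur in the question. The sketch below is a hypothetical check, not IRNet code. It assumes values appear as double-quoted literals in the SQL, and the helper names `tokenize`, `string_literals` and `mentioned_in_question` are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def string_literals(sql):
    """Return the double-quoted string literals in a SQL query."""
    return re.findall(r'"([^"]*)"', sql)


def mentioned_in_question(value, question):
    """True if the value's tokens form a contiguous span of question tokens."""
    q, v = tokenize(question), tokenize(value)
    if not v:
        return False
    return any(q[i:i + len(v)] == v for i in range(len(q) - len(v) + 1))


question = "How many official languages does Afghanistan have?"
sql = (
    "SELECT COUNT(*) FROM country AS T1 JOIN countrylanguage AS T2 "
    'ON T1.Code = T2.CountryCode WHERE T1.Name = "Afghanistan" '
    'AND IsOfficial = "T"'
)

for value in string_literals(sql):
    print(value, mentioned_in_question(value, question))
```

This prints `Afghanistan True` and `T False`: the first value can be copied from the question, the second cannot. Matching on whole tokens rather than substrings matters here, since the letter "t" does occur inside other words of the question.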

brunnurs commented 4 years ago

Agreed, it's definitely a challenging task. But to use Text2SQL technology in real-world applications, it is crucial to also consider values, I guess. I might have a further look into this problem.

489597448 commented 4 years ago

Why is your dev set count for "all" only 1028, while the required count is 1034?

JasperGuo commented 4 years ago

There are 6 cases that IRNet currently cannot support. When evaluating the performance of IRNet, simply treat them as incorrect.
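
As a rough illustration of what treating them as incorrect does to the number, the sketch below rescales the exact match accuracy from the table earlier in this thread (0.530 over 1028 examples) to the full 1034. The number of correct predictions is an assumption reconstructed from the rounded 0.530, so the result is approximate.

```python
SUPPORTED = 1028     # examples counted in the evaluation table in this thread
TOTAL_DEV = 1034     # full dev set size mentioned in the question
exact_match = 0.530  # "all" column, rounded to three decimals

# Approximate number of correct predictions (assumption: recovered from
# the rounded accuracy, so it may be off by one).
correct = round(exact_match * SUPPORTED)

# The 6 unsupported cases add to the denominator but not to the numerator.
adjusted = correct / TOTAL_DEV
print(f"~{correct} correct of {TOTAL_DEV}: {adjusted:.3f}")
```

This prints `~545 correct of 1034: 0.527`, so counting the 6 unsupported cases as wrong lowers the exact match accuracy by roughly 0.3 percentage points.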

jaydeepb-inexture commented 4 years ago

@JasperGuo can you explain what sketch_accuracy means here? Also, when I train the model with beam size = 10, I reach an accuracy above 50%, whereas with beam size = 1 I got at most 48%. Can you explain this? I have attached an image of my epoch log from training. At epoch 41 the accuracy is 50%.

epochs

saved_model_models1582866723_epoch.log