google-deepmind / loft

LOFT: A 1 Million+ Token Long-Context Benchmark

SQL specialized vs LCM benchmark question #1

Open vkaul11 opened 1 week ago

vkaul11 commented 1 week ago

Thanks for the paper. I did not understand the difference between the SQL-specialized and Long Context Model metrics for the SQL task. For the SQL-specialized pipeline, do you use the DAIL-SQL prompt with Gemini and compare it against a plain "general" prompt where Gemini reasons and answers directly, without going through SQL generation? It would be good to get that clarification. Also, DAIL reports 86.5% accuracy on Spider. Why do you report the accuracy as 70% instead?

anthonywchen commented 4 days ago

Thanks for reading the paper. To answer your questions:

I did not understand the difference between SQL specialized vs Long Context Model metrics for the SQL task.

The metric for both settings is execution accuracy, i.e., whether the predicted answer matches the result of executing the gold SQL query against the database. What differs between the two settings is how the prediction is produced: the Long Context Model setting places the database contents in context and has the model answer directly, while the SQL-specialized pipeline has the model generate a SQL query that is then executed (see the second bullet below).
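For concreteness, here is a minimal sketch of how execution accuracy can be computed. This is illustrative only; the function and variable names are hypothetical and not the actual LOFT evaluation code.

```python
import sqlite3

def execution_accuracy(predicted_answers, gold_queries, db_path):
    """Fraction of examples where the predicted answer matches the result
    of executing the gold SQL query against the database."""
    conn = sqlite3.connect(db_path)
    correct = 0
    for pred_rows, gold_sql in zip(predicted_answers, gold_queries):
        gold_rows = conn.execute(gold_sql).fetchall()
        # Compare as sets of row tuples so row order does not matter.
        if set(map(tuple, pred_rows)) == set(map(tuple, gold_rows)):
            correct += 1
    conn.close()
    return correct / len(gold_queries)
```

The same check applies in both settings; only the way `predicted_answers` is obtained differs (direct answer from the long-context model vs. the result of executing model-generated SQL).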

Also DAIL has 86.5% accuracy on Spider-Web. Why do you put the accuracy to 70% instead?

  • Different eval set: The 86.5% number you refer to is, I believe, on the test set of Spider. In LOFT, we do not use the entire Spider set for evaluation. For each sized subset of LOFT (32k, 128k, 1M), we select 100 queries as the test set, and these 100 queries differ across the subsets (see Sec 2 in the paper). Moreover, these 100 queries are drawn from the union of the Spider train and validation sets, not the test set, since we did not have labels for the test set.
  • Implementation: We were inspired by DAIL to prompt an LLM as a semantic parser, but we do not use the same prompt as DAIL, and we use Gemini instead of GPT-4 as the semantic parser. Also, DAIL selects different few-shot examples for each test query, while we use a fixed set of examples (see Sec 4.5 in the paper); a rough sketch of this setup is included below.
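To make the "fixed few-shot examples" point concrete, here is a rough sketch of prompting an LLM as a text-to-SQL semantic parser with a fixed example block. The prompt wording, example queries, and helper names are hypothetical and not the exact prompt used in the paper.

```python
# Hypothetical sketch: a FIXED set of few-shot examples is reused for every
# test query (unlike DAIL-SQL, which selects different examples per query).
FIXED_FEW_SHOT_EXAMPLES = """\
-- Example 1
Schema: CREATE TABLE singer(singer_id INT, name TEXT, age INT);
Question: How many singers are there?
SQL: SELECT COUNT(*) FROM singer;

-- Example 2
Schema: CREATE TABLE concert(concert_id INT, year INT);
Question: List all concert years after 2014.
SQL: SELECT year FROM concert WHERE year > 2014;
"""

def build_parser_prompt(schema: str, question: str) -> str:
    """Build the prompt for the semantic parser; only the schema and
    question change between test queries, the examples stay fixed."""
    return (
        "Translate the question into a SQL query for the given schema.\n\n"
        + FIXED_FEW_SHOT_EXAMPLES
        + f"\n-- Test\nSchema: {schema}\nQuestion: {question}\nSQL:"
    )

# The SQL generated from this prompt is executed against the database, and
# the execution result is scored with execution accuracy as described above.
```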