google-deepmind / loft

LOFT: A 1 Million+ Token Long-Context Benchmark

SQL specialized vs LCM benchmark question #1

Open vkaul11 opened 1 week ago

vkaul11 commented 1 week ago

Thanks for the paper. I did not understand the difference between the SQL-specialized and Long Context Model metrics for the SQL task. For the SQL-specialized pipeline, do you use the DAIL-SQL prompt with Gemini and compare it against a plain "general" prompt where Gemini reasons and answers directly, without going through SQL generation? It would be good to get that clarification. Also, DAIL reports 86.5% accuracy on Spider. Why do you report the accuracy as 70% instead?

anthonywchen commented 4 days ago

Thanks for reading the paper. To answer your questions:

I did not understand the difference between SQL specialized vs Long Context Model metrics for the SQL task.

The metric for both settings is execution accuracy, i.e., whether the predicted answer matches the result of executing the gold SQL query against the database. What differs between the two settings is how the prediction is produced: the Long Context Model setting places the database contents in context and has the model answer directly, while the SQL-specialized pipeline has the model generate a SQL query that is then executed (see the second bullet below).
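For concreteness, here is a minimal sketch of how execution accuracy can be computed. This is illustrative only; the function and variable names are hypothetical and not the actual LOFT evaluation code.

```python
import sqlite3

def execution_accuracy(predicted_answers, gold_queries, db_path):
    """Fraction of examples where the predicted answer matches the result
    of executing the gold SQL query against the database."""
    conn = sqlite3.connect(db_path)
    correct = 0
    for pred_rows, gold_sql in zip(predicted_answers, gold_queries):
        gold_rows = conn.execute(gold_sql).fetchall()
        # Compare as sets of row tuples so row order does not matter.
        if set(map(tuple, pred_rows)) == set(map(tuple, gold_rows)):
            correct += 1
    conn.close()
    return correct / len(gold_queries)
```

The same check applies in both settings; only the way `predicted_answers` is obtained differs (direct answer from the long-context model vs. the result of executing model-generated SQL).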

Also DAIL has 86.5% accuracy on Spider-Web. Why do you put the accuracy to 70% instead?

  • Different eval set: The 86.5% number you refer to is, I believe, on the test set of Spider. In LOFT, we do not use the entire Spider set for evaluation. For each sized subset of LOFT (32k, 128k, 1M), we select 100 queries as the test set, and these 100 queries differ across the subsets (see Sec 2 in the paper). Moreover, these 100 queries are drawn from the union of the Spider train and validation sets, not the test set, since we did not have labels for the test set.
  • Implementation: We were inspired by DAIL to prompt an LLM as a semantic parser, but we do not use the same prompt as DAIL, and we use Gemini instead of GPT-4 as the semantic parser. Also, DAIL selects different few-shot examples for each test query, while we use a fixed set of examples (see Sec 4.5 in the paper); a rough sketch of this setup is included below.
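To make the "fixed few-shot examples" point concrete, here is a rough sketch of prompting an LLM as a text-to-SQL semantic parser with a fixed example block. The prompt wording, example queries, and helper names are hypothetical and not the exact prompt used in the paper.

```python
# Hypothetical sketch: a FIXED set of few-shot examples is reused for every
# test query (unlike DAIL-SQL, which selects different examples per query).
FIXED_FEW_SHOT_EXAMPLES = """\
-- Example 1
Schema: CREATE TABLE singer(singer_id INT, name TEXT, age INT);
Question: How many singers are there?
SQL: SELECT COUNT(*) FROM singer;

-- Example 2
Schema: CREATE TABLE concert(concert_id INT, year INT);
Question: List all concert years after 2014.
SQL: SELECT year FROM concert WHERE year > 2014;
"""

def build_parser_prompt(schema: str, question: str) -> str:
    """Build the prompt for the semantic parser; only the schema and
    question change between test queries, the examples stay fixed."""
    return (
        "Translate the question into a SQL query for the given schema.\n\n"
        + FIXED_FEW_SHOT_EXAMPLES
        + f"\n-- Test\nSchema: {schema}\nQuestion: {question}\nSQL:"
    )

# The SQL generated from this prompt is executed against the database, and
# the execution result is scored with execution accuracy as described above.
```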