Thanks for your attention.
Traditionally, both RATSQL and LGESQL perform schema linking at the word level for more accurate matching. A typical procedure is as follows:
However, as you may have noticed, the procedure above is rather complicated, and the word tokenizer (Step 1) significantly affects the final performance (especially on Chinese corpora). To this end, we simplify it in ASTormer:
This brings us to your concern: how do we perform schema linking if we use this simplified procedure (Step 1 and Step 2)? Our hypothesis is that schema linking performed at the sub-word/wordpiece level creates many false-positive links. For example, the sub-token "##man" would lead to a false partial match between the question ["who", "is", "this", "business", "##man", "?"] and the column ["wo", "##man", "name"] (a fabricated case). This problem is even more severe in Chinese, where each character is treated as a sub-token. Thus, for schema linking, we still use word-level matching. That is, we enumerate all possible word spans by composing sub-token combinations (excluding partial spans like "##man" or "##man name" to ensure well-formedness) and perform schema linking on these spans; see preprocess/preprocess_utils.py, lines 240-243.
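For intuition only, here is a minimal sketch of this idea (the helper name `enumerate_word_spans` and the details are mine, not the actual code around those lines): enumerate spans of sub-tokens whose boundaries align with word starts, so partial spans such as "##man" are never produced.

```python
# Hedged sketch: enumerate well-formed word spans from WordPiece output,
# skipping any span that would start or end in the middle of a word.
def enumerate_word_spans(subtokens, max_len=5):
    """Yield (start, end, text) for spans whose boundaries align with word starts."""
    def is_word_start(idx):
        return not subtokens[idx].startswith("##")

    n = len(subtokens)
    for i in range(n):
        if not is_word_start(i):            # span must begin at a word boundary
            continue
        for j in range(i + 1, min(i + max_len, n) + 1):
            # the span must also end at a word boundary, i.e. the next piece
            # (if any) starts a new word; otherwise we would cut a word in half
            if j < n and not is_word_start(j):
                continue
            text = "".join(t[2:] if t.startswith("##") else " " + t
                           for t in subtokens[i:j]).strip()
            yield i, j, text

# e.g. ["who", "is", "this", "business", "##man", "?"] yields "businessman"
# as one candidate span, but never the partial span "##man" on its own.
```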
BTW, the reason why I open-sourced this ASTormer framework is that most grammar-based text-to-SQL parsers rely heavily on very complicated and time-consuming preprocessing tricks, which makes it hard to generalize from one benchmark to another. Thus, we try to simplify the entire procedure and propose a succinct, efficient and unified grammar-based neural-symbolic framework. However, we must admit that dataset-oriented preprocessing really does help to achieve better performance. Thus, ASTormer is still inferior to its grammar-based competitors such as STAR, CQR-SQL, HIESQL, S2SQL, LGESQL, etc. But ASTormer can be very easily adapted to other benchmarks and serve as a simple baseline, and we experiment on five prevalent cross-domain text-to-SQL benchmarks, covering both English and Chinese.
This work was finished in 2022 and will not be maintained any more. For SOTA performance, I recommend using an LLM with ICL (though on Chinese benchmarks, smaller models may still be better given enough training data), or fine-tuning a token-based parser with billions of parameters such as RESDSQL if hardware permits. If you just need a baseline, feel free to use ASTormer and train one from scratch.
Thank you for your reply! You are truly a meticulous and conscientious person, and your explanation is very detailed! BTW, can I ask you one more question? Have you ever looked into sub-word relation matrix propagation, as in Graphix?
They did not use the Step 5 word-level LSTM mentioned above, and they additionally introduced a bridge node and discarded some non-local edges. Their code is really bad and messy, and most of it was copied from your LGESQL without citation.🤣 I want to ask: compared with propagating the sub-word relation matrix, is sub-word aggregation/pooling with a word-level LSTM superior? Or, in other words, are these two methods equivalent for question-schema linking? Thank you very much for your reply again! We also welcome you to our school for lectures, interviews, and exchanges.😄
I did not experimentally compare these two methods (the sub-word pooling layer and the sub-word relation matrix). From my perspective, they are usually adopted in different parser types.
For grammar-based parsers, the model architecture typically follows the paradigm: BERT-like encoder + GNN layers + a customized AST decoder. Since the decoder unit is mostly an action (such as ApplyRule or SelectTable), and many grammar-based decoders do not generate SQL values, parsers in this neural-symbolic branch usually adopt the sub-word pooling method, which introduces more preprocessing and dataset bias into the word vocabulary distribution. In ASTormer, however, we also generate SQL values and re-use the tokenized sub-words for SQL value generation. Thus, we remove the sub-word pooling layer, partly to support the sub-word copy operation in the decoder. However, as you can see, the performance is inferior to LGESQL (probably partly because fewer inductive biases are introduced at the word level).
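To make the contrast concrete, here is a hedged sketch of what a sub-word pooling layer does (mean pooling shown for simplicity; the actual aggregation in LGESQL uses its own pooling/LSTM design, and the helper below is hypothetical):

```python
# Sketch: collapse PLM sub-word states into one vector per word, so that
# downstream GNN layers and a grammar-based decoder see word-level nodes only.
import torch

def pool_subwords(hidden, word_ids, num_words):
    """hidden: (seq_len, dim) PLM outputs; word_ids: parent word index of each
    sub-word position (assume -1 for special tokens like [CLS]/[SEP])."""
    dim = hidden.size(-1)
    pooled = hidden.new_zeros(num_words, dim)
    counts = hidden.new_zeros(num_words, 1)
    for pos, w in enumerate(word_ids):
        if w < 0:                      # skip special tokens
            continue
        pooled[w] += hidden[pos]       # accumulate sub-word states per word
        counts[w] += 1
    return pooled / counts.clamp(min=1)   # mean over each word's sub-words
```

Removing this layer (as in ASTormer) simply means the encoder, the relation matrix, and the decoder's copy mechanism all stay at the sub-word level.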
For token-based parsers (such as the GraphixT5 you mentioned), to the best of my knowledge, they usually adopt an encoder-decoder architecture like T5. The decoder unit at each timestep is a sub-word and shares the same vocabulary with the encoder. Thus, it is more suitable to maintain the sub-word-level representation, without introducing any sub-word pooling layer, in order to stay compatible with the pre-training stage. Generally, there are three strategies to introduce schema linking information into a token-based parser:
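As a purely illustrative sketch of the relation-matrix route discussed in this thread (this is my own toy code, not Graphix's or ASTormer's implementation): word-level schema-linking relations can be propagated to the sub-word level by letting every sub-word inherit the relation of its parent word.

```python
# Sketch: expand a word-level question/schema relation matrix to sub-word level.
import numpy as np

def expand_relations(word_rel, q_word_ids, s_word_ids):
    """word_rel[i, j]: relation id between question word i and schema item j.
    q_word_ids: parent word index of each question sub-word.
    s_word_ids: parent item index of each schema sub-word."""
    sub_rel = np.zeros((len(q_word_ids), len(s_word_ids)), dtype=word_rel.dtype)
    for qi, qw in enumerate(q_word_ids):
        for si, sw in enumerate(s_word_ids):
            sub_rel[qi, si] = word_rel[qw, sw]   # each sub-word pair reuses its parents' relation
    return sub_rel
```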
In summary, here are my thoughts:
Thank you very much for your reply! It is really thoughtful and meticulous! You are also welcome to come to our school for exchange and sharing. Thank you very, very much!
Hello~ rhythm~ I want to ask why word-level schema linking is used instead of a sub-word-level schema linking process. I found a curious issue while reading the code: if a PLM is used for tokenization, would schema linking at the sub-word level be better than at the word level? Have you ever run any tests on this? Thank you very much for your reply.
I ask because I found that LGESQL indeed has an aggregation/pooling function at the sub-word token level, so I am curious why RAT-SQL only needs the word level. In addition, your LGESQL is written very well; it is a rare and foundational work for the text-to-SQL task. Thank you very much for your contribution to the community.