Thanks for your attention.
Traditionally, both RATSQL and LGESQL perform schema linking at the word level for more accurate matching. A typical procedure is as follows:
However, as you may have noticed, the procedure above is rather complicated, and the word tokenizer (Step 1) significantly affects the final performance (especially on Chinese corpora). To this end, we simplify it in ASTormer:
This brings us to your concern: how do we perform schema linking if we use this simplified procedure (Step 1 and Step 2)? Our hypothesis is that schema linking performed at the sub-word/wordpiece level creates many false-positive links. For example, the sub-token "##man" would lead to a false partial match between the question ["who", "is", "this", "business", "##man", "?"] and the column ["wo", "##man", "name"] (a fabricated case). This problem is even more severe in Chinese, where each character is treated as a sub-token. Thus, for schema linking, we still use word-level matching. That is, we enumerate all possible word spans by composing sub-token combinations (excluding partial spans like "##man" or "##man name" to ensure well-formedness) and perform schema linking on these spans; see preprocess/preprocess_utils.py, lines 240-243.
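For intuition only, here is a minimal sketch of this idea (the helper name `enumerate_word_spans` and the details are mine, not the actual code around those lines): enumerate spans of sub-tokens whose boundaries align with word starts, so partial spans such as "##man" are never produced.

```python
# Hedged sketch: enumerate well-formed word spans from WordPiece output,
# skipping any span that would start or end in the middle of a word.
def enumerate_word_spans(subtokens, max_len=5):
    """Yield (start, end, text) for spans whose boundaries align with word starts."""
    def is_word_start(idx):
        return not subtokens[idx].startswith("##")

    n = len(subtokens)
    for i in range(n):
        if not is_word_start(i):            # span must begin at a word boundary
            continue
        for j in range(i + 1, min(i + max_len, n) + 1):
            # the span must also end at a word boundary, i.e. the next piece
            # (if any) starts a new word; otherwise we would cut a word in half
            if j < n and not is_word_start(j):
                continue
            text = "".join(t[2:] if t.startswith("##") else " " + t
                           for t in subtokens[i:j]).strip()
            yield i, j, text

# e.g. ["who", "is", "this", "business", "##man", "?"] yields "businessman"
# as one candidate span, but never the partial span "##man" on its own.
```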
BTW, the reason why I open-sourced this ASTormer framework is that most grammar-based text-to-SQL parsers rely heavily on very complicated and time-consuming preprocessing tricks, which makes it hard to generalize from one benchmark to another. Thus, we try to simplify the entire procedure and propose a succinct, efficient and unified grammar-based neural-symbolic framework. However, we must admit that dataset-oriented preprocessing really does help to achieve better performance. Thus, ASTormer is still inferior to its grammar-based competitors such as STAR, CQR-SQL, HIESQL, S2SQL, LGESQL, etc. But ASTormer can be very easily adapted to other benchmarks and serve as a simple baseline, and we experiment on five prevalent cross-domain text-to-SQL benchmarks, covering both English and Chinese.
This work was finished in 2022 and will not be maintained any more. For SOTA performance, I recommend using an LLM with ICL (though on Chinese benchmarks, smaller models may still be better given enough training data), or fine-tuning a token-based parser with billions of parameters such as RESDSQL if hardware permits. If you just need a baseline, feel free to use ASTormer and train one from scratch.
Thank you for your reply! You are truly a meticulous and conscientious person, and your explanation is very detailed! BTW, can I ask you one more question? Have you ever looked into sub-word relation matrix propagation, as in Graphix?
They did not use the Step 5 word-level LSTM mentioned above, and they additionally introduced a bridge node and discarded some non-local edges. Their code is really bad and messy, and most of it was copied from your LGESQL without citation.🤣 I want to ask: compared with propagating the sub-word relation matrix, is sub-word aggregation/pooling with a word-level LSTM superior? Or, in other words, are these two methods equivalent for question-schema linking? Thank you very much for your reply again! We also welcome you to our school for lectures, interviews, and exchanges.😄
I did not experimentally compare these two methods (the sub-word pooling layer and the sub-word relation matrix). From my perspective, they are usually adopted in different parser types.
For grammar-based parsers, the model architecture typically follows the paradigm: BERT-like encoder + GNN layers + a customized AST decoder. Since the decoder unit is mostly an action (such as ApplyRule or SelectTable), and many grammar-based decoders do not generate SQL values, parsers in this neural-symbolic branch usually adopt the sub-word pooling method, which introduces more preprocessing and dataset bias into the word vocabulary distribution. In ASTormer, however, we also generate SQL values and re-use the tokenized sub-words for SQL value generation. Thus, we remove the sub-word pooling layer, partly to support the sub-word copy operation in the decoder. However, as you can see, the performance is inferior to LGESQL (probably partly because fewer inductive biases are introduced at the word level).
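To make the contrast concrete, here is a hedged sketch of what a sub-word pooling layer does (mean pooling shown for simplicity; the actual aggregation in LGESQL uses its own pooling/LSTM design, and the helper below is hypothetical):

```python
# Sketch: collapse PLM sub-word states into one vector per word, so that
# downstream GNN layers and a grammar-based decoder see word-level nodes only.
import torch

def pool_subwords(hidden, word_ids, num_words):
    """hidden: (seq_len, dim) PLM outputs; word_ids: parent word index of each
    sub-word position (assume -1 for special tokens like [CLS]/[SEP])."""
    dim = hidden.size(-1)
    pooled = hidden.new_zeros(num_words, dim)
    counts = hidden.new_zeros(num_words, 1)
    for pos, w in enumerate(word_ids):
        if w < 0:                      # skip special tokens
            continue
        pooled[w] += hidden[pos]       # accumulate sub-word states per word
        counts[w] += 1
    return pooled / counts.clamp(min=1)   # mean over each word's sub-words
```

Removing this layer (as in ASTormer) simply means the encoder, the relation matrix, and the decoder's copy mechanism all stay at the sub-word level.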
For token-based parsers (such as the GraphixT5 you mentioned), to the best of my knowledge, they usually adopt an encoder-decoder architecture like T5. The decoder unit at each timestep is a sub-word and shares the same vocabulary with the encoder. Thus, it is more suitable to maintain the sub-word-level representation, without introducing any sub-word pooling layer, in order to stay compatible with the pre-training stage. Generally, there are three strategies to introduce schema linking information into a token-based parser:
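As a purely illustrative sketch of the relation-matrix route discussed in this thread (this is my own toy code, not Graphix's or ASTormer's implementation): word-level schema-linking relations can be propagated to the sub-word level by letting every sub-word inherit the relation of its parent word.

```python
# Sketch: expand a word-level question/schema relation matrix to sub-word level.
import numpy as np

def expand_relations(word_rel, q_word_ids, s_word_ids):
    """word_rel[i, j]: relation id between question word i and schema item j.
    q_word_ids: parent word index of each question sub-word.
    s_word_ids: parent item index of each schema sub-word."""
    sub_rel = np.zeros((len(q_word_ids), len(s_word_ids)), dtype=word_rel.dtype)
    for qi, qw in enumerate(q_word_ids):
        for si, sw in enumerate(s_word_ids):
            sub_rel[qi, si] = word_rel[qw, sw]   # each sub-word pair reuses its parents' relation
    return sub_rel
```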
In summary, here are my thoughts:
Thank you very much for your reply! It is really thoughtful and meticulous! You are also welcome to come to our school for exchange and sharing. Thank you very, very much!
Hello~ rhythm~ I want to ask why word-level schema linking is used instead of a sub-word-level schema linking process. I found a curious issue while reading the code: if a PLM is used for tokenization, would schema linking at the sub-word level be better than at the word level? Have you ever run any tests on this? Thank you very much for your reply.
I ask because I found that LGESQL indeed has an aggregation/pooling function at the sub-word token level, so I am curious why RAT-SQL only needs the word level. In addition, your LGESQL is written very well; it is a rare and foundational work for the text-to-SQL task. Thank you very much for your contribution to the community.