microsoft / rat-sql

A relation-aware semantic parsing model from English to SQL
https://arxiv.org/abs/1911.04942
MIT License

Schema Modeling #68

Closed manzambi11 closed 2 years ago

manzambi11 commented 2 years ago

I want to understand how the database schema is encoded. My intuition is that if we only encode the schemas seen in the training set, then when dealing with unseen data the model will struggle to encode the new schema, because its names become unknown (UNK) tokens. Let me give an example to clarify my point:

If my training set contains a schema with T_Student, student_name, student_age, student_option, and my validation set contains a schema that never appeared during training, say table_student, table_name, table_age, then all of those names would be encoded as UNK.

Can someone help me understand how RAT-SQL deals with unseen schemas and how the schema encoding is done?

rafiip commented 2 years ago

I believe you are confused about the concept of an unseen schema. The database schema is always given to the model as input, whether you are in the training or the testing phase. At test time the schema simply has not been seen before, meaning its internal relations and/or tokens are new to the model. That does not mean the model cannot encode them.
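
A minimal Python sketch of that idea (not the repo's actual data pipeline; the question, schema, and serialization below are made up for illustration): the schema is serialized together with the question and fed to the encoder at both training and test time, so an "unseen" schema is still fully visible to the model, just with new names and relations.

```python
# Hypothetical illustration, not RAT-SQL's real preprocessing: the schema is
# part of the encoder input at train time and at test time alike.
question = "How old is the youngest student?"
schema = {  # an "unseen" schema: new names, but still given to the model
    "table_student": ["table_name", "table_age"],
}

# Serialize question words, table names, and column names into one joint
# input sequence, as a relation-aware encoder consumes them.
tokens = question.lower().rstrip("?").split()
for table, columns in schema.items():
    tokens.append(table)
    tokens.extend(columns)

print(tokens)
# ['how', 'old', 'is', 'the', 'youngest', 'student',
#  'table_student', 'table_name', 'table_age']
```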

manzambi11 commented 2 years ago

I didn't see it stated anywhere, even in the paper, that the schema is always known to the model.

According to the paper, the challenge in text-to-SQL is to jointly encode the question and the schema so that the model generalizes well and keeps high accuracy even on unseen schemas.

Say I use the pretrained RAT-SQL with my own dataset. How will my schema be encoded? Column and table names in a database are often not natural language, e.g. t_name instead of name.

So how does RAT-SQL handle this problem?

manzambi11 commented 2 years ago

I think this ticket can be closed. After reading further, I understand that BERT handles unknown words by splitting them into subword pieces (down to single characters if necessary), so any word gets embedding vectors.
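
For concreteness, a quick check with the HuggingFace transformers library (not part of this repo; the model name and the example identifiers are just assumptions) shows how WordPiece breaks schema-style names into known pieces instead of a single UNK token:

```python
# Minimal sketch using HuggingFace transformers (an external library) to show
# that BERT's WordPiece tokenizer splits never-seen identifiers into known
# subword pieces, so each piece still gets an embedding vector.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for name in ["t_name", "student_option", "table_student"]:
    pieces = tokenizer.tokenize(name)
    print(name, "->", pieces)
# Exact pieces depend on the vocabulary; e.g. "t_name" typically comes out as
# something like ['t', '_', 'name'] rather than a single unknown token.
```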