microsoft / IRNet

An algorithm for cross-domain NL2SQL
MIT License

Is there any effect on accuracy as number of columns and tables in one database increases? #41

Open TapanHP1995 opened 4 years ago

TapanHP1995 commented 4 years ago

I have added one more database besides Spider and am trying to run predictions with my own queries. The database is AdventureWorks, which is popular and large: it has 71 tables and 400+ columns overall, with many-to-many relationships among tables. After incorporating it, my train accuracy drops to only 10%, and almost all queries fail. Is there any limitation of IRNet in terms of database table/column size?

In Spider, the databases have a limited number of tables/columns. I am curious whether anyone has tried this with their own large databases and has some insights.

brunnurs commented 4 years ago

This is a good question, I think, as most real-world databases are clearly larger than the Spider examples. Some thoughts:

I'm wondering whether you used a transformer encoder with your AdventureWorks database, because in my experience 71 tables and 400+ columns would not work with a transformer encoder as-is. BERT's max sequence length, for example, is 512 tokens, which already gets critical for some databases in Spider (e.g. have a look at "baseball_1", which almost maxes it out, or even exceeds it once you apply WordPiece tokenization). I therefore guess you are using the classical IRNet schema encoding based on RNNs, which can handle any input size but will also drop in performance.
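For illustration, here is a minimal sketch of checking how many WordPiece tokens a serialized schema plus question consumes against BERT's 512-token limit. The flat "table : col , col" serialization and the example tables are only an assumption, not IRNet's actual encoding:

```python
# Minimal sketch: count WordPiece tokens of question + serialized schema and
# compare against BERT's 512-token limit. The serialization format and the
# schema excerpt below are hypothetical, not IRNet's actual encoding.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical excerpt of a large schema (table -> columns).
schema = {
    "SalesOrderHeader": ["SalesOrderID", "OrderDate", "CustomerID", "TotalDue"],
    "SalesOrderDetail": ["SalesOrderDetailID", "ProductID", "OrderQty", "UnitPrice"],
    "Product": ["ProductID", "Name", "ListPrice"],
}

serialized = " [SEP] ".join(
    f"{table} : {' , '.join(columns)}" for table, columns in schema.items()
)
question = "What is the total due for each customer?"

tokens = tokenizer.tokenize(question) + ["[SEP]"] + tokenizer.tokenize(serialized)
print(f"{len(tokens)} WordPiece tokens (BERT's limit is 512 including [CLS]/[SEP])")
```

Running this over all 71 tables and 400+ columns would make it obvious how far past the limit the full AdventureWorks schema lands.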

I think the best way to improve your performance is to reduce the schema size. Abstract away from the low-level database schema and think about which entities and attributes your users will really use. Obviously you lose a bit of the original NLI idea, but normally a database contains a lot of technical details a user would never ask for. If you then have a smaller, more abstract schema, you only need to map some technicalities back in a deterministic post-processing step. For example, if you build one abstract entity car for the tables _carmodel and _carparts, you might have to rewrite the generated query accordingly before executing it.
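A minimal sketch of such a deterministic post-processing step, assuming the model predicts SQL against a small abstract schema and a fixed rewrite expands each abstract entity into the underlying physical tables (the car mapping, column names, and naive string rewrite are all hypothetical illustrations):

```python
# Minimal sketch: expand abstract entities in the predicted SQL into the real
# tables behind them. The mapping and the naive string rewrite are only for
# illustration; a production version would rewrite a parsed SQL AST instead.
ABSTRACT_TO_PHYSICAL = {
    # abstract table -> SQL fragment joining the real tables behind it
    "car": "(_carmodel JOIN _carparts ON _carmodel.model_id = _carparts.model_id) AS car",
}

def expand_abstract_tables(predicted_sql: str) -> str:
    """Replace abstract table names in the FROM clause with their physical joins."""
    sql = predicted_sql
    for abstract, physical in ABSTRACT_TO_PHYSICAL.items():
        sql = sql.replace(f"FROM {abstract}", f"FROM {physical}")
    return sql

print(expand_abstract_tables("SELECT car.name FROM car"))
```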

A second way to reduce the schema size is to split the database into sub-clusters. Often some parts of the database are very closely coupled; try to identify those clusters. Once you have found them, train a classifier that assigns each question to one of the clusters, and only then start the text-to-SQL process.
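As a sketch of that routing idea, a simple text classifier is often enough to pick the sub-schema before running text-to-SQL. The cluster names and training questions below are made up for illustration:

```python
# Minimal sketch: route the question to one schema cluster, then run the
# text-to-SQL model only against that cluster's tables. Training data and
# cluster names are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_questions = [
    "How many orders were placed last month?",
    "What is the total revenue per customer?",
    "List all employees hired in 2010.",
    "Which department has the most employees?",
]
train_clusters = ["sales", "sales", "hr", "hr"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(train_questions, train_clusters)

question = "Show the average order value per region."
cluster = router.predict([question])[0]
print(f"Route '{question}' to the '{cluster}' sub-schema before running IRNet.")
```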

A third way is not to reduce the schema size at all but to use a transformer encoder that can handle much larger inputs. Have a look at this transformer approach, which was released recently: https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html. But even here the performance might drop quite a bit, since in the end there are still pointer networks that need to select a column/table, and the more options there are, the harder that gets.
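To illustrate the encoder swap, here is a minimal sketch using Longformer from HuggingFace transformers rather than the Reformer linked above, simply because it accepts sequences up to 4096 tokens with the standard API; the serialized schema string is a hypothetical, truncated example:

```python
# Minimal sketch: encode a long question + schema sequence with a long-input
# transformer (Longformer, up to 4096 tokens). The schema string is a
# hypothetical excerpt; a real run would serialize all 71 tables.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

question = "What is the total due for each customer?"
serialized_schema = (
    "SalesOrderHeader : SalesOrderID , OrderDate , CustomerID , TotalDue | "
    "Product : ProductID , Name , ListPrice"
)  # ... the remaining tables would follow here

inputs = tokenizer(question, serialized_schema, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; a pointer network would still have to score
# the column/table positions among these, which gets harder as the schema grows.
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```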

Hope that helps.

Ajit-WingsBI commented 2 years ago

Hi @brunnurs, thanks for your descriptive answer; it was helpful. Since this question is from two years ago, I'm wondering whether you can share any updated information that would help resolve the issue, or any better alternative to IRNet?