salesforce / WikiSQL

A large annotated semantic parsing corpus for developing natural language interfaces.
BSD 3-Clause "New" or "Revised" License

Using this method on my own SQL database #19

Closed Sandy26 closed 6 years ago

Sandy26 commented 6 years ago

Hi @vzhong,

I would like to get inferences for my own SQL table. For a similar question asked in Jan'18 you replied-

"You would have to train a model on this data, then perform inference on your data. Xiaojun and Chang from Berkeley has kindly made their model available here: https://github.com/xiaojunxu/SQLNet".

I am a little confused. Won't I need to train on my own SQL table? My column names could be very different. Won't I need to create .jsonl files like the ones you have in the "data" directory? Could you please help me understand your comment above?

Thank you, Shruti

vzhong commented 6 years ago

Hi Shruti,

This really depends on your data distribution. If it is very similar to WikiSQL, then you don't need to retrain. In any event, it should always help to retrain/finetune on your own dataset.

You need to convert your table schema and queries to the format found in the .json and .jsonl files in the data directory. Once you've done that, you can use a model like Seq2SQL or SQLNet or whatever you would like to train and do inference on your own data.
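As a rough illustration of that conversion, here is a minimal Python sketch that builds one table record and one labeled question record. The field names (`id`, `header`, `types`, `rows`, `question`, `table_id`, `sql` with `sel`, `agg`, `conds`) and the operator lists follow the layout described in the WikiSQL README as best recalled here, so treat them as assumptions and check them against the files in the data directory. The `orders` table, its columns, and the helper names are hypothetical.

```python
import json

# Operator vocabularies as recalled from the WikiSQL README (assumption:
# verify against the repository before relying on these indices).
AGG_OPS = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
COND_OPS = ['=', '>', '<', 'OP']


def make_table_record(table_id, header, types, rows):
    """Build one line of a tables file from your own schema and rows."""
    return {"id": table_id, "header": header, "types": types, "rows": rows}


def make_example(question, table, select_column, agg, conditions):
    """Build one labeled example; columns and operators become indices."""
    header = table["header"]
    conds = [
        [header.index(column), COND_OPS.index(op), value]
        for column, op, value in conditions
    ]
    return {
        "question": question,
        "table_id": table["id"],
        "sql": {
            "sel": header.index(select_column),
            "agg": AGG_OPS.index(agg),
            "conds": conds,
        },
    }


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Hypothetical table and question, for illustration only.
orders_table = make_table_record(
    "orders",
    ["Order year", "Manufacturer", "Quantity"],
    ["real", "text", "real"],
    [[1998, "Acme", 12], [1999, "Globex", 7]],
)
example = make_example(
    "who is the manufacturer for the order year 1998?",
    orders_table,
    select_column="Manufacturer",
    agg="",
    conditions=[("Order year", "=", 1998)],
)
```

Calling `write_jsonl("train.jsonl", [example])` and `write_jsonl("train.tables.jsonl", [orders_table])` would then produce files in the same one-object-per-line shape; the file names here are only placeholders.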

Sandy26 commented 6 years ago

Hi Victor, I believe my dataset is very different from the WikiSQL dataset, so I will have to train from scratch.

I see only .jsonl files in the data directory. What sort of files are the .json ones? Also, why are train.jsonl and test.jsonl in exactly the same format? Shouldn't test.jsonl only have a question and table schema? Why does it have the sql part too? Shouldn't we be predicting that with the model? Do you have a sample file with just a test question and table schema that I can give to the model to get an SQL query as output?

This is what I understand so far. This is a supervised learning process with the following steps (please feel free to correct me):

Step 1: Take the files data/train.jsonl and data/dev.jsonl and create annotate_ent/train.jsonl and annotate_ent/dev.jsonl using annotate.py

Step 2: Train a model m.pt with the above files in annotate_ent

Step 3: Run annotate.py on test.jsonl (I am not really sure what this file should look like, but ideally it will have my question and table schema) and create annotate_ent/test.jsonl

Step 4: Predict the SQL query from the file in Step 3 using the model from Step 2

But I am going in circles a bit as to why your test.jsonl also has the "sql" part. If you could help clear up the above steps, that would be great!

Thank you very much, Shruti

vzhong commented 6 years ago

The annotated labels are in the test split so we can evaluate the test set predictions. The model should not need test labels to make inference. Again, we don’t provide the model here. You can refer to the SQLNet repo for reference implementations.

Sandy26 commented 6 years ago

Hi Victor, the model is not a problem. Do the steps mentioned above make sense? What I am confused about is that you have given an example prediction file, example.pred.dev.jsonl. Which function in your code actually creates this file? I was unable to locate it. If I could get that, it would be very helpful. Thank you, Shruti

vzhong commented 6 years ago

That is an example prediction file that can be used with the evaluation script. You can disregard this file for your purpose because you only need to run inference.
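To make the role of a prediction file concrete, here is a minimal Python sketch of how predictions could be compared with the labels in a labeled split. It assumes each gold line stores its label under `sql` and each prediction line stores its output under `query`, which is how the example prediction file is recalled here; both key names should be checked against the repository. The function names are hypothetical, and this is not the repository's evaluation script, only an approximation of a logical-form comparison.

```python
import json


def normalize(query):
    """Reduce a query dict to a comparable tuple, ignoring condition order."""
    conds = sorted(
        (int(column), int(op), str(value).lower())
        for column, op, value in query.get("conds", [])
    )
    return (int(query["sel"]), int(query["agg"]), tuple(conds))


def logical_form_accuracy(gold_lines, pred_lines):
    """Fraction of predictions whose query matches the gold label."""
    correct = 0
    total = 0
    for gold_line, pred_line in zip(gold_lines, pred_lines):
        gold = json.loads(gold_line)["sql"]
        pred = json.loads(pred_line).get("query")  # missing key counts as wrong
        total += 1
        if pred is not None and normalize(pred) == normalize(gold):
            correct += 1
    return correct / total if total else 0.0


# Hypothetical gold and prediction lines, for illustration only.
gold_lines = [
    '{"question": "q1", "table_id": "t", "sql": {"sel": 1, "agg": 0, "conds": [[0, 0, "1998"]]}}',
    '{"question": "q2", "table_id": "t", "sql": {"sel": 2, "agg": 3, "conds": []}}',
]
pred_lines = [
    '{"query": {"sel": 1, "agg": 0, "conds": [[0, 0, "1998"]]}}',
    '{"note": "no query produced"}',
]
accuracy = logical_form_accuracy(gold_lines, pred_lines)
```

Treating conditions as order-independent and lowercasing values is a design choice of this sketch; a stricter or looser comparison may suit other data.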

Sandy26 commented 6 years ago

Sure, but then to run inference, do you have a sample test file? Or should it look like this: { "question":"who is the manufacturer for the order year 1998?", "header":[ "State/territory", "Text/background colour", "Format", "Current slogan", "Current series", "Notes" ], "table_id":"1-10007452-3" }

And what if I have several tables and am not sure which table to run the query on?

vzhong commented 6 years ago

Your test file format will likely be model specific. A good starting point is the jsonl files in this repo (but without the labels). In this work we assume that you already know which table the question is asking about.
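One way to get such a starting point is to drop the label from each record of an existing .jsonl file. The Python sketch below does that, assuming the label lives under a top-level `sql` key as in the files discussed above. The function name and the sample record are hypothetical, and the fields a given model actually expects at inference time may differ.

```python
import json


def strip_labels(lines, label_key="sql"):
    """Yield each JSON line with the label field removed; skip blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        record.pop(label_key, None)  # no error if the key is already absent
        yield json.dumps(record)


# Hypothetical labeled record, for illustration only.
labeled = [
    '{"question": "who is the manufacturer for the order year 1998?", '
    '"table_id": "1-10007452-3", '
    '"sql": {"sel": 1, "agg": 0, "conds": [[0, 0, "1998"]]}}',
]
unlabeled = list(strip_labels(labeled))
```

To process a file, pass an open file handle as `lines` and write each yielded string followed by a newline to the output file.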

Sandy26 commented 6 years ago

Thank you Victor. It was helpful talking to you! Good Luck with your research!

vzhong commented 6 years ago

Thank you!

kanishkaashish commented 4 years ago

Hi Shruti, I'm working on the State of M.P. police project and want to implement the Seq2SQL model on our database. Can you guide me on what changes I have to make to my database to use it in place of WikiSQL?