jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.
http://jkk.name/text2sql-data/

Error for wikisql.json and spider.json while generating test,train and dev split #46

Closed anshudaur closed 4 years ago

anshudaur commented 4 years ago

Hi,

I am getting the Unicode encode error below when running json_to_flat.py to generate the splits for both the wikisql and spider datasets:

json_to_flat.py", line 50, in convert_instance
    print(text, "|||", sql, file=output_file)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uff1f' in position 123: character maps to <undefined>

Please help.

Best Regards Anshu

jkkummerfeld commented 4 years ago

Are you using python 2? On my Mac I found that it worked with Python 3, but not with Python 2.

anshudaur commented 4 years ago

Hi Jonathan, I am using Python 3.7.6 on a Windows machine. Thanks, Anshu

anshudaur commented 4 years ago

Hi, I was able to generate the splits for the advising and atis datasets, but I am getting this error for wikisql and spider. The train set does get generated, and I can see at which entry in the JSON it actually failed. For the spider.json file, it fails after:

convert_instance --- {'query-split': 'N/A', 'sentences': [{'database': 'world_1', 'original': 'What is the total population and average area of countries in the continent of North America whose area is bigger than 3000?', 'question-split': 'dev', 'text': 'What is the total population and average area of countries in the continent of var0 whose area is bigger than var1?', 'variables': {'var0': 'North America', 'var1': '3000'}}, {'database': 'world_1', 'original': 'Give the total population and average surface area corresponding to countries in Noth America that have a surface area greater than 3000.', 'question-split': 'dev', 'text': 'Give the total population and average surface area corresponding to countries in Noth America that have a surface area greater than var1 .', 'variables': {'var1': '3000'}}], 'sql': ['SELECT AVG( SURFACEAREA ) , SUM( COUNTRYalias0.POPULATION ) FROM COUNTRY AS COUNTRYalias0 WHERE COUNTRYalias0.CONTINENT = "var0" AND SURFACEAREA > var1 ;'], 'sql-original': ['SELECT sum(Population) , avg(SurfaceArea) FROM country WHERE Continent = "North America" AND SurfaceArea > 3000'], 'variables': [{'example': 'North America', 'location': 'both', 'name': 'var0', 'type': 'unknown'}, {'example': '3000', 'location': 'both', 'name': 'var1', 'type': 'unknown'}]}

So I think the error is not related to the Python version (3.7). Thank you so much :) Best Regards, Anshu

jkkummerfeld commented 4 years ago

It's definitely a unicode compatibility issue. I've just pushed an update that replaces all unicode characters in Spider with their ascii equivalents.
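
As an illustration only (not necessarily how the Spider file was updated), compatibility characters such as the fullwidth question mark '\uff1f' from the traceback can be mapped to their plain equivalents with Unicode NFKC normalization from Python's standard library. The function name `normalize_compat` and the sample string are hypothetical; note that NFKC keeps characters such as accented letters, so it does not guarantee pure ASCII output.

```python
import unicodedata


def normalize_compat(text):
    """Map compatibility characters (e.g. fullwidth '\uff1f') to their plain forms.

    Hypothetical helper for illustration; it is not part of text2sql-data.
    """
    return unicodedata.normalize("NFKC", text)


# Made-up sample ending in the fullwidth question mark from the error message.
sample = "whose area is bigger than 3000\uff1f"
normalized = normalize_compat(sample)
print(normalized)
```

Accented letters are left as they are by this step, so a dataset like WikiSQL would still contain non-ASCII characters afterwards.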

For WikiSQL the problem is trickier. There are a lot of unicode characters in there that can't be easily replaced without losing information (e.g. diacritical marks). I would suggest narrowing it down to one example (as you did above) then doing some searching about the specific unicode character causing problems.
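
A minimal sketch of that narrowing-down step, assuming the failing codec is cp1252 as shown in the traceback: it lists every character in a string that the codec cannot represent, together with its position and code point. The helper name `find_unencodable` and the sample line are hypothetical and not part of the repository.

```python
def find_unencodable(text, encoding="cp1252"):
    """Return (position, character, code point) for characters `encoding` cannot represent."""
    problems = []
    for pos, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            problems.append((pos, ch, "U+%04X" % ord(ch)))
    return problems


# Made-up sample: the accented letter is representable in cp1252,
# the fullwidth question mark is not.
line = "What is the population of Qu\u00e9bec\uff1f"
found = find_unencodable(line)
for pos, ch, code_point in found:
    print(pos, repr(ch), code_point)
```

Separately, Python's built-in `open` accepts an `encoding` argument, so an output file opened with `encoding="utf-8"` does not depend on the Windows locale codec; whether that suits downstream tools is something to check for your own setup.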

anshudaur commented 4 years ago

Hi, thank you so much. I tried the new spider.json file to generate splits for query-split, but now only the train split is generated and the rest are empty.

Best Anshu

jkkummerfeld commented 4 years ago

Hm, I'm not seeing that behaviour. I get both train and dev. Here is exactly what I ran and what I got:

~/Downloads/text2sql-data (master)$ echo data/spider.json | python ./tools/json_to_flat.py spider
~/Downloads/text2sql-data (master)$ ls -l | grep 'spider'
-rw-r--r--   1 jkk  staff   262067 Mar 31 10:54 spider.dev
-rw-r--r--   1 jkk  staff        0 Mar 31 10:54 spider.test
-rw-r--r--   1 jkk  staff  2386193 Mar 31 10:54 spider.train

Note that our data does not contain the Spider test set, which is kept private by the Yale group that developed it.

anshudaur commented 4 years ago

Hi Jonathan, thank you so much! I got the files converted on a Mac, as fixing the error on Windows was taking a lot of time. For Spider there is no dev set for the query split, is that correct? And for the WikiSQL query split, there is only a training split and no dev or test splits. Can you please confirm that both details in the screenshot below are correct (just so I know that I got the right files)?

image

Thanks and Best Regards Anshu

jkkummerfeld commented 4 years ago

I just applied the process on my machine and got the same file sizes.

For WikiSQL, that's correct. Our query split definition doesn't apply effectively to it (though more general forms of templating could). For Spider, we did not use it in our paper (it was published later), so we did not define a query split.

Sounds like this is resolved, so I'm going to close this issue. Good luck with your work!

anshudaur commented 4 years ago

Thanks :)