tdoehmen / gitschemas

Creative Commons Zero v1.0 Universal

Dataset of parsed and unparsed SQL files #1

Open ivbeg opened 2 years ago

ivbeg commented 2 years ago

Hi Till!

Great project! I am impressed by the dataset and the research paper. Could you please add a raw dataset of the collected SQL files, including both parsed and unparsed files? It could be helpful for future research and for the development of universal SQL parsing tools.

Best Regards, Ivan

tdoehmen commented 2 years ago

Hi Ivan,

glad to hear that you liked our work.

What I can share right now is a list of ~600k URLs that point to raw .SQL files on GitHub. I've uploaded it here.

We used pglast for parsing, which succeeded for about 15% of the files. A great project which recently helped us to achieve a ~50% success rate on the same corpus is simple-ddl-parser - it's worth having a look at this project if you haven't already.
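For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how a parse success rate could be measured. It is not the code used for the paper: `parse_success_rate` and `toy_parse` are hypothetical names, the toy parser is only a stand-in, and treating "did not raise an exception" as success is an assumption.

```python
from typing import Callable, Iterable


def parse_success_rate(sql_texts: Iterable[str], parse: Callable[[str], object]) -> float:
    """Return the fraction of inputs that `parse` accepts without raising."""
    total = 0
    succeeded = 0
    for text in sql_texts:
        total += 1
        try:
            parse(text)
        except Exception:
            # Assumption: any exception counts as a parse failure for this file.
            continue
        succeeded += 1
    return succeeded / total if total else 0.0


def toy_parse(sql: str) -> str:
    """Hypothetical stand-in parser: accepts only CREATE TABLE statements."""
    if not sql.lstrip().upper().startswith("CREATE TABLE"):
        raise ValueError("unsupported statement")
    return sql


if __name__ == "__main__":
    samples = [
        "CREATE TABLE t (id int);",
        "SELECT 1",
        "create table x (a text)",
    ]
    print(f"success rate: {parse_success_rate(samples, toy_parse):.2f}")
```

With the real libraries installed, `parse` could presumably be `pglast.parse_sql`, or a small wrapper around `DDLParser(text).run()` from simple-ddl-parser. What should count as a success for each parser (for example, an empty result without an exception) is a judgment call this sketch does not settle.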

Developing a more robust or universal SQL parser is certainly a very interesting topic. If you are interested in discussing this in more detail, feel free to reach out to me by email (t.r.dohmen at uva.nl).

Best, Till

ivbeg commented 2 years ago

@tdoehmen, thanks for sharing! I didn't know about simple-ddl-parser; it could be helpful. Yes, I am considering developing or adapting an existing SQL parser. My goal is a bit different, though: I sometimes have huge SQL dumps that I would like to convert to CSV/JSONL without an RDBMS instance. I am also working on several command-line tools and data engineering projects that involve many SQL, CSV, and other data and schema file types.
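To illustrate the kind of conversion I mean, here is a rough sketch that turns a single `INSERT INTO ... VALUES ...` statement into JSONL lines using only the Python standard library. The function names are made up for this example, and it assumes an explicit column list, single-quoted strings with `''` as the only escape, and no nested parentheses or function calls in the values. Real dumps need far more than this.

```python
import json
import re
from typing import Iterator, List

# Matches the statement header: INSERT INTO <table> (<col>, ...) VALUES
HEADER_RE = re.compile(
    r"\s*INSERT\s+INTO\s+[`\"]?(\w+)[`\"]?\s*\(([^)]*)\)\s*VALUES\s*",
    re.IGNORECASE,
)


def convert(raw: str, was_string: bool) -> object:
    """Map one SQL literal to a Python value (strings, NULL, numbers)."""
    if was_string:
        return raw
    if raw.upper() == "NULL":
        return None
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # Unknown literal: keep the raw text.


def parse_values(text: str) -> List[List[object]]:
    """Split "(1, 'a'), (2, NULL)" into rows of Python values."""
    rows: List[List[object]] = []
    row: List[object] = []
    token: List[str] = []
    in_row = in_string = was_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            if ch == "'" and text[i + 1:i + 2] == "'":
                token.append("'")  # '' is an escaped quote
                i += 1
            elif ch == "'":
                in_string = False
            else:
                token.append(ch)
        elif ch == "'":
            in_string = was_string = True
        elif ch == "(" and not in_row:
            in_row, row, token, was_string = True, [], [], False
        elif ch in ",)" and in_row:
            row.append(convert("".join(token), was_string))
            token, was_string = [], False
            if ch == ")":
                rows.append(row)
                in_row = False
        elif in_row and not ch.isspace():
            token.append(ch)
        i += 1
    return rows


def insert_to_jsonl(statement: str) -> Iterator[str]:
    """Yield one JSON line per row of a single INSERT statement."""
    match = HEADER_RE.match(statement)
    if match is None:
        return
    columns = [c.strip(' `"') for c in match.group(2).split(",")]
    for values in parse_values(statement[match.end():]):
        yield json.dumps(dict(zip(columns, values)))


if __name__ == "__main__":
    stmt = "INSERT INTO users (id, name) VALUES (1, 'Ann'), (2, 'O''Brien');"
    for line in insert_to_jsonl(stmt):
        print(line)
```

A hand-written scanner like this breaks quickly on dialect-specific escapes and multi-statement files, which is why a more robust parser would be useful.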

But universal SQL and CSV parsers are a very interesting topic to me too. I will email you after some tests over the dataset of SQL files.