jqnatividad / qsv

Blazing-fast Data-Wrangling toolkit
https://qsv.dathere.com
The Unlicense
2.52k stars 71 forks

CSV dialect detection: implementation without third party libraries #2247

Open ws-garcia opened 3 weeks ago

ws-garcia commented 3 weeks ago

Discussed in https://github.com/jqnatividad/qsv/discussions/2246

Originally posted by **ws-garcia** October 25, 2024

## Problem overview

Currently, this project does not have a stable way to detect the configuration (dialect) of a CSV file. An example of this is raised in #1719, where the utility fails to detect the configuration of the given files.

## Details

At the moment, @jqnatividad has begun digging into the problem, suggesting:

> Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?

He also pointed out:

> Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!

The work planned so far is outlined in https://github.com/jqnatividad/qsv-sniffer/issues/14. Currently, all tasks are under study and none is completed.

## New path

Here I discuss a new approach to implementing dialect detection in qsv using simple building blocks:

- __Regexes__: determine field data types.
- __Currently implemented parser__: load the data.
- __Table Uniformity measure__: detect the table with the best structure.

With this approach, dialect detection is as reliable as CleverCSV's and can obtain [results with greater certainty](https://content.iospress.com/articles/data-science/ds240062). The process is as follows:

- In the first phase, __potential dialects__ are built from combinations of field/column separator, quotation mark, and record delimiter characters. At this stage the user can provide a custom delimiter list, which gives the tool a degree of flexibility.
- With each potential dialect, we attempt to parse the CSV file and use the data to __construct a temporary table__.
- The table is scored using the __Table Uniformity measure__. Each score is saved in a collection, using the dialect as the key.
- The dialect that produces the __table with the highest score__ is then selected.

A Python implementation of this exact approach is available in a [GitHub repository](https://github.com/ws-garcia/CSVsniffer/tree/main/python/src).

The evaluation of these methods gives:

| Tool          | F1 score |
|:--------------|:---------|
| `CSVsniffer`  | 0.9260   |
| `CleverCSV`   | 0.8425   |
| `csv.Sniffer` | 0.8049   |

This highlights one point: the presented approach clearly outperforms both `csv.Sniffer` and `CleverCSV` on the research datasets. I hope this can help this wonderful project!

Edit:

A code snippet will be presented in the discussion.
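For readers who want a feel for the four steps described in the issue body, here is a minimal Python sketch. It is not the CSVsniffer code and it does not implement the paper's Table Uniformity formula: `score_table` is a simplified stand-in (row-width consistency weighted by the share of cells matching a few regexes), and the names `TYPE_PATTERNS`, `known_type`, `score_table`, and `detect_dialect` are hypothetical. It also varies only the field delimiter and quote character, leaving record-delimiter handling to Python's `csv` module.

```python
import csv
import io
import re
from collections import Counter
from itertools import product

# Hypothetical, deliberately small set of type regexes (assumption, not the paper's list).
TYPE_PATTERNS = [
    re.compile(r"^[+-]?\d+$"),           # integer
    re.compile(r"^[+-]?\d*\.\d+$"),      # decimal
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),  # ISO date
]


def known_type(field):
    """Return True if the field matches one of the type regexes."""
    value = field.strip()
    return any(p.match(value) for p in TYPE_PATTERNS)


def score_table(rows):
    """Stand-in uniformity score in [0, 1]; NOT the paper's measure."""
    rows = [r for r in rows if r]  # ignore blank records
    if not rows:
        return 0.0
    widths = Counter(len(r) for r in rows)
    modal_width, modal_count = widths.most_common(1)[0]
    if modal_width < 2:
        return 0.0  # the delimiter never split anything
    consistency = modal_count / len(rows)
    cells = [f for r in rows for f in r]
    typed = sum(known_type(f) for f in cells) / len(cells)
    return consistency * (0.5 + 0.5 * typed)


def detect_dialect(text, delimiters=",;\t|", quotechars="\"'"):
    """Score every (delimiter, quotechar) candidate and return the best one."""
    scores = {}
    for delimiter, quotechar in product(delimiters, quotechars):
        reader = csv.reader(
            io.StringIO(text, newline=""),
            delimiter=delimiter,
            quotechar=quotechar,
        )
        try:
            rows = list(reader)  # the temporary table for this candidate
        except csv.Error:
            continue  # this candidate cannot parse the input
        scores[(delimiter, quotechar)] = score_table(rows)
    if not scores:
        raise ValueError("no candidate dialect could parse the input")
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    sample = 'id;name;price\n1;"Widget; large";9.99\n2;Gadget;19.50\n'
    best, scores = detect_dialect(sample)
    print("best candidate:", best)
```

The `delimiters` parameter plays the role of the user-supplied delimiter list mentioned in the first step, and the `scores` dictionary is the collection keyed by dialect from the third step. A real port would swap `score_table` for the measure defined in the paper.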

jqnatividad commented 3 weeks ago

Thanks @ws-garcia !

This is very timely as I was dreading taking on the csv-sniffer python port, thus the lack of activity.

Your step-by-step "new path" breakdown is certainly easier to digest than the paper :)

Will be sure to loop you in as we mark progress...

ws-garcia commented 3 weeks ago

You only need the paper for implementing some of the logic, in case porting the Python code gets confusing. So, treat the research as a backup reference while you dive into the implementation.

jqnatividad commented 2 weeks ago

Hi @ws-garcia, just wanted to let you know that I'm thinking of implementing your paper as a Rust library. CSV dialect detection is broadly useful, so other developers may want to use your algorithm, and qsv itself is a command-line utility.

As the name csv-sniffer is already used by the apparently unmaintained crate, I'm thinking of naming it csv-garciasniffer. :smile:

I will deprecate qsv-sniffer, the existing fork of csv-sniffer, and use the new csv-garciasniffer crate once it's implemented.

Thoughts?

ws-garcia commented 2 weeks ago

Hey @jqnatividad, I am honored that you have the idea of adding my name to the library. But there is a name that would sound great and promote the amazing product that is qsv: csv-qsniffer.

I continue to think that adding a high-precision dialect detector to qsv would be a great milestone for the project. So, go ahead with the library and its implementation!

jqnatividad commented 2 weeks ago

Great! 🎉 csv-qsniffer it is then! 🥳

Will keep you posted as we mark progress on implementing the library and integrating it into qsv and qsv pro.

ws-garcia commented 2 weeks ago

The research paper's methodology will soon be published as Open Access under the Creative Commons Attribution License (CC BY 4.0). You only need to give attribution ©️. Let's make qsv as infallible as possible!