Closed by raylutz 5 months ago
This paper concludes that Python's csv.DictReader() was the fastest because it does not do any data type coercion. The timing method may also be inaccurate, for example if data is unexpectedly buffered or cached between runs.
Library | Time (seconds) for 1M rows |
---|---|
csv.DictReader | 0.00013113021850585938 |
pd.read_csv | 1.9808268547058105 |
pd.read_csv | 2.1136152744293213 |
dask.dataframe | 0.06910109519958496 |
datatable | 0.13840913772583008 |
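One thing worth checking when reproducing numbers like these: csv.DictReader is a lazy iterator, so constructing it parses no rows. A timing that wraps only the constructor measures almost nothing. The sketch below uses hypothetical in-memory data (not the benchmark from the paper) and times both cases so they can be compared on the same machine.

```python
import csv
import io
import time

# Build a small in-memory CSV (hypothetical data, 100,000 rows).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "value"])
writer.writerows((i, i * 2) for i in range(100_000))
text = buf.getvalue()

# Timing 1: only construct the reader. No rows are parsed yet.
start = time.perf_counter()
reader = csv.DictReader(io.StringIO(text))
construct_secs = time.perf_counter() - start

# Timing 2: construct the reader and consume every row.
start = time.perf_counter()
rows = list(csv.DictReader(io.StringIO(text)))
consume_secs = time.perf_counter() - start

print(f"construct only: {construct_secs:.6f} s")
print(f"read all rows:  {consume_secs:.6f} s")
print(len(rows), rows[0])  # values are still strings
```

Whether this explains the csv.DictReader figure in the table is an assumption. It would need to be checked against the benchmark code itself.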
Current design approach:
Fixed with new apply_dtypes and flatten methods.
Fixed in https://github.com/raylutz/daffodil/commit/b6c352a59cdc0cc8fb8f58cd2471a9ee79e495a5
The Python3 csv module does not support data type coercion: https://docs.python.org/3/library/csv.html
One reason working with CSV files can be slow is that all of the data is serialized as strings. Eventually, values may need to be converted to non-string data types before they are compared or used in calculations. On the other hand, some of the data may never be referenced, in which case coercion can be skipped. However, deferring coercion means that every piece of code that uses a value must remember to apply the correct conversion first.
The decision between lazy coercion (delaying data type conversion until necessary) and immediate conversion (converting all data at once) depends on various factors, including the size of the dataset, memory constraints, and the specific use case.
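The two approaches can be sketched with the standard csv module. This is a minimal illustration only: `DTYPES`, `coerce_all`, and `get_typed` are hypothetical names and are not the daffodil API or its apply_dtypes implementation.

```python
import csv
import io

# Hypothetical column-to-type mapping; not the daffodil API.
DTYPES = {"id": int, "price": float, "name": str}

CSV_TEXT = "id,price,name\n1,9.50,apple\n2,12.25,pear\n"

def read_rows(text):
    """Read CSV text into a list of dicts; every value is a str."""
    return list(csv.DictReader(io.StringIO(text)))

def coerce_all(rows, dtypes):
    """Immediate conversion: convert every column of every row once."""
    return [{k: dtypes.get(k, str)(v) for k, v in row.items()} for row in rows]

def get_typed(row, key, dtypes):
    """Lazy coercion: convert a single value only when it is accessed."""
    return dtypes.get(key, str)(row[key])

raw = read_rows(CSV_TEXT)

# Immediate: pay the conversion cost up front, then use values directly.
typed = coerce_all(raw, DTYPES)
total_immediate = sum(r["price"] for r in typed)

# Lazy: rows stay as strings; each use site must remember to convert.
total_lazy = sum(get_typed(r, "price", DTYPES) for r in raw)

print(total_immediate, total_lazy)
```

In the lazy version the "id" and "name" columns are never converted, which saves work. The cost is that a caller who forgets `get_typed` silently gets a string.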
Here are some considerations for each approach:
Lazy Coercion:
Pros:
Cons:
Immediate Conversion:
Pros:
Cons:
Further Discussion
Immediate conversion must be available because it eliminates many human errors in using the data.
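A few examples of the kind of human error meant here, using plain Python strings as they would come out of a CSV reader. The values are made up for illustration.

```python
# Values as they come out of a CSV reader: all strings.
a, b = "10", "9"

# String comparison is lexicographic, so "10" compares as less than "9".
str_less = a < b
int_less = int(a) < int(b)
print(str_less, int_less)

# Sorting uncoerced values gives a surprising order.
counts = ["9", "10", "100"]
sorted_as_str = sorted(counts)
sorted_as_int = sorted(counts, key=int)
print(sorted_as_str, sorted_as_int)

# "+" concatenates strings instead of adding numbers.
str_sum = a + b
int_sum = int(a) + int(b)
print(str_sum, int_sum)
```

None of these cases raises an exception, so the mistake is easy to miss when data is left as strings.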
Options for faster CSV reading:
The following options need to be reviewed and verified. They may be AI hallucinations.
FastCSV
FastCSV is a Python library designed for efficient and fast CSV reading and writing. It supports type inference and allows specifying the data types of columns during reading, enabling immediate type conversion.
Cython-CSV
Cython-CSV is a fast CSV parsing library for Python based on Cython. It provides type inference and supports specifying data types for columns during reading, enabling immediate type conversion.
TurboCSV
TurboCSV is another Python library designed for fast CSV parsing with type inference and immediate type conversion capabilities. It aims to be faster than traditional CSV parsers like Pandas.
CleverCSV
CleverCSV claims a novel, more accurate approach to detecting the dialect of an unfamiliar CSV file. This parser is more about getting it right than doing it fast.
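For context on what "guessing the dialect" means, here is a minimal sketch using the standard library's own guesser, csv.Sniffer. This is not CleverCSV's algorithm or API; it is only the baseline heuristic that ships with Python, run on a made-up sample.

```python
import csv
import io

# Hypothetical sample of a semicolon-delimited file.
sample = "name;city;amount\nAnn;Oslo;10\nBob;Rome;20\nCy;Bonn;30\n"

# Ask the standard-library sniffer to guess the dialect,
# restricting the candidates to a few common delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
print(repr(dialect.delimiter))

# Parse the sample with the detected dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[0])
```

Dialect detection is separate from data type coercion: the rows returned here are still lists of strings.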
https://gertjanvandenburg.com/papers/VandenBurg_Nazabal_Sutton_-_Wrangling_Messy_CSV_Files_by_Detecting_Row_and_Type_Patterns_2019.pdf
@article{van2019wrangling, title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns}, author = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.}, journal = {Data Mining and Knowledge Discovery}, year = {2019}, volume = {33}, number = {6}, pages = {1799--1820}, issn = {1573-756X}, doi = {10.1007/s10618-019-00646-y}, }
This paper boils down to this paragraph: