raylutz / daffodil

Python Daffodil: 2-D data arrays with mixed types, lightweight package, can be faster than Pandas
MIT License

Type coercion, unflattening, serialization, and flattening. #5

Closed: raylutz closed this issue 5 months ago

raylutz commented 7 months ago

The Python3 csv module does not support data type coercion: https://docs.python.org/3/library/csv.html

Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).
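A minimal, self-contained sketch of the stdlib behavior described in that quote; inline data is used so the example can be run as-is.

import csv
import io

csv_text = 'col1,col2\n1,"abc"\n2.5,"def"\n'

# Default behavior: every field comes back as a str.
rows = list(csv.reader(io.StringIO(csv_text)))
# [['col1', 'col2'], ['1', 'abc'], ['2.5', 'def']]

# QUOTE_NONNUMERIC: unquoted fields are converted to float on read.
# The header row is excluded here because float('col1') would raise ValueError.
rows = list(csv.reader(io.StringIO('1,"abc"\n2.5,"def"\n'), quoting=csv.QUOTE_NONNUMERIC))
# [[1.0, 'abc'], [2.5, 'def']]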

One of the reasons working with CSV files can be slow is that the data has been serialized to strings throughout. Eventually, some of the data may need to be interpreted as non-string data types before it is compared or used in calculations. On the other hand, some of the data may never be referenced, and coercion of those values can be skipped entirely. However, deferring conversion means the logic for coercing data types must be applied correctly at every point where the data is used.

The decision between lazy coercion (delaying data type conversion until necessary) and immediate conversion (converting all data at once) depends on various factors, including the size of the dataset, memory constraints, and the specific use case.

Here are some considerations for each approach:

Lazy Coercion:

Pros:

  - Values that are never referenced are never converted, saving conversion time and memory.
  - The initial read is fast, since rows can be kept as the strings the csv module returns.

Cons:

  - The coercion logic must be applied correctly at every point of use, which invites data type errors.
  - Mistakes can be silent, since operations such as + are defined for strings as well as numbers.

Immediate Conversion:

Pros:

  - The data is in its proper types before any comparisons or calculations, eliminating many human errors.
  - Type problems surface once, at load time, rather than scattered throughout later code.

Cons:

  - Every field is converted, including data that may never be referenced.
  - Load time and memory use grow with the size of the dataset.
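To make the tradeoff concrete, here is a minimal sketch of the two approaches; the sample rows, col_types list, and helper functions are illustrative only and are not part of Daffodil's API.

rows = [['3', '2.5', 'x'], ['', '1.0', 'y']]       # string fields as returned by the csv module
col_types = [int, float, str]

# Immediate conversion: coerce every cell once, right after reading.
# '' null strings are mapped to None rather than coerced.
def convert_all(rows, col_types):
    return [[(t(v) if v != '' else None) for v, t in zip(row, col_types)]
            for row in rows]

converted = convert_all(rows, col_types)            # [[3, 2.5, 'x'], [None, 1.0, 'y']]

# Lazy coercion: keep the strings and convert only the cells actually referenced.
def cell(rows, r, c, col_types):
    v = rows[r][c]
    return col_types[c](v) if v != '' else None

total = sum(cell(rows, r, 0, col_types) or 0 for r in range(len(rows)))     # 3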

Further Discussion

  1. If lazy coercion is used, it is very easy to make data type errors. Thus, for conversion of base types (str, int, float), it is best to do the conversion immediately. Otherwise, it is easy to make mistakes later, particularly since addition is defined for both strings and numbers, e.g. '2.0' + '1.4' works without error but produces '2.01.4'.
  2. If the type is str, there may be further unflattening from JSON or a similar format to create list or dict objects. That can happen as an additional step after the basic conversion is completed (see the sketch after this list).
  3. If the data is read from the cloud, it will need to be read as a block into memory. Local file access can benefit from interleaving disk access with conversion.

Immediate conversion must be available because it eliminates many human errors in using the data.
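A minimal sketch of the unflattening and flattening steps described in item 2 above, using only the standard library; the row layout and the 'prefs' field are illustrative and are not Daffodil's API.

import json

row = {'id': '7', 'name': 'Alice', 'prefs': '{"color": "blue", "sizes": [1, 2]}'}

# Basic conversion of base types happens first.
row['id'] = int(row['id'])

# Unflatten: parse a JSON-encoded str field into a dict (or list) object.
row['prefs'] = json.loads(row['prefs'])        # {'color': 'blue', 'sizes': [1, 2]}

# Flatten: serialize the object back to a JSON string before writing to CSV.
row['prefs'] = json.dumps(row['prefs'])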

Options for faster CSV reading:

The following options need to be reviewed and verified; these may be AI hallucinations.

FastCSV

FastCSV is a Python library designed for efficient and fast CSV reading and writing. It supports type inference and allows specifying the data types of columns during reading, enabling immediate type conversion.

import fastcsv

# Define column data types
column_types = {'col1': int, 'col2': float, 'col3': str}

# Read CSV with specified data types
data = fastcsv.read_csv('data.csv', types=column_types)

Cython-CSV

Cython-CSV is a fast CSV parsing library for Python based on Cython. It provides type inference and supports specifying data types for columns during reading, enabling immediate type conversion.

from csvparser import CSVParser

# Define column data types
column_types = [int, float, str]

# Read CSV with specified data types
with open('data.csv', 'r') as file:
    parser = CSVParser(file, types=column_types)
    data = parser.read()

TurboCSV

TurboCSV is another Python library designed for fast CSV parsing with type inference and immediate type conversion capabilities. It aims to be faster than traditional CSV parsers like Pandas.

from turbocsv import TurboCSV

# Define column data types
column_types = {'col1': int, 'col2': float, 'col3': str}

# Read CSV with specified data types
with open('data.csv', 'r') as file:
    data = TurboCSV(file, types=column_types).read()

CleverCSV

Claims to have a novel and better approach to guessing the dialect of a foreign CSV file. This parser is more about getting it right than doing it fast.

https://gertjanvandenburg.com/papers/VandenBurg_Nazabal_Sutton_-_Wrangling_Messy_CSV_Files_by_Detecting_Row_and_Type_Patterns_2019.pdf

@article{van2019wrangling,
  title   = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
  author  = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
  journal = {Data Mining and Knowledge Discovery},
  year    = {2019},
  volume  = {33},
  number  = {6},
  pages   = {1799--1820},
  issn    = {1573-756X},
  doi     = {10.1007/s10618-019-00646-y},
}

This paper boils down to this paragraph:

The code we use for our CSV parser borrows heavily from the CSV parser in the Python standard library, but differs in a few small but significant ways. First, our parser only interprets the escape character if it precedes the delimiter, quote character, or itself. In any other case the escape character serves no purpose and is treated as any other character and is not dropped. Second, our parser only strips quotes from cells if they surround the entire cell, not if they occur within cells. This makes the parser more robust against misspecified quote characters. Finally, when we are in a quoted cell we automatically detect double quoting by looking ahead whenever we detect a quote, and checking if the next character is also a quote character. This enables us to drop double quoting from our dialect and only marginally affects the complexity of the code.
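CleverCSV presents itself as a drop-in replacement for the standard csv module, so a hedged sketch of using it here might look like the following; this assumes the Sniffer/reader interface mirrors the stdlib, data.csv is a placeholder, and fields still come back as strings, so Daffodil-style type coercion remains a separate step.

import clevercsv

# Detect the dialect of an unknown ("foreign") CSV file from a sample of its content.
with open('data.csv', newline='') as f:
    dialect = clevercsv.Sniffer().sniff(f.read(65536))

# Read the file using the detected dialect; every field is still a str.
with open('data.csv', newline='') as f:
    rows = list(clevercsv.reader(f, dialect))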
raylutz commented 7 months ago

https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d

This article concludes that Python csv.DictReader() was the fastest, because it does not do any data type coercion. Also, the method of timing may not be accurate if there is unexpected buffering of data between runs, etc.

library           time (secs) for 1 M rows
csv.DictReader    0.00013113021850585938
pd.read_csv       1.9808268547058105
pd.read_csv       2.1136152744293213
dask.dataframe    0.06910109519958496
datatable         0.13840913772583008
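One possible source of that inaccuracy: csv.DictReader returns a lazy iterator, so a benchmark that only constructs the reader without consuming the rows measures almost nothing, which would explain the near-zero time above. A minimal sketch of the difference (data.csv is a placeholder file):

import csv
import time

t0 = time.perf_counter()
with open('data.csv', newline='') as f:
    reader = csv.DictReader(f)           # constructing the reader reads essentially nothing
t_construct = time.perf_counter() - t0

t0 = time.perf_counter()
with open('data.csv', newline='') as f:
    rows = list(csv.DictReader(f))       # materializing the rows is where the real cost is
t_read = time.perf_counter() - t0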
raylutz commented 7 months ago

Current design approach:

  1. Use the csv module for initial conversion of the csv file into string fields. This deals intelligently with:
     a. dialect detection
     b. newlines in fields
     c. quoting and separator options
  2. Allow for immediate but explicit conversion of non-string fields (see the sketch below).
  3. Detect unusual data types when performing calculations, particularly to be able to skip '' null-string values.
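A minimal sketch of that approach using only the standard library; the dtypes mapping and the rule of skipping '' values are illustrative, while Daffodil's own method for the explicit conversion step is apply_dtypes, noted in the next comment.

import csv
import io

csv_text = 'id,score,name\n1,2.5,a\n2,,b\n'
dtypes = {'id': int, 'score': float, 'name': str}

# Step 1: let the csv module handle dialect, quoting, and embedded newlines;
# every field comes back as a str.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Step 2: immediate but explicit conversion of the non-string fields,
# skipping '' null-string values (per step 3 of the design approach).
for row in rows:
    for col, typ in dtypes.items():
        if typ is not str and row[col] != '':
            row[col] = typ(row[col])

# rows == [{'id': 1, 'score': 2.5, 'name': 'a'}, {'id': 2, 'score': '', 'name': 'b'}]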
raylutz commented 5 months ago

Fixed with new apply_dtypes and flatten methods.

Fixed in https://github.com/raylutz/daffodil/commit/b6c352a59cdc0cc8fb8f58cd2471a9ee79e495a5