raylutz / daffodil

Python Daffodil: 2-D data arrays with mixed types, lightweight package, can be faster than Pandas
MIT License

Formalize methodology for converting data types and flattening #10

Closed raylutz closed 4 months ago

raylutz commented 4 months ago

One of the key reasons that reading and writing csv files can take a long time is the need to convert data types.

Reading CSV files without any data type conversion results in str data. A dtypes dict can be used to set the data types after the data is initially read in.

  1. No matter what, string types are created by Python first, so there seems to be little benefit to converting right away versus doing it later.
  2. In some cases, usage of the data may not require corrected data types; it may be better to first select rows into a smaller daf object and then correct the types of that object.
  3. If the dtypes of a daf dataframe have been globally corrected, this state should be recorded in the object so the conversion is not repeated. Any selection of rows from the parent object will inherit the state.
  4. str data types need not be corrected when read from a csv file, as they are already correctly initialized as str.
  5. Any int or float types will need to be converted from str. The Decimal type is not currently supported.
  6. list or dict types should be unflattened during dtypes conversion, if they are str type and start with [ or { respectively.
  7. The bool data type should be represented internally as 0 or 1 and externally as '0' or '1' in csv files; do not use the strings "False" and "True". On reading, however, any spelling should be handled correctly.
  8. Missing data is commonly expressed as a null string '' when converted from csv. Boolean data will convert to 0. For float and int, it is useful to convert '' to a compatible numeric form so the code does not need to constantly check for both string and numeric forms.
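The rules above can be sketched as a single per-cell conversion function. This is a minimal pure-Python illustration, not daffodil's actual implementation; `convert_cell` and its exact handling of missing values are assumptions for the sketch.

```python
import json

def convert_cell(value: str, dtype):
    """Convert one csv cell (always str after import) to the target dtype."""
    # Rule 4: str columns need no conversion after a csv read.
    if dtype is str:
        return value
    # Rule 8: treat the null string '' as missing data.
    if value == '':
        if dtype is bool:
            return 0                      # missing bool -> 0
        if dtype in (int, float):
            return dtype()                # compatible numeric form: 0 or 0.0
        return value
    # Rule 7: bools are 0/1 internally; accept several external spellings.
    if dtype is bool:
        return 1 if value.strip().lower() in ('1', 'true', 't', 'yes') else 0
    # Rule 6: unflatten JSON-encoded list/dict cells.
    if dtype in (list, dict) and value[:1] in '[{':
        return json.loads(value)
    # Rule 5: int and float are parsed from their str form.
    if dtype in (int, float):
        return dtype(value)
    return value
```

A column-level pass would simply apply this to every cell in each non-str column named in the dtypes dict.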

There are currently methods to unflatten and apply_dtypes. It is not necessary to un-apply dtypes: converting to csv automatically converts to the proper forms, except that internally, True and False should use integer forms. To correctly handle objects like dict or list in a given csv cell, that data must be flattened. Unflattening can be an option in apply_dtypes. The daf object should record both unflattened and dtypes_applied states.

A dtypes.py can be created for a given project using Daf. Each type of table can have a 'name_dtype' definition. These definitions are available to the code when the file is imported. It is also commonly useful to include the definitions in a dict keyed by the name of the table.
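For example, a project-level dtypes.py might look like the following. The table and column names here are purely illustrative, not taken from any real project:

```python
# dtypes.py -- one '<name>_dtype' definition per table type (hypothetical).
users_dtype  = {'id': int, 'name': str, 'active': bool, 'tags': list}
orders_dtype = {'id': int, 'user_id': int, 'total': float, 'meta': dict}

# Registry dict keyed by table name, so code can look up dtypes generically.
table_dtypes = {
    'users':  users_dtype,
    'orders': orders_dtype,
}
```

Importing this module makes all definitions available, and the registry lets generic loading code pick the right dtypes by table name.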

Is it worth having a method to register dtypes? Leaning against it.

The other thing that might be useful is defaults for any missing data. These can go in a 'name_defaults' definition, a dict that provides the default value for those columns that have defaults.
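A 'name_defaults' dict could be applied like this. The helper `fill_missing` and the column names are assumptions for the sketch; only columns that actually have defaults appear in the dict:

```python
# Hypothetical defaults definition: only columns with defaults are listed.
users_defaults = {'active': 0, 'tags': []}

def fill_missing(record: dict, defaults: dict) -> dict:
    """Replace null-string '' cells with the column's default, if one exists."""
    return {col: (defaults.get(col, val) if val == '' else val)
            for col, val in record.items()}
```

Columns without an entry in the defaults dict keep their '' value unchanged.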

raylutz commented 4 months ago
  1. When data is read in using load_data(), dtypes is initialized from the passed value, but no conversion of dtypes is performed nor any unflattening. Python CSV import is very fast but does no type conversion.
  2. my_daf.apply_dtypes() will convert to the specified types and skip all str types if initially_all_str_type is True. unflatten is normally True, and all list or dict items will be unflattened if they are JSON.
  3. Before data is written, my_daf.flatten() will JSONify any list or dict items.
  4. The standard is for bool items to be expressed as the integers 0, 1 rather than False, True.
  5. For now, do not push conversions into DB.load_data. Best not to set dtypes_dict at all until after data is returned.
  6. Higher-level functions, like load_table_daf() in chunktable, can use enable_apply_dtypes and unflatten flags to allow a table to be fully read in with nothing converted, if the fields of interest do not need any conversion.
  7. Still need to deal with caching so that a fully converted table can be cached and is not converted again.
  8. Toyed with setting a cols value for apply_dtypes so that only some of the columns are converted. The problem here, and with the prior point, is that it becomes harder to understand which version is cached.
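Points 7 and 8 above amount to tracking conversion state on the object itself. A minimal stand-in (a plain class over a list of row dicts, not a real daf object) could record the flags named earlier in the thread so a cached, fully converted table is never converted twice:

```python
class TableState:
    """Dict-rows stand-in for a daf object; flag names follow the issue."""

    def __init__(self, rows):
        self.rows = rows
        self.dtypes_applied = False   # set once apply_dtypes has run
        self.unflattened = False      # set once JSON cells are expanded

    def apply_dtypes(self, dtypes):
        if self.dtypes_applied:       # already converted: skip (point 7)
            return self
        for row in self.rows:
            for col, typ in dtypes.items():
                if typ is not str:    # str columns need no conversion
                    row[col] = typ(row[col])
        self.dtypes_applied = True
        return self
```

Applying dtypes to only some columns (point 8) would leave this single boolean flag ambiguous, which is exactly the caching problem described above.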
raylutz commented 4 months ago

Used a fully explicit approach. .apply_dtypes() is used after reading a csv file or importing data. The dtypes argument can be used to define all types and columns, or to specify a subset to be cast to the new types specified. from_str, if true (default), will not attempt to coerce str types, as all columns start as str type. unflatten, if true (default), will also unflatten any columns specified as list or dict types.

.flatten() is used before writing files: it will flatten any dict or list types to JSON and will convert bool to int.
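The flatten step described here can be sketched per cell in plain Python (again an illustration, not daffodil's internal code):

```python
import json

def flatten_cell(value):
    """Prepare one cell for csv output: JSONify containers, bool -> int."""
    if isinstance(value, bool):          # check bool first: bool subclasses int
        return int(value)
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    return value
```

Note the bool check must precede any int handling, since `isinstance(True, int)` is True in Python.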

Other methods were deprecated and removed. Documentation and tests updated.

Fixed in https://github.com/raylutz/daffodil/commit/b6c352a59cdc0cc8fb8f58cd2471a9ee79e495a5