raylutz closed 4 months ago
Used a fully explicit approach:
load_data(): dtypes is initialized from the passed value, but no dtype conversion or unflattening is performed. Python's CSV import is very fast but does no type conversion.
.apply_dtypes(): used after reading a CSV file or importing data. The dtypes argument can define all columns and their types, or specify only the subset of columns to be cast to new types. from_str, if True (default), will not attempt to coerce str types, since all columns start as str. unflatten is normally True (default), and all columns specified as list or dict types will be unflattened, if they are JSON.
.flatten(): used before writing files. Flattens any dict or list types to JSON and converts bool to int.
Other methods were deprecated and removed. Documentation and tests updated.
Fixed in https://github.com/raylutz/daffodil/commit/b6c352a59cdc0cc8fb8f58cd2471a9ee79e495a5
One of the key reasons reading and writing CSV files can be slow is the need to convert data.
Reading CSV files without any data type conversion results in str data. A dtypes dict can be used to set the data types after the data is initially read in.
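To illustrate the point, here is a minimal sketch using only the standard library, not Daffodil's own API. The sample data, the dtypes dict, and the helper apply_dtypes_to_rows() are all hypothetical names made up for this example.

```python
import csv
import io

# Hypothetical sample data and dtypes dict; not taken from the issue.
csv_text = "id,price,label\n1,2.5,apple\n2,4.0,pear\n"
dtypes = {"id": int, "price": float}  # 'label' is not listed and stays str


def apply_dtypes_to_rows(rows, dtypes):
    """Cast the columns named in dtypes; leave all other columns unchanged."""
    return [
        {col: dtypes[col](val) if col in dtypes else val for col, val in row.items()}
        for row in rows
    ]


# Reading is fast because no conversion happens: every value is a str.
raw_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Conversion is a separate, explicit step after the read.
typed_rows = apply_dtypes_to_rows(raw_rows, dtypes)
```

Keeping the cast as a separate step means the cost of conversion is only paid when typed values are actually needed.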
There are currently methods to unflatten and apply_dtypes. It is not necessary to un-apply dtypes: converting to CSV automatically converts values to the proper string forms, except that internally True and False should be written in integer form. To correctly handle objects like dict or list in a given CSV cell, that data must be flattened. Unflattening can be an option in apply_dtypes. The daf object should record both the unflattened and dtypes_applied states.
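The flatten and unflatten round trip described here could look roughly like the sketch below. It assumes JSON as the flattened form and uses plain dict rows; flatten_row() and unflatten_row() are hypothetical helpers, not the library's methods.

```python
import json


def flatten_row(row):
    """Prepare one row for CSV output: dict/list become JSON, bool becomes int."""
    out = {}
    for col, val in row.items():
        if isinstance(val, (dict, list)):
            out[col] = json.dumps(val)
        elif isinstance(val, bool):
            out[col] = int(val)
        else:
            out[col] = val
    return out


def unflatten_row(row, dtypes):
    """Parse JSON strings back into objects for columns typed as dict or list."""
    out = dict(row)
    for col, typ in dtypes.items():
        if typ in (dict, list) and isinstance(out.get(col), str):
            out[col] = json.loads(out[col])
    return out


# Hypothetical row, for illustration only.
row = {"name": "a", "tags": ["x", "y"], "meta": {"k": 1}, "active": True}
flat = flatten_row(row)
restored = unflatten_row(flat, {"tags": list, "meta": dict})
```

Only the columns declared as list or dict are parsed on the way back in, so an ordinary string that happens to look like JSON is left alone.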
A dtypes.py module can be created for a given project using Daf. Each type of table can have a 'name_dtype' definition. These definitions are available to the code when the module is imported. It is also commonly useful to collect the definitions in a dict keyed by the name of the table.
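A project-level dtypes.py along these lines might look like the following sketch. The table names, column names, dtypes_by_table, and get_dtypes() are all hypothetical and only show the shape of the idea.

```python
# dtypes.py (hypothetical project module)

# One 'name_dtype' definition per type of table. Table and column names
# here are invented for illustration.
orders_dtype = {"order_id": int, "customer": str, "items": list, "paid": bool}
customers_dtype = {"customer": str, "address": dict, "active": bool}

# The same definitions collected in a dict keyed by table name, so code can
# look up the dtypes for a table whose name is only known at run time.
dtypes_by_table = {
    "orders": orders_dtype,
    "customers": customers_dtype,
}


def get_dtypes(table_name):
    """Return the dtypes dict for a table; raises KeyError if it is unknown."""
    return dtypes_by_table[table_name]
```

Other modules could then use `from dtypes import orders_dtype` directly, or call `get_dtypes("orders")` when the table name is a variable.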
Is it worth having a method to register dtypes? Leaning against it.
The other thing that might be useful is defaults for any missing data. These can go in a 'name_defaults' definition: a dict that provides the default value for each column that has one.
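One possible shape for such a defaults definition is sketched below. It assumes that a missing value shows up as an absent key, None, or an empty string (as it would after a CSV read); orders_defaults and apply_defaults() are hypothetical names.

```python
# Hypothetical 'name_defaults' definition: only columns that have a default
# are listed; other columns are left untouched.
orders_defaults = {"qty": 0, "notes": "n/a"}


def apply_defaults(row, defaults):
    """Fill in defaults where a value is absent, None, or an empty string."""
    out = dict(row)
    for col, default in defaults.items():
        val = out.get(col)
        if val is None or val == "":
            out[col] = default
    return out


# Hypothetical rows, for illustration only.
filled = apply_defaults({"order_id": 7, "qty": "", "notes": None}, orders_defaults)
kept = apply_defaults({"order_id": 8, "qty": 3, "notes": "rush"}, orders_defaults)
```

If a default is a mutable object such as a list or dict, it would be safer to copy it per row so that rows do not share one instance.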