treasure-data / td-client-python

Treasure Data API library for Python
Apache License 2.0

Add explicit type information for `BulkImport.upload_file` especially for CSV/TSV format. #83

Closed chezou closed 4 years ago

chezou commented 4 years ago

The current CSV reader in td-client-python reads every field as a string and then converts each value by trying int() or float().

This logic converts strings with leading zeros, such as "00011", to 11. It would be nice to keep numerical values with leading zeros as strings, so we need to introduce an explicit type option in the BulkImport.upload_file() function, like pandas dtypes.
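To make the problem concrete, here is a minimal sketch of this kind of type guessing. The helper name `guess_value` and the int-then-float order are assumptions for illustration, not the library's actual code.

```python
def guess_value(s):
    """Hypothetical helper: try int(), then float(), else keep the string."""
    try:
        return int(s)
    except ValueError:
        pass
    try:
        return float(s)
    except ValueError:
        return s


# A row as a CSV reader would hand it over: every field is a string.
row = ["00011", "3.14", "abc"]
converted = [guess_value(v) for v in row]
print(converted)  # [11, 3.14, 'abc'] -- the leading zeros of "00011" are lost
```

Because `int("00011")` succeeds, a guessing approach like this cannot tell an identifier such as a zip code from a real number.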

chezou commented 4 years ago

On the bulk import API, a column's type is determined by the msgpack type of randomly sampled rows. While this option provides explicit type conversion from CSV to msgpack, users must still use update_schema to ensure the schema type after bulk importing.

chezou commented 4 years ago

There could be another option, such as as_is, which respects the CsvReader parse with the dialect.

TibsAtWork commented 4 years ago

PR https://github.com/treasure-data/td-client-python/pull/85 proposes a solution by supporting dtypes and converters arguments, similar to those Pandas uses when reading CSV. The default behaviour remains the same as before.
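As a rough illustration of the idea (not the code in the PR), the sketch below reads CSV text with optional per-column `dtypes` and `converters`. The function `read_csv_rows`, the type names in `TYPE_READERS`, and the rule that a converter wins over a dtype are all assumptions made for this example.

```python
import csv
import io


def guess_value(s):
    """Hypothetical default: try int(), then float(), else keep the string."""
    try:
        return int(s)
    except ValueError:
        pass
    try:
        return float(s)
    except ValueError:
        return s


# Assumed mapping from type names to reader functions.
TYPE_READERS = {"str": str, "int": int, "float": float, "guess": guess_value}


def read_csv_rows(text, dtypes=None, converters=None):
    """Parse CSV text into dicts, applying per-column types where given."""
    dtypes = dtypes or {}
    converters = converters or {}
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for column, raw in record.items():
            if column in converters:
                # Assumption: an explicit converter takes precedence.
                row[column] = converters[column](raw)
            else:
                # Columns without an explicit dtype fall back to guessing.
                row[column] = TYPE_READERS[dtypes.get(column, "guess")](raw)
        rows.append(row)
    return rows


data = "zip,count\n00011,7\n"
default_rows = read_csv_rows(data)                       # zip becomes 11
typed_rows = read_csv_rows(data, dtypes={"zip": "str"})  # zip stays "00011"
```

With no arguments the sketch guesses as before, so existing callers would see no change; only columns named in `dtypes` or `converters` are treated differently.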

chezou commented 4 years ago

Resolved by #85