otsaloma / dataiter

Python classes for data manipulation
https://dataiter.readthedocs.io/
MIT License
25 stars 0 forks source link

Represent strings as objects? #11

Closed otsaloma closed 2 years ago

otsaloma commented 2 years ago

When reading in data with arbitrarily long strings, dataiter.DataFrame will use a huge amount of memory due to the 4 byte length NumPy Unicode strings and the fixed length representation (all elements allocated to match the longest one). We might need to either (1) use object conditionally when a column has long strings or (2) always use object, if possible, or (3) use bytes instead via a custom dtype.

otsaloma commented 2 years ago

0.28 allows reading only selected columns from file and setting dtypes upon reading, thus removing unavoidable intermediate string representation, maybe that'll do. Some NumPy functions don't deal well with objects, so it's something to generally avoid.