Closed otsaloma closed 2 years ago
0.28 allows reading only selected columns from file and setting dtypes upon reading, thus removing unavoidable intermediate string representation, maybe that'll do. Some NumPy functions don't deal well with objects, so it's something to generally avoid.
When reading in data with arbitrarily long strings,
dataiter.DataFrame
will use a huge amount of memory due to the 4 byte length NumPy Unicode strings and the fixed length representation (all elements allocated to match the longest one). We might need to either (1) use object conditionally when a column has long strings or (2) always use object, if possible, or (3) use bytes instead via a custom dtype.