static-frame / arraykit

Python C Extensions for StaticFrame

`delimited_to_arrays` improvements #70

Closed flexatone closed 2 years ago

flexatone commented 2 years ago

Notes from SF integration.

  1. If file_like is a function, not an iterable, we segfault.
  2. We might support reading in only a certain number of fields per CPL, as well as skipping a leading number of fields; this will permit better column type evaluation and, with a second pass, permit extracting the apex values. Skipping leading fields is needed as we are already seeing cases where the apex values are affecting type evaluation.
  3. Strings of numerals with '-' in non-leading positions are being interpreted as ints in type parsing; we must require that '-' appear only in the leading position.
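
The rule in item 3 can be sketched in pure Python. This is only an illustration of the intended behavior, not the C implementation in AK; `looks_like_int` is a hypothetical name, and accepting a leading '+' and stripping surrounding whitespace are assumptions.

```python
import re

# Assumption: an optional sign is accepted only as the first character,
# followed by one or more ASCII digits.
_INT_PATTERN = re.compile(r'[+-]?[0-9]+')

def looks_like_int(field: str) -> bool:
    """Return True if the stripped field is digits with an optional leading sign."""
    return _INT_PATTERN.fullmatch(field.strip()) is not None

# Only the first two fields should be treated as int candidates.
for field in ('-42', '42', '4-2', '42-', '2021-01-05'):
    print(field, looks_like_int(field))
```

With this rule, values such as '4-2' or date-like strings fall through to other type candidates instead of being evaluated as ints.
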

flexatone commented 2 years ago

Parsing columns as axis-0 CPLs is a little tricky, as the first part of each line (a row) might have "apex" values that should not be considered in type evaluation, and type evaluation is critical for columns as they are not provided by parameter.
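
A rough Python sketch of why the apex values matter, assuming a header line with two leading apex fields. `split_apex` and `all_int` are hypothetical helpers, and `all_int` is only a crude stand-in for AK's type evaluation.

```python
import csv

def split_apex(line: str, skip: int, delimiter: str = ',') -> tuple:
    """Split one delimited line into (apex fields, column-label fields)."""
    fields = next(csv.reader([line], delimiter=delimiter))
    return fields[:skip], fields[skip:]

def all_int(fields: list) -> bool:
    """Crude stand-in for type evaluation: True if every field parses as an int."""
    try:
        for field in fields:
            int(field)
    except ValueError:
        return False
    return True

line = 'region,year,2019,2020,2021'
apex, labels = split_apex(line, skip=2)

# Evaluating the whole line lets the apex values decide the type.
print(all_int(apex + labels))
# Evaluating only the fields after the apex values gives the column labels' type.
print(all_int(labels))
```

In this example the full line is not all-int because of 'region' and 'year', while the labels alone are, which is the case the skip-leading option is meant to handle.
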

Using split to process the string records, then using iterable_str_to_array_1d, is sub-optimal, as we lose all the quoting functionality that is in delimited_to_arrays. We also create a number of unneeded string objects.
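
To illustrate the quoting problem, here is a small comparison using only the standard library. The `csv` module is used purely as a stand-in for a quote-aware parser; it is not how delimited_to_arrays is implemented.

```python
import csv

# One record whose second field contains the delimiter inside quotes.
record = 'a,"b,c",d'

# Plain split ignores quoting and breaks the quoted field in two.
naive = record.split(',')

# A quote-aware parser keeps the quoted field intact and drops the quotes.
quoted = next(csv.reader([record]))

print(naive)   # ['a', '"b', 'c"', 'd']
print(quoted)  # ['a', 'b,c', 'd']
```
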

Using delimited_to_arrays with dtype str and then converting to a list is also very wasteful, particularly if we have to go from fixed-width unicode arrays to Python strings and then back to our final array.

The best option seems to be to create a new AK function, str_to_array_1d, that can take a sliced but not split string, and reuse all the normal quote processing and configuration of delimited_to_arrays.