paleolimbot / nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
4 stars 0 forks source link

Draft `struct SchemaView` + using it for schema validation #5

Closed paleolimbot closed 2 years ago

paleolimbot commented 2 years ago

This is a PR drafting the design for a data structure that holds information about a data type as communicated by a struct ArrowSchema. The structure, tentatively called a struct ArrowSchemaView, can be stack or statically allocated. Usage could be something like:

struct ArrowSchema schema;
struct ArrowSchemaView schema_view;
struct ArrowError error;

arrow::ExportType(*arrow::duration(arrow::TimeUnit::SECOND), &schema);

int result = ArrowSchemaViewInit(&schema_view, &schema, &error)
if (result != ARROWC_OK) {
  // do something with ArrowErrorMessage(error)
}

// do something with
// schema_view.n_buffers
// schema_view.data_type
// schema_view.storage_data_type
// schema_view.time_unit

Some thoughts:

I'm not sure how to handle extension types...it may have to be separate from the data_type/storage_data_type since in theory one could have an extension type with a date/time storage type (and it seems confusing to have three levels of data type). It seems within scope to have an extension_name member (that is a struct ArrowStringView).

There is some obvious repetition in the implementation and tests...I was going for completeness rather than proper abstraction to make sure I was able to catch all the fields that were needed.

One could also abandon the stack-allocatability route and have heap allocated fields and require the caller to do something like schema_view.release(&schema_view) to clean up...I didn't think there were too many fields such that the struct is unwieldy and not having to manage that memory is a helpful interface perk.

One could also provide field getter functions rather than advertising the use of struct fields directly. That seems like it might be easier to maintain backwards compatibility if adding some verbosity to the interface.

codecov-commenter commented 2 years ago

Codecov Report

Merging #5 (c03bbcd) into main (64cf5b5) will increase coverage by 15.87%. The diff coverage is 97.35%.

@@             Coverage Diff             @@
##             main       #5       +/-   ##
===========================================
+ Coverage   78.19%   94.07%   +15.87%     
===========================================
  Files           2        4        +2     
  Lines         133      624      +491     
  Branches        7       16        +9     
===========================================
+ Hits          104      587      +483     
- Misses         22       27        +5     
- Partials        7       10        +3     
Impacted Files Coverage Δ
src/arrowc/schema.c 73.98% <75.00%> (-3.01%) :arrow_down:
src/arrowc/schema_view.c 98.86% <98.86%> (ø)
src/arrowc/metadata.c 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 64cf5b5...c03bbcd. Read the comment docs.

pitrou commented 2 years ago

API looks generally good IMHO. Seems to miss some metadata iteration utility, though. Also, I don't understand why the PR title mentions schema validation. I don't see a validation method besides syntax validation of the format?

paleolimbot commented 2 years ago

I don't see a validation method

I added this in and renamed the methods so that we have parse == walk the format string extracting information (or erroring when this fails) and validate == check anything else. ArrowSchemaViewInit() does both.

How are list, struct, etc. types handled here, since that's effectively the same issue?

ArrowSchemaViewInit() now walks the children and dictionary if they exist and does the validation step, although maybe they shouldn't (the caller will probably have to do this anyway).

I implemented enough metadata utilities to have a struct ArrowStringView for the extension_name/metadata. I wasn't aware that you could have a storage type that is an extension type in the C Data interface, so I don't know how that would be handled.