jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.07k stars 220 forks source link

Sampling tests for parquet round trips #1519

Closed AnIrishDuck closed 1 year ago

AnIrishDuck commented 1 year ago

This uses sample-test, and sample-arrow2, which we built specifically for this purpose. See the README for why we felt quickcheck and proptest were unsuitable.

I'm not sure if we'd rather have the libraries be optional [dependencies] instead of [dev-dependencies] (which cannot be optional). I figured this should be behind a feature flag though.

When run exhaustively (see the commented TODO lines), this appears to unearth more errors in the parquet IO code. Issues appear to trigger with nesting and nullable fields in combination. Some examples:

"assertion failed: `(left == right)`\n 
left: `[Chunk { arrays: [ListArray[[{wttdx: -3458878700, yjsyrg: [None]}]]] }]`,\n
right: `[Chunk { arrays: [ListArray[[{wttdx: -3458878700, yjsyrg: None}]]] }]`"
"assertion failed: `(left == right)`\n  
left: `[Chunk { arrays: [ListArray[[[None]]]] }]`,\n 
right: `[Chunk { arrays: [ListArray[[None]]] }]`"

My prior experience with the def/rep level encoder obviously leads me to suspect that code. I know it was recently rewritten, but it's a very complex subject and I'm not shocked that it may need more work.

Let me know how I can help assist. In particular, the shrinking behavior in sample-arrow2 is suboptimal due to chained resampling and some implementation hacks that can probably be improved. I can definitely assist if you're playing around with this and are having trouble shrinking back to useful exemplars.

Setting the chunk length to a small value appears to be generating good counterexamples for now.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage has no change and project coverage change: -0.01 :warning:

Comparison is base (8ee5ad8) 83.51% compared to head (1e624c0) 83.50%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1519 +/- ## ========================================== - Coverage 83.51% 83.50% -0.01% ========================================== Files 388 388 Lines 42024 42024 ========================================== - Hits 35096 35093 -3 - Misses 6928 6931 +3 ``` [see 5 files with indirect coverage changes](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1519/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

ritchie46 commented 1 year ago

it's a very complex subject and I'm not shocked that it may need more work.

Me neither. I don't understand this well enough to fix atm, so any help is great. I am working on a separate parquet implementation, but progress is slow.