jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.07k stars 220 forks source link

Fix list array parsing in pandas record json #1511

Closed AnIrishDuck closed 1 year ago

AnIrishDuck commented 1 year ago

The double-list construction was a mistake that prevents parsing of several valid types of pandas record json constructs.

Fix that by removing the nested lists and add a test verifying that it works.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01 :tada:

Comparison is base (3ab9b61) 83.45% compared to head (b8a4882) 83.46%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1511 +/- ## ========================================== + Coverage 83.45% 83.46% +0.01% ========================================== Files 388 388 Lines 42010 42003 -7 ========================================== - Hits 35059 35058 -1 + Misses 6951 6945 -6 ``` | [Impacted Files](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1511?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | | |---|---|---| | [src/io/json/read/deserialize.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1511?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2lvL2pzb24vcmVhZC9kZXNlcmlhbGl6ZS5ycw==) | `74.35% <100.00%> (+0.31%)` | :arrow_up: | | [src/io/json/read/infer\_schema.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1511?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2lvL2pzb24vcmVhZC9pbmZlcl9zY2hlbWEucnM=) | `92.97% <100.00%> (-0.19%)` | :arrow_down: | ... and [7 files with indirect coverage changes](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1511/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

ritchie46 commented 1 year ago

Thanks for the PR @AnIrishDuck. What was the behavior we had first that was flawed? For my understanding of the changes here.

AnIrishDuck commented 1 year ago

When trying to decode pandas-record json with list arrays, it would expect them to be nested within another list array. In other words, deserialization of ListArrays for pandas record JSON was pretty much generally broken.

We missed this on implementation because we were using FixedSizeList in most places internally.

AnIrishDuck commented 1 year ago

To elaborate (after looking at the tests), we only tested that nested list arrays worked, which hid the underlying issue (as the outer level of nesting that infer_schema would generate would still be properly handled by allocate_array)

ritchie46 commented 1 year ago

Right, Thanks!