hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/

Revert to using arrow tables for `full` valid values grid in `check_tbl_values_required()` #37

Open annakrystalli opened 1 year ago

annakrystalli commented 1 year ago

Previously I used arrow tables to optimise `check_tbl_values_required()`, which can be slow with larger files. Arrow tables seem more memory efficient and generally more performant.

In d1e286133970c557429d4047e45a8b41da86b4d0 I reverted this because I discovered that joins in arrow do not treat NA values as matches (as dplyr does by default), so rows containing NA values were lost during inner joins. See the issue reported here: https://github.com/apache/arrow/issues/14907
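To make the difference concrete, here is a minimal language-neutral sketch in plain Python (not the package's R code). `None` stands in for `NA`. The `inner_join()` helper, its `na_matches` flag, and the column names are hypothetical and only illustrate the two matching rules. It does not reproduce arrow's or dplyr's actual implementation.

```python
# Illustrative only: a plain-Python stand-in for the two join semantics.
# None plays the role of R's NA. inner_join() is a hypothetical helper,
# not a dplyr or arrow function.

def inner_join(left, right, key, na_matches=True):
    """Inner join two lists of dicts on `key`.

    na_matches=True  -> None matches None (like dplyr's default)
    na_matches=False -> None never matches (the behaviour seen with arrow joins)
    """
    joined = []
    for l_row in left:
        for r_row in right:
            l_val, r_val = l_row[key], r_row[key]
            if l_val is None or r_val is None:
                # A missing key only matches another missing key, and only
                # when NA matching is switched on.
                is_match = na_matches and l_val is None and r_val is None
            else:
                is_match = l_val == r_val
            if is_match:
                joined.append({**l_row, **r_row})
    return joined


# Hypothetical data: one row has a missing (NA) join key on both sides.
submitted = [
    {"output_type_id": 0.5, "value": 10},
    {"output_type_id": None, "value": 12},
]
required = [
    {"output_type_id": 0.5, "required": True},
    {"output_type_id": None, "required": True},
]

dplyr_like = inner_join(submitted, required, "output_type_id", na_matches=True)
arrow_like = inner_join(submitted, required, "output_type_id", na_matches=False)

print(len(dplyr_like))  # 2: the NA row is kept
print(len(arrow_like))  # 1: the NA row is silently dropped
```

With NA matching off, the row whose key is missing disappears from the join result without any error, which is the kind of data loss described above.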

Hopefully this will be resolved at some point. Once it is, the changes in d1e286133970c557429d4047e45a8b41da86b4d0 will need reverting to make the function more performant again.

annakrystalli commented 1 year ago

The discussion in arrow has moved to a separate issue: https://github.com/apache/arrow/issues/37902