mgirlich / tibblify

Rectangle Nested Lists
https://mgirlich.github.io/tibblify/
GNU General Public License v3.0

tib_df and empty array #182

Open krlmlr opened 1 year ago

krlmlr commented 1 year ago

I'm seeing confusing references to "colmajor" in the error messages when tib_df() parses an empty JSON array []. What am I doing wrong?

CC @TSchiefer.

library(tibblify)

json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [] }]'
nested_list <- jsonlite::fromJSON(json)

spec <- tibblify::guess_tspec(nested_list)
spec
#> tspec_df(
#>   tib_int("a"),
#>   tib_df(
#>     "b",
#>     tib_int("c", required = FALSE),
#>     tib_int("d", required = FALSE),
#>   ),
#> )
tibblify::tibblify(nested_list, spec)
#> Error in `tibblify::tibblify()`:
#> ! Problem while tibblifying `x$b[[2]]$c`
#> Caused by error in `withCallingHandlers()`:
#> ! Field is absent in colmajor.
#> ℹ In file 'add-value.c' at line 395.
#> ℹ This is an internal error that was detected in the base package.
#> Backtrace:
#>     ▆
#>  1. ├─tibblify::tibblify(nested_list, spec)
#>  2. │ └─rlang::try_fetch(...)
#>  3. │   ├─base::tryCatch(...)
#>  4. │   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  5. │   │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  6. │   │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  7. │   └─base::withCallingHandlers(...)
#>  8. └─rlang:::stop_internal_c_lib(...)
#>  9.   └─rlang::abort(message, call = call, .internal = TRUE, .frame = frame)

json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [{ "c": 1 }] }]'
nested_list <- jsonlite::fromJSON(json)

spec <- tibblify::guess_tspec(nested_list)
spec
#> tspec_df(
#>   tib_int("a"),
#>   tib_df(
#>     "b",
#>     tib_int("c"),
#>     tib_int("d", required = FALSE),
#>   ),
#> )
tibblify::tibblify(nested_list, spec)
#> Error in `tibblify::tibblify()`:
#> ! Field d is required but does not exist in `x$b[[2]]`.
#> ℹ For `.input_form = "colmajor"` every field is required.
#> Backtrace:
#>      ▆
#>   1. ├─tibblify::tibblify(nested_list, spec)
#>   2. │ └─rlang::try_fetch(...)
#>   3. │   ├─base::tryCatch(...)
#>   4. │   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   5. │   │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   6. │   │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   7. │   └─base::withCallingHandlers(...)
#>   8. └─tibblify:::stop_required_colmajor(`<named list>`)
#>   9.   └─tibblify:::tibblify_abort(msg)
#>  10.     └─cli::cli_abort(..., class = "tibblify_error", .envir = .envir)
#>  11.       └─rlang::abort(...)

json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": null }]'
nested_list <- jsonlite::fromJSON(json)

spec <- tibblify::guess_tspec(nested_list)
spec
#> tspec_df(
#>   tib_int("a"),
#>   tib_df(
#>     "b",
#>     tib_int("c", required = FALSE),
#>     tib_int("d", required = FALSE),
#>   ),
#> )
tibblify::tibblify(nested_list, spec)
#> # A tibble: 2 × 2
#>       a                  b
#>   <int> <list<tibble[,2]>>
#> 1     1            [2 × 2]
#> 2     2

Created on 2023-04-17 with reprex v2.0.2

mgirlich commented 1 year ago

This is because jsonlite::fromJSON() returns a data frame here, and tibblify uses the colmajor code path whenever the input is a data frame. That does make the error message quite confusing. Regarding the errors themselves:

  1. Empty tibble
json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [] }]'
nested_list <- tibble::as_tibble(jsonlite::fromJSON(json))
nested_list
#> # A tibble: 2 × 2
#>       a b           
#>   <int> <list>      
#> 1     1 <df [2 × 2]>
#> 2     2 <df [0 × 0]>

Created on 2023-07-07 with reprex v2.0.2

In the colmajor format (and therefore for data frames) all columns are required. So to me it makes some sense to error here, but the message is quite confusing.

  2. No column d
json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [{ "c": 1 }] }]'
nested_list <- tibble::as_tibble(jsonlite::fromJSON(json))
nested_list
#> # A tibble: 2 × 2
#>       a b           
#>   <int> <list>      
#> 1     1 <df [2 × 2]>
#> 2     2 <df [1 × 1]>

Created on 2023-07-07 with reprex v2.0.2

Basically the same case as before.

  3. NULL
json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": null }]'
nested_list <- tibble::as_tibble(jsonlite::fromJSON(json))
nested_list
#> # A tibble: 2 × 2
#>       a b           
#>   <int> <list>      
#> 1     1 <df [2 × 2]>
#> 2     2 <NULL>

Created on 2023-07-07 with reprex v2.0.2

This works because NULL gets a special treatment as the missing value of a list.
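To see which code path applies to your own input, it may help to check the shape of the parsed object before calling tibblify(). A minimal sketch, assuming only jsonlite and the JSON from the first case above:

```r
library(jsonlite)

json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [] }]'

# With the defaults, fromJSON() simplifies an array of records to a data frame.
parsed <- fromJSON(json)
is.data.frame(parsed)

# Column b is a list column; str() shows what each element was simplified to.
str(parsed$b)
```

If is.data.frame() returns TRUE, the input is treated as colmajor and every field in the spec is required.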

mgirlich commented 1 year ago

But it is also a bit annoying that all examples work with the same spec if using simplifyDataFrame = FALSE.
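A sketch of that variant, assuming the spec guessed in the first example of the issue (with both c and d optional) and the JSON from the empty-array case. The spec is written out by hand here rather than guessed, which is an assumption about which spec is meant:

```r
library(tibblify)

json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [] }]'

# Keep records as nested lists instead of simplifying them to data frames,
# so the input is not a data frame and the colmajor path is not used.
nested_list <- jsonlite::fromJSON(json, simplifyDataFrame = FALSE)
is.data.frame(nested_list)

# Spec copied from the guess_tspec() output shown in the issue.
spec <- tspec_df(
  tib_int("a"),
  tib_df(
    "b",
    tib_int("c", required = FALSE),
    tib_int("d", required = FALSE)
  )
)

result <- tibblify(nested_list, spec)
result
```

The same call should also apply to the other two JSON strings in the issue by swapping the json value.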