ropensci / targets

Function-oriented Make-like declarative workflows for R
https://docs.ropensci.org/targets/

Parquet targets do not support list-columns of ordered factors #1291

Closed lgaborini closed 3 months ago

lgaborini commented 3 months ago

Bug

Using {targets} 1.7.0, {tarchetypes} 0.9.0, {arrow} 16.1.0.

A bit of a niche issue!

I have a target that is a tibble containing a list-column. It works fine if the elements of the list-column are unordered factors:

library(targets)
tar_script({
    list(
        tar_target(data, tibble::tibble(x = list(
            factor(c("a", "b"), levels = c("c", "b", "a"), ordered = FALSE),
            factor(c("b", "c"), levels = c("c", "b", "a"), ordered = FALSE)
        )), format = "parquet"),
        tar_target(summary, print(data$x))
    )
})
tar_make()
#> ▶ dispatched target data
#> ● completed target data [0 seconds]
#> ▶ dispatched target summary
#> [[1]]
#> [1] a b
#> Levels: c b a
#> 
#> [[2]]
#> [1] b c
#> Levels: c b a
#> 
#> ● completed target summary [0 seconds]
#> ▶ ended pipeline [0.34 seconds]

Created on 2024-05-30 with reprex v2.1.0

If the factors are ordered, the Parquet write fails:

library(targets)
tar_script({
    list(
        tar_target(data, tibble::tibble(x = list(
            factor(c("a", "b"), levels = c("c", "b", "a"), ordered = TRUE),
            factor(c("b", "c"), levels = c("c", "b", "a"), ordered = TRUE)
        )), format = "parquet"),
        tar_target(summary, print(data$x))
    )
})
tar_make()
#> ▶ dispatched target data
#> ✖ errored target data
#> ✖ errored pipeline [0.34 seconds]
#> Error:
#> ! Error running targets::tar_make()
#> Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
#> Debugging guide: https://books.ropensci.org/targets/debugging.html
#> How to ask for help: https://books.ropensci.org/targets/help.html
#> Last error message:
#>     _store_ Invalid: Column data for field 0 with type list<item: dictionary<values=string, indices=int8, ordered=0>> is inconsistent with schema list<item: dictionary<values=string, indices=int8, ordered=1>>
#> Last error traceback:
#>     No traceback available.

There are no issues if the tibble or data.frame contains regular (non-list) factor columns, whether ordered or unordered:

library(targets)
tar_script({
    list(tar_target(data, tibble::tibble(x = factor(c("a", "b"), 
        levels = c("c", "b", "a"), ordered = TRUE)), format = "parquet"), 
        tar_target(data2, tibble::tibble(x = factor(c("a", "b"), 
            levels = c("c", "b", "a"), ordered = FALSE)), format = "parquet"), 
        tar_target(data3, data.frame(x = factor(c("a", "b"), 
            levels = c("c", "b", "a"), ordered = TRUE)), format = "parquet"), 
        tar_target(data4, data.frame(x = factor(c("a", "b"), 
            levels = c("c", "b", "a"), ordered = FALSE)), format = "parquet"), 
        tar_target(summary, {
            print(data$x)
            print(data2$x)
            print(data3$x)
            print(data4$x)
        }))
})
tar_make()
#> ▶ dispatched target data
#> ● completed target data [0 seconds]
#> ▶ dispatched target data2
#> ● completed target data2 [0 seconds]
#> ▶ dispatched target data3
#> ● completed target data3 [0 seconds]
#> ▶ dispatched target data4
#> ● completed target data4 [0 seconds]
#> ▶ dispatched target summary
#> [1] a b
#> Levels: c < b < a
#> [1] a b
#> Levels: c b a
#> [1] a b
#> Levels: c < b < a
#> [1] a b
#> Levels: c b a
#> ● completed target summary [0 seconds]
#> ▶ ended pipeline [0.44 seconds]

List-columns in plain data.frames are awkward to build or non-functional, so I did not test those.
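For reference, one way to attach a list-column to a base data.frame is by assignment after construction. This is only a base R sketch: the `id` column is a hypothetical placeholder, and whether arrow treats this column the same way as the tibble version was not tested here.

```r
# Start from a regular column (hypothetical `id`), then attach the
# list-column by assignment; the list must have one element per row.
df <- data.frame(id = 1:2)
df$x <- list(
  factor(c("a", "b"), levels = c("c", "b", "a"), ordered = TRUE),
  factor(c("b", "c"), levels = c("c", "b", "a"), ordered = TRUE)
)

# Inspect the structure of the list-column.
str(df$x)
```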

Thanks!

wlandau commented 3 months ago

This is an arrow issue, not a targets issue. The same error occurs when calling arrow::write_parquet() directly, without targets:

data <- tibble::tibble(
  x = list(
    factor(c("a", "b"), levels = c("c", "b", "a"), ordered = TRUE),
    factor(c("b", "c"), levels = c("c", "b", "a"), ordered = TRUE)
  )
)
arrow::write_parquet(data, tempfile())
#> Error: Invalid: Column data for field 0 with type list<item: dictionary<values=string, indices=int8, ordered=0>> is inconsistent with schema list<item: dictionary<values=string, indices=int8, ordered=1>>

Created on 2024-05-30 with reprex v2.1.0

Please file this as a bug report at https://github.com/apache/arrow/issues or ask for help on a general forum like Stack Overflow.
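Until this is resolved upstream, a possible workaround is to drop the ordering before the data is stored and restore it in a downstream target. The sketch below uses base R only and relies on the observation earlier in this thread that list-columns of unordered factors are written without error. The helper names `drop_ordered()` and `restore_ordered()` are hypothetical, and the round trip through Parquet itself is not exercised here.

```r
# Two ordered factors, as in the failing example.
lvls <- c("c", "b", "a")
x <- list(
  factor(c("a", "b"), levels = lvls, ordered = TRUE),
  factor(c("b", "c"), levels = lvls, ordered = TRUE)
)

# Hypothetical helper: remove the "ordered" class from each element.
# Levels are passed explicitly so that unused levels are not dropped.
drop_ordered <- function(col) {
  lapply(col, function(f) factor(f, levels = levels(f), ordered = FALSE))
}

# Hypothetical helper: mark each element as ordered again,
# keeping the existing level order.
restore_ordered <- function(col) {
  lapply(col, as.ordered)
}

stored <- drop_ordered(x)            # what the parquet target would hold
restored <- restore_ordered(stored)  # what a downstream target would use
```

In a pipeline, `drop_ordered()` would wrap the list-column inside the `data` target, and `restore_ordered()` would be applied in the targets that consume it. This assumes every element shares the same, meaningful level order.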