r-lib / httr2

Make HTTP requests and process their responses. A modern reimagining of httr.
https://httr2.r-lib.org
Other
237 stars 57 forks source link

Zero-length values (e.g., `list()`) output by `resp_body_json` causing zero-length tibbles #552

Closed awpsoras closed 1 hour ago

awpsoras commented 5 hours ago

In processing JSON data that contained a list entry, empty values were converted into empty list() values by the resp_body_json function. This resulted in a 0-length tibble output by dplyr::bind_rows. I embarked on an (interesting) rabbit hole to figure out what was causing these zero-row tibbles.

Here is the forum post / commentary, and here it the original rmarkdown where I explored the issue prior to posting. Forum Link: https://forum.posit.co/t/json-data-and-bind-rows-unnest-bug-report/191581 Rpubs Link: https://rpubs.com/seadoo/bind_rows_and_json

Problem data (derived from list of lists structure output from resp_body_json:

some_list <- list(
  list(
    id = 001,
    name = "bob",
    age = 23,
    country = NULL,
    items = list("apple", "banana", "pear")
  ),
  list(
    id = 002,
    name = "sam",
    age = NULL,
    country = NULL,
    items = list()
  ),
  list(
    id = 003,
    name = "joe",
    age = NULL,
    country = NULL,
    items = list()
  )
)

Desired (tidy) output:

# A tibble: 5 × 5
     id name    age country items 
  <dbl> <chr> <dbl> <lgl>   <chr> 
1     1 bob      23 NA      apple 
2     1 bob      23 NA      banana
3     1 bob      23 NA      pear  
4     2 sam      NA NA      NA    
5     3 joe      NA NA      NA  

Current solution and comparison to data.table::rbindlist:

some_list |>
  purrr::map_depth(2, \(.x) if(length(.x) == 0) NA else .x) |>
  dplyr::bind_rows() |>
  tidyr::unnest(items, keep_empty = TRUE)

some_list |>
  data.table::rbindlist() |>
  tidyr::unnest(items, keep_empty = TRUE)

I think an additional argument to resp_body_json to handle length-zero values (and optionally also replace NULL with NA to retain all-null columns) would greatly improve the utility of the tidyverse when dealing with JSON data.

I'm happy to work on this further if you believe it is a valuable addition.

hadley commented 3 hours ago

Can you please provide a reprex that demonstrates the problem? resp_body_json() is a pretty thin wrapper around jsonlite::fromJSON() so it's fairly unlikely this is a httr2 issue at heart.

awpsoras commented 3 hours ago

Yes, the example is reproduced extensively in both of the links. Did you mean JSON as well? The structure of some_list is from my actual use case where I used httr2 to process data from an API using resp_body_json.

some_list as JSON:

{
  "results": [
  {
    "id": 1,
    "name": "bob",
    "age": 23,
    "country": null,
    "items": ["apple", "banana", "pear"]
  },
  {
    "id": 2,
    "name": "sam",
    "age": null,
    "country": null,
    "items": []
  },
  {
    "id": 3,
    "name": "joe",
    "age": null,
    "country": null,
    "items": []
  }
  ]
}

Here is the same example below:

some_list <- list(
    list(
        id = 001,
        name = "bob",
        age = 23,
        country = NULL,
        items = list("apple", "banana", "pear")
    ),
    list(
        id = 002,
        name = "sam",
        age = NULL,
        country = NULL,
        items = list()
    ),
    list(
        id = 003,
        name = "joe",
        age = NULL,
        country = NULL,
        items = list()
    )
)

some_list |>
    dplyr::bind_rows()
#> # A tibble: 3 × 4
#>      id name    age items    
#>   <dbl> <chr> <dbl> <list>   
#> 1     1 bob      23 <chr [1]>
#> 2     1 bob      23 <chr [1]>
#> 3     1 bob      23 <chr [1]>

Basically, the default output of resp_body_json is incompatible with the rest of the tidyverse whenever empty lists are present in the original JSON without carefully removing all of the length-zero entries. I would love for that to not be the case!

Since I do not expect the base definitions of bind_rows, tibble or vctrs to change, it seems like the best option would be to make the interpreted JSON more compatible with the tidyverse, either by adding helpers in httr2 or in jsonlite. Otherwise fixes like the solution in my original post are required.

Edit: Interestingly, piping the JSON string into jsonlite::fromJSON() directly does replace the nulls and empty lists correctly. When using resp_body_json, the nulls and empty lists remain in the output.

'{
  "results": [
  {
    "id": 1,
    "name": "bob",
    "age": 23,
    "country": null,
    "items": ["apple", "banana", "pear"]
  },
  {
    "id": 2,
    "name": "sam",
    "age": null,
    "country": null,
    "items": []
  },
  {
    "id": 3,
    "name": "joe",
    "age": null,
    "country": null,
    "items": []
  }
  ]
}' |>
  jsonlite::fromJSON()
#> $results
#>   id name age country               items
#> 1  1  bob  23      NA apple, banana, pear
#> 2  2  sam  NA      NA                    
#> 3  3  joe  NA      NA
hadley commented 1 hour ago

This doesn't seem like it's related to httr2?

awpsoras commented 1 hour ago

After more testing, it does appear more that the httr2 resp_body_json function is the culprit here. This pipeline works fine with the original JSON data on my machine (which I am unfortunately unable to share, hence the reconstruction above).

resp |>
  resp_body_string() |>
  jsonlite::fromJSON() |>
  pluck("results") |>
  bind_rows() |>
  as_tibble()

# Output, tibble with 25 cols, one list-col as expected

resp |>
  resp_body_json() |>
  pluck("results") |>
  bind_rows()

# Output, tibble with 0 cols and 0 rows

There is a difference in the output of jsonlite::fromJSON() and httr2::resp_body_json() which to me indicates that it is a httr2 issue, right?

Or alternatively it's an issue with the resp class since it looks like resp_body_json stored the result in the class before returning it

awpsoras commented 1 hour ago

Yup, the difference is that jsonlite::fromJSON() has the simplifyVector = TRUE as default, while httr2 does not. So you get different results if you use httr2::resp_body_json vs. jsonlite::fromJSON. Here is a full example if this helps you.

json <- '{
  "results": [
    {
      "id": 1,
      "name": "bob",
      "age": 23,
      "country": null,
      "items": ["apple", "banana", "pear"]
    },
    {
      "id": 2,
      "name": "sam",
      "age": null,
      "country": null,
      "items": []
    },
    {
      "id": 3,
      "name": "joe",
      "age": null,
      "country": null,
      "items": []
    }
  ]
}' 

# using default params
jsonlite::fromJSON(json)
#> $results
#>   id name age country               items
#> 1  1  bob  23      NA apple, banana, pear
#> 2  2  sam  NA      NA                    
#> 3  3  joe  NA      NA

# using httr2 params
jsonlite::fromJSON(json, simplifyVector = FALSE)
#> $results
#> $results[[1]]
#> $results[[1]]$id
#> [1] 1
#> 
#> $results[[1]]$name
#> [1] "bob"
#> 
#> $results[[1]]$age
#> [1] 23
#> 
#> $results[[1]]$country
#> NULL
#> 
#> $results[[1]]$items
#> $results[[1]]$items[[1]]
#> [1] "apple"
#> 
#> $results[[1]]$items[[2]]
#> [1] "banana"
#> 
#> $results[[1]]$items[[3]]
#> [1] "pear"
#> 
#> 
#> 
#> $results[[2]]
#> $results[[2]]$id
#> [1] 2
#> 
#> $results[[2]]$name
#> [1] "sam"
#> 
#> $results[[2]]$age
#> NULL
#> 
#> $results[[2]]$country
#> NULL
#> 
#> $results[[2]]$items
#> list()
#> 
#> 
#> $results[[3]]
#> $results[[3]]$id
#> [1] 3
#> 
#> $results[[3]]$name
#> [1] "joe"
#> 
#> $results[[3]]$age
#> NULL
#> 
#> $results[[3]]$country
#> NULL
#> 
#> $results[[3]]$items
#> list()

Created on 2024-09-20 with reprex v2.1.1