r-lib / httr2

Make HTTP requests and process their responses. A modern reimagining of httr.
https://httr2.r-lib.org
Other
235 stars 56 forks source link

Failure to download 0 byte file, in parallel: resp_has_body(x)) {: missing value where TRUE/FALSE needed #478

Closed PietrH closed 2 months ago

PietrH commented 3 months ago

I'm downloading a bunch of files in parallel, some turn out to be 0 bytes.

Example of file: https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5

This works;

zero_byte_file_url <- "https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5"

# Fetch the file
httr2::request(zero_byte_file_url) |>
  httr2::req_perform(path = basename(zero_byte_file_url))

# Check if we downloaded the file
file.exists(basename(zero_byte_file_url))
# Clean up after ourselves
file.remove(basename(zero_byte_file_url))

This works as well:

# Function to create a list of requests, with retries
create_file_requests <- function(urls) {
  purrr::map(urls,
             ~httr2::req_retry(
               httr2::request(.x),
               max_tries = 10)
  )
}

# Fetch the file
create_file_requests(zero_byte_file_url) %>% 
  httr2::req_perform_sequential(paths = basename(zero_byte_file_url))

# Check if we downloaded the file
file.exists(basename(zero_byte_file_url))
# Clean up after ourselves
file.remove(basename(zero_byte_file_url))

However, this doesn't work in parallel. Initially I thought it might be due to the retry not being allowed in parallel (although the documentation claims it would just get ignored), but it doesn't work with this omitted either:

# Function to create a list of requests, no retries this time! 
create_file_requests_no_retry <- function(urls) {
  purrr::map(urls, 
             ~httr2::request(.x)
  )
}

# Fetch the file, this fails
create_file_requests_no_retry(zero_byte_file_url) |>
  httr2::req_perform_parallel(paths = basename(zero_byte_file_url))
# Check if we downloaded the file
file.exists(basename(zero_byte_file_url))

I trigger a condition in resp_has_body() that doesn't have a clear message:

GET
https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5
Status: 200 OK
Content-Type: binary/octet-stream
Error in if (!resp_has_body(x)) { : missing value where TRUE/FALSE needed

reprex

# single request ----------------------------------------------------------

zero_byte_file_url <- "https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5"

# Fetch the file
httr2::request(zero_byte_file_url) |>
  httr2::req_perform(path = basename(zero_byte_file_url))
#> <httr2_response>
#> GET
#> https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5
#> Status: 200 OK
#> Content-Type: binary/octet-stream
#> Body: None

# Check if we downloaded the file
file.exists(basename(zero_byte_file_url))
#> [1] TRUE
# Clean up after ourselves
file.remove(basename(zero_byte_file_url))
#> [1] TRUE

# sequentially ------------------------------------------------------------

# Function to create a list of requests, with retries
create_file_requests <- function(urls) {
  purrr::map(urls,
             ~httr2::req_retry(
               httr2::request(.x),
               max_tries = 10)
  )
}

# Fetch the file
create_file_requests(zero_byte_file_url) |>
  httr2::req_perform_sequential(paths = basename(zero_byte_file_url))
#> [[1]]
#> <httr2_response>
#> GET
#> https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5
#> Status: 200 OK
#> Content-Type: binary/octet-stream
#> Body: None

# Check if we downloaded the file
file.exists(basename(zero_byte_file_url))
#> [1] TRUE
# Clean up after ourselves
file.remove(basename(zero_byte_file_url))
#> [1] TRUE

# in parallel, no retries this time ---------------------------------------

create_file_requests_no_retry <- function(urls) {
  purrr::map(urls, 
             ~httr2::request(.x)
  )
}

# Fetch the file, this fails
create_file_requests_no_retry(zero_byte_file_url) |>
  httr2::req_perform_parallel(paths = basename(zero_byte_file_url))
#> [[1]]
#> <httr2_response>
#> GET
#> https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5
#> Status: 200 OK
#> Content-Type: binary/octet-stream
#> Error in if (!resp_has_body(x)) {: missing value where TRUE/FALSE needed
# Check if we downloaded the file
file.exists(basename(zero_byte_file_url))
#> [1] FALSE

Created on 2024-06-07 with reprex v2.1.0

hadley commented 3 months ago

Somewhat more minimal reprex:

library(httr2)
url <- "https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5"

reqs <- list(request(url))
resps <- req_perform_parallel(reqs, paths = withr::local_tempfile())
resps[[1]]
#> <httr2_response>
#> GET
#> https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/frbla/2021/02/28/frbla_vp_20210228T190000Z_0xb.h5
#> Status: 200 OK
#> Content-Type: binary/octet-stream
#> Error in if (!resp_has_body(x)) {: missing value where TRUE/FALSE needed

Created on 2024-06-07 with reprex v2.1.0

Looks like the problem is that req_parallel() doesn't create the file is the body is zero bytes. This is probably due to some difference between the single and multithreaded curl API, and could be resolved in Performance$succeed by creating the file if it doesn't exist.

PietrH commented 3 months ago

I'm using sequential downloads as a fallback on failed parallel downloads anyway, so by adding an extra condition to the fallback I was able to workaround this issue.

Ideally req_perform_parallel() would just create the empty files, otherwise, an extra error message would be helpful so it's easier to figure out what's going wrong.

hadley commented 3 months ago

I think we can just fix this bug rather than emitting a message.