r-lib / urlchecker

Run CRAN URL checks from older versions of R
https://urlchecker.r-lib.org/
GNU General Public License v3.0

404 but website exists when running in parallel #15

Open MichaelChirico opened 3 years ago

MichaelChirico commented 3 years ago

As seen at https://github.com/jimhester/lintr/issues/828:

https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp triggers a 404, but I have no trouble navigating there in Firefox or Chrome (also tested Chrome Incognito). Not sure what to make of the issue.

MichaelChirico commented 3 years ago

It seems to be an issue with running in parallel -- urlchecker::url_check(parallel = FALSE) passes.

jimhester commented 3 years ago

I get a 404 in command line curl as well. It seems like that website doesn't properly support HEAD requests.

```sh
curl -I 'https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp'
HTTP/2 404
```

When parallel = FALSE, the code uses R's built-in curlGetHeaders() function: https://github.com/r-lib/urlchecker/blob/022f0b04ac2f56daee668a9ed54868a829734728/inst/tools/urltools.R#L688

But actually I get a 404 from that one as well.

```r
curlGetHeaders("https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp")
 [1] "HTTP/2 404 \r\n"
```

I am not sure why this is not showing up in both cases, then. Possibly it is a bug in the way the output is handled in the tools package?

MichaelChirico commented 3 years ago

I am also getting a 404 from curl -I and curlGetHeaders() :thinking: but still no error from url_check(parallel = FALSE):

```r
trace(urlchecker::url_check, at = 3L, quote({
  res <- tools$check_url_db(
    db[grepl("visualstudio", db$URL), ],
    parallel = parallel, pool = pool, verbose = progress
  )
  dput(res)
  cat("Done\n")
}))
```

Then

```r
urlchecker::url_check(parallel = FALSE, progress = FALSE)
# structure(list(URL = character(0), From = list(), Status = character(0), 
#     Message = character(0), New = character(0), CRAN = character(0), 
#     Spaces = character(0), R = character(0)), row.names = integer(0), class = c("check_url_db", 
# "data.frame"))
# Done
```

vs. with parallel = TRUE:

```r
urlchecker::url_check(parallel = TRUE, progress = FALSE)
# structure(list(URL = "https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp", 
#     From = list(`https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp` = "README.md"), 
#     Status = "404", Message = "Not Found", New = "", CRAN = "", 
#     Spaces = "", R = ""), row.names = c(NA, -1L), class = c("check_url_db", 
# "data.frame"))
# Done
```

Same result when tracing to use tools:::check_url_db() instead:

```r
trace(urlchecker::url_check, at = 3L, quote({
  res <- tools:::check_url_db(db[grepl("visualstudio", db$URL), ])
  dput(res)
  cat("Done\n")
}))
urlchecker::url_check(progress = FALSE)
# structure(list(URL = character(0), From = list(), Status = character(0), 
#     Message = character(0), New = character(0), CRAN = character(0), 
#     Spaces = character(0), R = character(0)), row.names = integer(0), class = c("check_url_db", 
# "data.frame"))
# Done
```

MichaelChirico commented 3 years ago

OK, I see now... tools:::check_url_db() runs .check_http_A(), which does call curlGetHeaders() and gets status 404.

But then it follows up by running .curl_GET_status():

https://github.com/wch/r-source/blob/a4efc0c972d4aede0258348fd7ed6b0d7b27dd32/src/library/tools/R/urltools.R#L505

That call goes on to succeed. Why it succeeds is beyond me (setting cookies, maybe?):

https://github.com/wch/r-source/blob/a4efc0c972d4aede0258348fd7ed6b0d7b27dd32/src/library/tools/R/urltools.R#L786-L815
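
So the flow is roughly the following. This is a condensed sketch of the base R logic, not the actual tools source (the real .curl_GET_status() differs in detail, and I am using the curl package for the GET step):

```r
# Condensed sketch: HEAD-style check first, then fall back to a full GET
# when the HEAD did not return 200.
check_one_url <- function(u) {
  status <- tryCatch(
    attr(curlGetHeaders(u), "status"),  # headers-only request, as above
    error = function(e) -1L
  )
  if (!identical(status, 200L)) {
    get_status <- tryCatch(
      curl::curl_fetch_memory(u)$status_code,  # full GET request
      error = function(e) -1L
    )
    # If the GET succeeds, the URL is treated as fine despite the HEAD 404
    if (identical(get_status, 200L)) status <- 200L
  }
  status
}

check_one_url("https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp")
# expected: 200, because the GET succeeds where the HEAD gets a 404
```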

gaborcsardi commented 2 years ago

What happens is that the base R function tries a GET request for every URL that did not return 200, and if the GET request returns 200, that result is used. We should probably do the same in urlchecker.
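
A possible shape for that fallback (a hypothetical sketch only; the function and variable names here are made up, and the real fix would live in the vendored inst/tools/urltools.R):

```r
# Hypothetical: after the parallel HEAD pass, re-check every non-200 URL
# with a plain GET and drop it from the failures if the GET returns 200.
recheck_with_get <- function(urls) {
  vapply(
    urls,
    function(u) {
      tryCatch(
        curl::curl_fetch_memory(u)$status_code,
        error = function(e) -1L
      )
    },
    integer(1)
  )
}

# res: the check_url_db data frame produced by the parallel pass
flagged <- res$URL[res$Status != "200"]
res <- res[!(res$URL %in% flagged[recheck_with_get(flagged) == 200L]), ]
```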