r-lib / httr

httr: a friendly http package for R
https://httr.r-lib.org

Forbidden with httr, but works with py requests #738

Closed MislavSag closed 1 year ago

MislavSag commented 1 year ago

I am trying to get release dates from the US BLS website: https://www.bls.gov/bls/news-release/cpi.htm

When I send a simple GET request, I get a 403 error:

library(httr)

p <- GET(
  "https://www.bls.gov/bls/news-release/cpi.htm",
  user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36")
)
status_code(p)
# [1] 403

But when I send the same request with Python, it works as expected:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
}

p = requests.get("https://www.bls.gov/bls/news-release/cpi.htm", headers=headers)

p.status_code
# 200

I can't understand why the request is forbidden from R but not from Python :)

MislavSag commented 1 year ago

EDIT: It doesn't work with httr2 either:

library(httr2)

req <- request("https://www.bls.gov/bls/news-release/cpi.htm") %>%
  req_headers("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36") %>% 
  req_method("GET")
req %>% req_dry_run()
resp <- req_perform(req)

jennybc commented 1 year ago

I don't know what the equivalent is for requests (or if it exists). But you can put an httr call inside httr::with_verbose() (and likewise httr2::with_verbosity()) to see exactly what's going out over the wire. That can be extremely helpful in cases like this, especially if you can compare to similar info for requests.

MislavSag commented 1 year ago

I have just tried it, but I don't see anything helpful:

-> GET /bls/news-release/cpi.htm HTTP/1.1
-> Host: www.bls.gov
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
-> 
<- HTTP/1.1 403 Forbidden
<- Server: AkamaiGHost
<- Mime-Version: 1.0
<- Content-Length: 1325
<- Cache-Control: no-cache, no-store, must-revalidate
<- Pragma: no-cache
<- Expires: 0
<- Content-Type: text/html
<- Date: Wed, 26 Apr 2023 19:27:28 GMT
<- Connection: keep-alive
<- 
Response [https://www.bls.gov/bls/news-release/cpi.htm]
  Date: 2023-04-26 19:27
  Status: 403
  Content-Type: text/html
  Size: 1.32 kB
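
One way to act on the comparison suggested above is to diff the request headers in this log against the headers requests sends on its own. The following is a minimal, stdlib-only sketch, not a definitive diagnosis: the names `HTTR_LOG`, `REQUESTS_DEFAULT_HEADERS`, `sent_headers`, and `missing_headers` are hypothetical, and the set of requests default header names is written out by hand as an assumption (with requests installed, `requests.utils.default_headers()` gives the live set).

```python
# Outgoing lines copied from the httr verbose log above ("->" marks what was sent).
HTTR_LOG = """\
-> GET /bls/news-release/cpi.htm HTTP/1.1
-> Host: www.bls.gov
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
->
<- HTTP/1.1 403 Forbidden
<- Server: AkamaiGHost
"""

# Assumption: header names requests adds to every request by default,
# typed by hand here rather than read from the library.
REQUESTS_DEFAULT_HEADERS = {"User-Agent", "Accept-Encoding", "Accept", "Connection"}


def sent_headers(verbose_log):
    """Return {name: value} for the outgoing ("->") header lines of a verbose log."""
    headers = {}
    for line in verbose_log.splitlines():
        if not line.startswith("->"):
            continue  # skip response lines ("<- ...")
        name, sep, value = line[2:].strip().partition(": ")
        if sep:  # the request line and the blank terminator are not "name: value"
            headers[name] = value
    return headers


def missing_headers(sent, expected_names):
    """Names in expected_names absent from sent, compared case-insensitively."""
    sent_lower = {name.lower() for name in sent}
    return sorted(n for n in expected_names if n.lower() not in sent_lower)


httr_headers = sent_headers(HTTR_LOG)
print(missing_headers(httr_headers, REQUESTS_DEFAULT_HEADERS))  # prints ['Connection']
```

The only name reported is Connection, so that header is a reasonable first thing to add on the R side when trying to match what requests sends. Header values (for example the differing Accept lists) are not compared here.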

arunamalla commented 1 year ago

It's working fine for me with both httr and httr2 in R. HTTP calls from R or RStudio depend heavily on the underlying/default web browser settings and preferences. Please make sure these are the same in the default Python environment and the R environment.

MislavSag commented 1 year ago

It worked after I set the Connection header to keep-alive. In requests, this header is set by default.

jennybc commented 1 year ago

Just to close the loop, do you want to share the code that worked? Is this issue resolved now?

MislavSag commented 1 year ago

It is resolved.

Here is the code:

GET(
  "https://download.bls.gov/pub/time.series/cu/cu.item",
  add_headers('Connection' = 'keep-alive',
              "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36")
)