congress.gov returning 403 for `rvest::read_html()`

judgelord / congressionalrecord

Scrape, parse, and analyze the Congressional Record

https://judgelord.github.io/congressionalrecord/

congress.gov returning 403 for `rvest::read_html()` #1

Open judgelord opened 8 months ago

judgelord commented 8 months ago

It looks like congress.gov is blocking whatever protocol rvest uses. I'm not sure what to do about this and don't have time to dig in right now, but I will try to figure it out.

> rvest::read_html("https://www.un.org/en/")
{html_document}
<html dir="ltr" lang="en">
[1] <head profile="http://www.w3.org/1999/xhtml/vocab">\n<meta charset="utf-8">\n<meta http- ...
[2] <body class="html front not-logged-in one-sidebar sidebar-first page-node i18n-en">\r\n\ ...
> rvest::read_html("https://www.congress.gov/")
Error in open.connection(x, "rb") : HTTP error 403.

ReneRejonP commented 7 months ago

Hi @judgelord, thanks for developing this package! It's great! I'm trying to use it to scrape the Congressional Record and do some text mining for an academic article. Unfortunately, I ran into the same issue and have no idea how to fix it. I would appreciate any updates! Thanks again for developing this!

judgelord commented 7 months ago

If you want to help, you could test out alternative web scraping packages in R. I can replace the rvest method if another method works.
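A minimal sketch of one such test, written in Python with only the standard library rather than in R. It assumes (this thread does not confirm it) that the 403 depends on the client's default `User-Agent` header, which is a common reason for servers to reject scripted requests. The `USER_AGENT` string and both function names are hypothetical.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Hypothetical identifier: replace with a string describing your own project.
USER_AGENT = "congressionalrecord-research-script (contact: you@example.org)"


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Build a GET request that sends an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code for url, including error codes such as 403."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is still informative.
        return err.code
```

Calling `fetch_status("https://www.congress.gov")` with and without a custom `user_agent` would show whether the header makes any difference; if it does not, the cause lies elsewhere.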

ReneRejonP commented 7 months ago

For sure! I’ll spend a few more hours on this next week. If I find another method, I’ll let you know!

judgelord commented 6 months ago

Update: it seems that congress.gov is no longer blocking us.

Nuohai-muxi commented 1 month ago

I wrote some Python code to replace the scraper.

judgelord commented 1 month ago

> I wrote some Python code to replace the scraper.

@Nuohai-muxi could you post a link to a repo?

judgelord commented 1 month ago

FWIW, `rvest::read_html("https://www.congress.gov")` works. If this package's functions are returning 403 errors, it may be due to backslashes at the end of URLs, which seem to make congress.gov return a 403. I will investigate.
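A minimal sketch of the kind of URL cleanup this would imply, in Python. It is not this package's code, and `strip_trailing_slashes` is a hypothetical helper name. Because the failing call earlier in this thread used a URL ending in `/`, the sketch strips trailing forward slashes as well as backslashes; whether either character actually triggers the 403 is still unverified.

```python
def strip_trailing_slashes(url: str) -> str:
    """Return url with surrounding whitespace and trailing slashes removed.

    Both forward slashes and backslashes at the end of the string are
    dropped; slashes elsewhere in the URL are left untouched.
    """
    # rstrip takes a set of characters: "/" and a single backslash.
    return url.strip().rstrip("/\\")
```

Passing every URL through a helper like this before requesting it would make it easy to check whether the 403s go away.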

Nuohai-muxi commented 1 month ago

@judgelord https://github.com/Nuohai-muxi/scraper-for-US-congress