jamesturk / scrapelib

⛏ a library for scraping unreliable pages
https://jamesturk.github.io/scrapelib/
BSD 2-Clause "Simplified" License

Follow <meta> redirects #9

schmod closed 11 years ago

schmod commented 11 years ago

scrapelib doesn't seem to follow <meta> redirects. While this is a somewhat old and non-standard way to do redirection in 2013, it's still out there on a few government sites.

Here's one example:

$ curl http://www.risch.senate.gov
<html>
<head>
<meta http-equiv="Refresh"
content="0;url=http://www.risch.senate.gov/public/">
</head>

There's an example on StackOverflow for how this could be implemented.

Currently also being discussed in unitedstates/congress-legislators#85

jamesturk commented 11 years ago

Interesting, hadn't considered this need before. I'm concerned about following it automatically, though. Is there a real advantage to that over just searching for the meta tag and/or updating the URL you scrape?
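
For reference, here is a minimal sketch of the "search for the meta tag" approach, using only the Python standard library. The names `MetaRefreshParser` and `find_meta_refresh` are hypothetical and not part of scrapelib. The sketch assumes the `content` attribute has the usual `delay;url=target` shape and only reports the target; whether to fetch it is left to the caller.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        # Attribute names arrive lowercased; values may be None.
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # Assumed shape of content: "0;url=http://example.com/"
        match = re.match(
            r"\s*\d+\s*[;,]\s*url\s*=\s*(.+)",
            attrs.get("content") or "",
            re.IGNORECASE,
        )
        if match:
            self.target = match.group(1).strip().strip("'\"")


def find_meta_refresh(html, base_url):
    """Return the absolute redirect target, or None if there is no meta refresh."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    # Resolve relative targets against the page that was fetched.
    return urljoin(base_url, parser.target)


if __name__ == "__main__":
    page = (
        "<html>\n<head>\n"
        '<meta http-equiv="Refresh"\n'
        'content="0;url=http://www.risch.senate.gov/public/">\n'
        "</head>\n"
    )
    print(find_meta_refresh(page, "http://www.risch.senate.gov"))
```

Because the function only returns the target URL, the caller decides whether to request it, which sidesteps the concern about following redirects automatically.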

jamesturk commented 11 years ago

Seeing as this was fixed elsewhere in unitedstates/congress-legislators, and following these redirects automatically can probably lead to unexpected behavior, I'm going to close this without a fix.