jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Support HTML redirect (meta tag refresh)? #371

Closed sunng87 closed 3 weeks ago

sunng87 commented 10 years ago

We ran into this kind of redirect when parsing HTML files. Some websites don't use standard HTTP redirection. Instead, they redirect in the browser with a meta refresh tag:

<html>
 <head> 
  <meta http-equiv="Refresh" content="0;URL=http://sports.sina.com.cn/j/2013-11-14/22386885017.shtml" /> 
 </head> 
 <body></body>
</html>

We can check the Refresh meta tag: if the URL in its content attribute differs from the base URL, treat it as a redirect. This could be done within HTTPConnection. I found that HTTPie, a Python command-line utility, supports this feature. It would be nice to have it in jsoup.
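The check described above could be sketched roughly as follows, using only the Java standard library. `MetaRefresh` and `target` are hypothetical names, not part of jsoup, and the pattern covers only the common `delay;URL=...` form seen in this thread rather than every variant browsers accept.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup.
final class MetaRefresh {
    // Matches values such as "0;URL=http://example.com/" or "5; url='/next'".
    // A bare delay such as "30" (reload only, no URL) does not match.
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*\\d+\\s*[;,]\\s*(?:url\\s*=\\s*)?['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the redirect target named by a meta refresh content value,
     * resolved against baseUrl, or empty if there is no target or the
     * target is the same as baseUrl (a plain page reload).
     */
    static Optional<String> target(String content, String baseUrl) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        try {
            // Resolve relative targets against the page's base URL.
            String resolved = URI.create(baseUrl).resolve(m.group(1).trim()).toString();
            return resolved.equals(baseUrl) ? Optional.empty() : Optional.of(resolved);
        } catch (IllegalArgumentException e) {
            // The URL contained characters java.net.URI rejects; treat as no redirect.
            return Optional.empty();
        }
    }
}
```

With jsoup, the `content` argument would presumably come from the `content` attribute of the `meta` element whose `http-equiv` is `refresh`, and `baseUrl` from the URL the document was fetched from.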

jhy commented 10 years ago

I think it'd have to be opt-in, like Connection.followMetaRedirects(true).

zhuhw commented 8 years ago

It does not work for http://baidu.com either. The HTML I got is:

<html>
 <head>
  <meta http-equiv="refresh" content="0;url=http://www.baidu.com/"> 
 </head>
 <body> 
 </body>
</html>

I used Connection.followRedirects(true) in version 1.8.3.
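Since followRedirects(true) only covers HTTP-level redirects, one possible workaround is to follow meta refresh targets in the calling code. Below is a minimal sketch of such a loop with a hop limit and cycle detection. `MetaRedirectFollower`, `resolveFinalUrl`, and the `nextTarget` callback are hypothetical names, not jsoup API; the callback stands in for "fetch the page and return its refresh target, if any".

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of jsoup.
final class MetaRedirectFollower {
    /**
     * Follows meta refresh targets starting at startUrl and returns the last
     * URL reached. nextTarget returns the refresh target for a URL, or empty
     * when that page has none. At most maxHops redirects are followed, and
     * the loop stops early if a URL repeats.
     */
    static String resolveFinalUrl(String startUrl,
                                  Function<String, Optional<String>> nextTarget,
                                  int maxHops) {
        Set<String> seen = new HashSet<>();
        String current = startUrl;
        for (int hop = 0; hop < maxHops; hop++) {
            if (!seen.add(current)) {
                break; // already visited: redirect cycle
            }
            Optional<String> next = nextTarget.apply(current);
            if (!next.isPresent()) {
                break; // no meta refresh on this page
            }
            current = next.get();
        }
        return current;
    }
}
```

In real use, `nextTarget` would fetch each URL (for example with Jsoup.connect(url).get()) and read the refresh tag's content attribute. Keeping the fetch behind a callback makes the loop easy to test without network access.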

jhy commented 3 weeks ago

Closing, no plan to implement.