oblac / jodd

Jodd! Lightweight. Java. Zero dependencies. Use what you like.
https://jodd.org
BSD 2-Clause "Simplified" License

jodd.http.HttpBrowser.getPage() may not get correct charset #150

shawnye closed this issue 10 years ago

shawnye commented 10 years ago

jodd.http.HttpBrowser.getPage() may not pick up the correct charset, and HttpRequest.charset(charset) seems to have no effect. I have to convert the charset myself, as the following code shows:

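import java.io.UnsupportedEncodingException;

import jodd.http.HttpBrowser;
import jodd.http.HttpRequest;
import jodd.util.StringUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class WebPageFetcher {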
    private static Logger logger = LoggerFactory.getLogger(WebPageFetcher.class);

    private  HttpBrowser browser = new HttpBrowser();
    private String userAgent;

    private String charset;

    public String getUserAgent() {
        return userAgent;
    }

    public void setUserAgent(String userAgent) {
        this.userAgent = userAgent;
    } 

    public String getCharset() {
        return charset;
    }

    public void setCharset(String charset) {
        this.charset = charset;
    }

    public String fetch(String url){
        HttpRequest request = HttpRequest.get(url);
        if(StringUtil.isNotBlank(userAgent)){
            request.header("User-Agent",userAgent, true);
        }

//      request.charset(charset); // seems to have no effect
        browser.sendRequest(request); 

        if(browser.getHttpResponse() == null){
            logger.error("Error happened when getting page from url(no response):" + url); 
            return null;
        }

        int statusCode = browser.getHttpResponse().statusCode(); 
        if(statusCode >= 400){
            logger.error("Error happened when getting page from url:" + url);
            logger.error(browser.getHttpResponse().toString(false));
            return null;
        } 

        // Force the custom charset. **I hope Jodd can handle this for me.**
        if(StringUtil.isNotBlank(charset)){
            // The response was already null-checked above.
            byte[] bodyBytes = browser.getHttpResponse().bodyBytes();
            if(bodyBytes != null){
                try {
                    return new String(bodyBytes,charset);
                } catch (UnsupportedEncodingException e) {
                    logger.error("Failed to decode body bytes as " + charset, e);
                }
            }
            return null;
        }else{
            return browser.getPage();
        }

    }
}

You can try the following page URL without a custom charset; the Chinese text comes out garbled: http://blog.sina.com.cn/s/blog_6f2171a10100unux.html (the correct title is: chrome CSS广告过滤进阶设置)

However, I found that the page source contains a meta tag declaring the correct charset: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Maybe the web server does not send the correct charset. Can we set a custom charset with a method like HttpRequest.charset(charset)? Thank you.

igr commented 10 years ago

You are right, there is a problem with this page. The server does not send the charset in the response headers, only in the HTML. The Http tool does not parse the returned HTML content, so it cannot know which charset the HTML declares. (Maybe we could figure this out with Lagarto ;)

Anyway, you are right, adding a method to force the charset is a great idea, but it has to be set on the HttpResponse. Thank you!
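
For illustration only, here is a minimal sketch of how a charset declared only in the HTML could be detected from the raw body bytes using nothing but the JDK. MetaCharsetSniffer, sniff, decode and the 4096-byte scan limit are hypothetical names and choices made for this example; none of this is part of Jodd, and a real HTML parser such as Lagarto would be more robust than a regular expression.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this example; not part of Jodd.
final class MetaCharsetSniffer {

    // Matches charset=... inside a <meta ...> tag. Covers both
    // <meta charset="utf-8"> and the http-equiv Content-Type form.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    // How many leading bytes to scan; an arbitrary choice for this sketch.
    private static final int SCAN_LIMIT = 4096;

    /** Returns the declared charset name, or null if none is found or it is unsupported. */
    static String sniff(byte[] body) {
        int length = Math.min(body.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // stays readable even when the real encoding is something else.
        String head = new String(body, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        if (!matcher.find()) {
            return null;
        }
        String name = matcher.group(1);
        try {
            return Charset.isSupported(name) ? name : null;
        } catch (IllegalArgumentException e) {
            return null; // syntactically illegal charset name
        }
    }

    /** Decodes the body with the sniffed charset, or with the fallback if none is declared. */
    static String decode(byte[] body, Charset fallback) {
        String name = sniff(body);
        Charset charset = (name != null) ? Charset.forName(name) : fallback;
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Two CJK characters from the page title, written as Unicode escapes
        // so the source file stays pure ASCII.
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=utf-8\" />"
            + "<title>chrome CSS \u5e7f\u544a</title></head></html>";
        byte[] body = html.getBytes(StandardCharsets.UTF_8);

        System.out.println(sniff(body));
        System.out.println(decode(body, StandardCharsets.ISO_8859_1).equals(html));
    }
}
```

With Jodd, the sniffed name could then be handed to the response before reading the text, which is the same idea as forcing the charset by hand.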

igr commented 10 years ago

Except... this already exists ;)

HttpRequest request = HttpRequest.get("http://blog.sina.com.cn/s/blog_6f2171a10100unux.html");
HttpResponse response = request.send();

response.charset("utf-8");
String text = response.bodyText();

Just call the charset() method after you have received the response. That's all :)
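
To see why the charset must be set before the body is turned into text, here is a small JDK-only sketch that decodes the same bytes twice. It assumes that a client falls back to a single-byte charset such as ISO-8859-1 when the Content-Type header carries no charset; that default is an assumption for the demo, not something confirmed in this thread. CharsetMismatchDemo and its decode method are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use Jodd at all.
final class CharsetMismatchDemo {

    /** Decodes raw body bytes with the given charset, as an HTTP client would. */
    static String decode(byte[] wire, Charset charset) {
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        // Two CJK characters from the page title, written as Unicode escapes.
        String title = "\u5e7f\u544a";

        // What the server actually sends: UTF-8 bytes, three per character here.
        byte[] wire = title.getBytes(StandardCharsets.UTF_8);

        // Assumed fallback: every byte becomes its own (wrong) character.
        String garbled = decode(wire, StandardCharsets.ISO_8859_1);

        // Forced charset: the original text is recovered.
        String correct = decode(wire, StandardCharsets.UTF_8);

        System.out.println(wire.length);           // 6
        System.out.println(garbled.length());      // 6
        System.out.println(garbled.equals(title)); // false
        System.out.println(correct.equals(title)); // true
    }
}
```

The bytes on the wire are identical in both cases; only the charset chosen at decoding time differs, which is why the setting belongs on the response rather than on the request.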

shawnye commented 10 years ago

Yes, I should use response.charset(charset); instead of request.charset(charset);. Thank you!