gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Cannot auto guess encoding from html body on www.sdu.edu.cn #155

Closed by imfht 6 years ago

imfht commented 6 years ago

http://www.sdu.edu.cn does not send any encoding information in its HTTP headers, but it does declare the encoding in an HTML meta tag. Colly does not auto-detect the encoding from the response. Code snippet:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Print the response body exactly as Colly hands it over.
    c.OnResponse(func(response *colly.Response) {
        fmt.Println(string(response.Body))
    })

    c.Visit("http://www.sdu.edu.cn")
}

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<meta name="description" content="ɽ����ѧ,ɽ��,ɽ����ѧ�ٷ���վ,ɽ�����,SDU" />
<meta name="keywords" content="ɽ����ѧ,ɽ��,ɽ����ѧ�ٷ���վ,ɽ�����,SDU" />
<meta name="Copyright" content="Copyright (c) 2010 www.sdu.edu.cn All Rights Reserved." />
<title>��ӭ����ɽ����ѧ��ҳ</title>
<link href="../2010/images/style2.css" rel="stylesheet" type="text/css" />
<link href="../2010/images/ad2.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="../2010/images/ad.js" ></script>

Can Colly automatically detect the encoding from the headers or the HTML body, as Scrapy does?
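
For illustration, here is a minimal standard-library sketch of what "getting the encoding from the header or the HTML body" could look like. The helper name charsetLabel, the regular expression, and the 1024-byte scan window are assumptions made for this example; this is not how Colly or Scrapy implement detection, and a real sniffer handles many more cases. It only finds the declared label and does not decode the body.

```go
package main

import (
	"fmt"
	"mime"
	"regexp"
	"strings"
)

// metaCharsetRe is a simplified pattern for <meta ... charset=...> declarations.
// It is an assumption for this sketch, not a full HTML parser.
var metaCharsetRe = regexp.MustCompile(`(?i)<meta[^>]+charset\s*=\s*["']?([a-z0-9_-]+)`)

// charsetLabel (hypothetical helper) returns the declared charset label,
// preferring the Content-Type header and falling back to a <meta> tag
// found near the start of the body. It returns "" if nothing is declared.
func charsetLabel(contentType string, body []byte) string {
	// 1. Look for a charset parameter in the Content-Type header.
	if _, params, err := mime.ParseMediaType(contentType); err == nil {
		if cs := params["charset"]; cs != "" {
			return strings.ToLower(cs)
		}
	}

	// 2. Fall back to a <meta> declaration in the first 1024 bytes.
	head := body
	if len(head) > 1024 {
		head = head[:1024]
	}
	if m := metaCharsetRe.FindSubmatch(head); m != nil {
		return strings.ToLower(string(m[1]))
	}
	return ""
}

func main() {
	body := []byte(`<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head>`)

	// No charset in the header: the meta tag decides.
	fmt.Println(charsetLabel("text/html", body)) // prints "gb2312"

	// A charset in the header takes precedence over the meta tag.
	fmt.Println(charsetLabel("text/html; charset=UTF-8", body)) // prints "utf-8"
}
```

Once the label is known, the body still has to be converted to UTF-8 with a decoder for that encoding, which the standard library does not provide for GB2312.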

asciimoo commented 6 years ago

Try to use Colly's character detection heuristics: set c.DetectCharset to true before scraping.

https://godoc.org/github.com/gocolly/colly#DetectCharset