Closed imfht closed 6 years ago
http://www.sdu.edu.cn does not declare an encoding in its response headers, but it does declare one in an HTML meta tag. Colly does not auto-detect the encoding from the response. Code snippet:
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
	// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
	//colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// Print the raw response body exactly as colly delivers it.
	c.OnResponse(func(response *colly.Response) {
		fmt.Println(string(response.Body))
	})

	c.Visit("http://www.sdu.edu.cn")
}
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=gb2312" /> <meta name="description" content="ɽ����ѧ,ɽ��,ɽ����ѧ�ٷ���վ,ɽ�����,SDU" /> <meta name="keywords" content="ɽ����ѧ,ɽ��,ɽ����ѧ�ٷ���վ,ɽ�����,SDU" /> <meta name="Copyright" content="Copyright (c) 2010 www.sdu.edu.cn All Rights Reserved." /> <title>��ӭ����ɽ����ѧ��ҳ</title> <link href="../2010/images/style2.css" rel="stylesheet" type="text/css" /> <link href="../2010/images/ad2.css" rel="stylesheet" type="text/css" /> <script type="text/javascript" src="../2010/images/ad.js" ></script>
Can Colly automatically detect the encoding from the headers or the HTML body, as Scrapy does?
Try Colly's character detection heuristics: set c.DetectCharset to true on the collector (or pass colly.DetectCharset() to colly.NewCollector) before scraping.
https://godoc.org/github.com/gocolly/colly#DetectCharset
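For context on the question above, here is a minimal sketch of what "encoding info in the headers or in the HTML meta" looks like in code, using only the Go standard library. It is not how Colly's DetectCharset works (per the reply above, that relies on detection heuristics). The helper name declaredCharset is hypothetical, and the regular expression and the 1024-byte scan limit are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"mime"
	"regexp"
	"strings"
)

// metaCharsetRe matches a charset declaration inside a <meta> tag. It covers both
// <meta charset="..."> and <meta http-equiv="Content-Type" content="...; charset=...">.
// Assumption: a simple pattern is good enough for a sketch; it is not a full HTML parser.
var metaCharsetRe = regexp.MustCompile(`(?i)<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_-]+)`)

// declaredCharset is a hypothetical helper (not part of colly). It returns the
// lowercased charset named in the Content-Type header if there is one, otherwise
// the charset named in a <meta> tag near the start of the body, otherwise "".
func declaredCharset(contentType string, body []byte) string {
	// 1. Prefer the HTTP header, e.g. "text/html; charset=UTF-8".
	if _, params, err := mime.ParseMediaType(contentType); err == nil {
		if cs := params["charset"]; cs != "" {
			return strings.ToLower(cs)
		}
	}

	// 2. Fall back to the HTML itself, scanning only the first 1024 bytes.
	head := body
	if len(head) > 1024 {
		head = head[:1024]
	}
	if m := metaCharsetRe.FindSubmatch(head); m != nil {
		return strings.ToLower(string(m[1]))
	}
	return ""
}

func main() {
	// Same meta tag as in the output posted in this issue.
	body := []byte(`<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head></html>`)

	fmt.Println(declaredCharset("text/html", body))                // prints "gb2312"
	fmt.Println(declaredCharset("text/html; charset=UTF-8", body)) // prints "utf-8"
}
```

For the page in this issue, the header yields nothing and the meta tag yields gb2312. This only finds the declared name; converting the body to UTF-8 would still need a decoder (for example from the golang.org/x/text packages), which is not shown here.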