gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

fixCharset() also has a bug: the response header Content-Type is 'text/html' and contains no charset, but the real charset is GBK. #73

Closed tx991020 closed 6 years ago

tx991020 commented 6 years ago

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        // Print link
        fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.Headers)
    })
    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Headers)
    })

    c.Visit("https://weibo.cn/repost/FByvKgel6?uid=6049100503&rl=1")
}

Output:

Visiting &map[User-Agent:[colly - https://github.com/gocolly/colly]]
Visited &map[Content-Type:[text/html] Connection:[keep-alive] Vary:[Accept-Encoding] Expires:[Sat, 26 Jul 1997 05:00:00 GMT] Dpool_header:[luna139] Pragma:[no-cache] Sina-Lb:[aGEuMjAyLmcxLnloZy5sYi5zaW5hbm9kZS5jb20=] Server:[nginx/1.6.1] Date:[Thu, 28 Dec 2017 13:42:51 GMT] Cache-Control:[no-cache, must-revalidate] Sina-Ts:[N2FiMjljY2UgMCAxIDEgMiA2Cg==]]
Link found: "\xb9ر\xd5" -> javascript:history.go(-1);
Link found: "" -> javascript:;
Link found: "\xbb\xbbһ\xd5\xc5" -> javascript:;
Link found: "\xb5\xc7¼" -> javascript:;
Link found: "\xb5\xda\xc8\xfd\xb7\xbd\xd5ʺ\xc5" -> https://passport.weibo.cn/signin/other?r=http%3A%2F%2Fweibo.cn
Link found: "ע\xb2\xe1\xd5ʺ\xc5" -> http://m.weibo.cn/reg/index?&vt=4&wm=3349&wentry=&backURL=http%3A%2F%2Fweibo.cn
Link found: "\xcd\xfc\xbc\xc7\xc3\xdc\xc2\xeb" -> https://passport.weibo.cn/forgot/forgot?entry=wapsso&from=0
Link found: "ȡ\xcf\xfb" -> javascript:;
Link found: "\xd1\xe9֤\xc2\xeb\xb5\xc7¼" -> javascript:;
Link found: "\xb9ر\xd5" -> javascript:history.go(-1);
Link found: "ȷ\xc8\xcf" -> javascript:;
Link found: "ʹ\xd3\xc3\xc6\xe4\xcb\xfb\xd5ʺŵ\xc7¼" -> javascript:;
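Editorial note: the escaped link texts in the output above are what Go's %q verb prints when a string is not valid UTF-8. The sketch below uses only the standard library to reproduce the first link text from its raw bytes. The helper name `describe` is made up for illustration, and reading the four bytes B9 D8 B1 D5 as GBK text is an assumption based on the issue title.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports whether s is valid UTF-8 and returns its %q form.
// %q escapes every byte that is not part of a valid UTF-8 sequence.
// (Hypothetical helper, not part of colly.)
func describe(s string) (valid bool, quoted string) {
	return utf8.ValidString(s), fmt.Sprintf("%q", s)
}

func main() {
	// Raw bytes of the first link text, as they would arrive from a page
	// served in GBK (an assumption) without a declared charset.
	raw := "\xb9\xd8\xb1\xd5"

	valid, quoted := describe(raw)
	fmt.Println(valid)  // false
	fmt.Println(quoted) // "\xb9ر\xd5"
}
```

The middle two bytes (D8 B1) happen to form a valid two-byte UTF-8 sequence, which is why one stray character appears between the escaped bytes.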

asciimoo commented 6 years ago

@tx991020 thanks for the report. What do you suggest to solve the problem? What about using https://github.com/saintfish/chardet ?

tx991020 commented 6 years ago

This solution comes from @imroc/req, but I'm not sure whether it is the best approach.
You can also refer to Jsoup: https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/helper/DataUtil.java

package encoding

import (
    "log"
    "net/http"
    "regexp"
    "strings"

    iconv "github.com/djimenez/iconv-go"
    "github.com/imroc/req"
)

// UTF8 returns the response body converted to UTF-8.
func UTF8(resp *req.Resp) (s string, err error) {
    encoding := Guess(resp)
    if encoding == "utf-8" {
        s = resp.String()
        return
    }
    s, err = iconv.ConvertString(resp.String(), encoding, "utf-8")
    return
}

// regMeta captures the charset declared in an HTML <meta> tag.
var regMeta = regexp.MustCompile(`<meta[^>]+charset=["']([^"']+)['"][^>]*>`)

func getEncodingFromBody(body string) string {
    result := regMeta.FindStringSubmatch(body)
    if len(result) < 2 {
        return ""
    }
    encoding := result[1]
    if len(encoding) == 0 {
        return ""
    }
    return strings.ToLower(encoding)
}

// regCharset captures the charset parameter of a Content-Type header.
var regCharset = regexp.MustCompile(`charset=(\S+)`)

func getEncodingFromHeader(header http.Header) string {
    contentType := header.Get("Content-Type")
    if contentType == "" {
        return ""
    }
    result := regCharset.FindStringSubmatch(contentType)
    if len(result) < 2 {
        return ""
    }
    encoding := result[1]
    if len(encoding) == 0 {
        return ""
    }
    return strings.ToLower(encoding)
}

// guess checks the header first, then the body, and defaults to utf-8.
func guess(resp *req.Resp) string {
    encoding := getEncodingFromHeader(resp.Response().Header)
    if encoding != "" {
        log.Println("encoding header:", encoding)
        return encoding
    }

    encoding = getEncodingFromBody(resp.String())
    if encoding != "" {
        log.Println("encoding body:", encoding)
        return encoding
    }
    return "utf-8"
}

// Guess is like guess but treats gb2312 as gbk.
func Guess(resp *req.Resp) string {
    encoding := guess(resp)
    if encoding == "gb2312" {
        return "gbk"
    }
    return encoding
}

asciimoo commented 6 years ago

I'd avoid parsing HTML with regexp. What are the benefits of your solution over chardet?

tx991020 commented 6 years ago

It can save time: it determines the charset from the response header or the returned body instead of guessing it directly with chardet, which can take a lot of time.

asciimoo commented 6 years ago

We should measure if chardet is that slow.

asciimoo commented 6 years ago

The integration of chardet would be pretty straightforward:

--- a/colly.go
+++ b/colly.go
@@ -30,6 +30,7 @@ import (

        "github.com/PuerkitoBio/goquery"
        "github.com/kennygrant/sanitize"
+       "github.com/saintfish/chardet"
        "github.com/temoto/robotstxt"
 )

@@ -960,8 +961,15 @@ func randomBoundary() string {

 func (r *Response) fixCharset() {
        contentType := strings.ToLower(r.Headers.Get("Content-Type"))
        if !strings.Contains(contentType, "charset") {
-               return
+               d := chardet.NewTextDetector()
+               r, err := d.DetectBest(r.Body)
+               if err != nil {
+                       return
+               }
+               contentType = r.Charset
        }
        if strings.Contains(contentType, "utf-8") || strings.Contains(contentType, "utf8") {
                return

asciimoo commented 6 years ago

As an initial solution I've added chardet, which can be enabled by setting Collector.DetectCharset to true. If you have a better solution, please open a PR, preferably with tests/benchmarks.