Closed tx991020 closed 6 years ago
@tx991020 thanks for the report. What do you suggest to solve the problem? What about using https://github.com/saintfish/chardet ?
This solution was suggested to me by @imroc/req, but I'm not sure it is the best approach.
You can also refer to Jsoup https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/helper/DataUtil.java
package encoding

import (
	"log"
	"net/http"
	"regexp"
	"strings"

	iconv "github.com/djimenez/iconv-go"
	"github.com/imroc/req"
)

func UTF8(resp *req.Resp) (s string, err error) {
	encoding := Guess(resp)
	if encoding == "utf-8" {
		s = resp.String()
		return
	}
	s, err = iconv.ConvertString(resp.String(), encoding, "utf-8")
	return
}

var regMeta = regexp.MustCompile(`<meta[^>]+charset=["']([^"']+)['"][^>]*>`)

func getEncodingFromBody(body string) string {
	result := regMeta.FindStringSubmatch(body)
	if len(result) < 2 {
		return ""
	}
	encoding := result[1]
	if len(encoding) == 0 {
		return ""
	}
	return strings.ToLower(encoding)
}

var regCharset = regexp.MustCompile(`charset=(\S+)`)

func getEncodingFromHeader(header http.Header) string {
	contentType := header.Get("Content-Type")
	if contentType == "" {
		return ""
	}
	result := regCharset.FindStringSubmatch(contentType)
	if len(result) < 2 {
		return ""
	}
	encoding := result[1]
	if len(encoding) == 0 {
		return ""
	}
	return strings.ToLower(encoding)
}

func guess(resp *req.Resp) string {
	encoding := getEncodingFromHeader(resp.Response().Header)
	if encoding != "" {
		log.Println("encoding header:", encoding)
		return encoding
	}
	encoding = getEncodingFromBody(resp.String())
	if encoding != "" {
		log.Println("encoding body:", encoding)
		return encoding
	}
	return "utf-8"
}

func Guess(resp *req.Resp) string {
	encoding := guess(resp)
	if encoding == "gb2312" {
		return "gbk"
	}
	return encoding
}
I'd avoid parsing HTML with regexp. What are the benefits of your solution over chardet?
It saves time: the charset is read from the response header or from the returned body, rather than being guessed directly with chardet. Guessing takes a lot of time.
We should measure if chardet is that slow.
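One rough way to measure this is to time repeated calls on a fixed body. The sketch below is only an illustration: the `detector` type, `averageCost`, and the stand-in `asciiOrUnknown` are hypothetical names, and the stand-in is not chardet. To measure chardet itself, a closure wrapping its detection call would be passed instead.

```go
package main

import (
	"fmt"
	"time"
)

// detector is a hypothetical signature: it takes a response body and
// returns a charset name ("" when unknown).
type detector func(body []byte) string

// averageCost calls detect n times on body and returns the mean
// wall-clock time per call. It returns 0 when n is not positive.
func averageCost(detect detector, body []byte, n int) time.Duration {
	if n <= 0 {
		return 0
	}
	start := time.Now()
	for i := 0; i < n; i++ {
		detect(body)
	}
	return time.Since(start) / time.Duration(n)
}

// asciiOrUnknown is a stand-in detector used only to make the sketch
// runnable: it reports "ascii" when every byte is below 0x80.
func asciiOrUnknown(body []byte) string {
	for _, b := range body {
		if b >= 0x80 {
			return ""
		}
	}
	return "ascii"
}

func main() {
	body := []byte("<html><head><title>test</title></head></html>")
	fmt.Println("per call:", averageCost(asciiOrUnknown, body, 10000))
}
```

In a PR, the same loop would more usually be written as a `testing.B` benchmark so that `go test -bench` can report the numbers.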
The integration of chardet would be pretty straightforward:
--- a/colly.go
+++ b/colly.go
@@ -30,6 +30,7 @@ import (
 	"github.com/PuerkitoBio/goquery"
 	"github.com/kennygrant/sanitize"
+	"github.com/saintfish/chardet"
 	"github.com/temoto/robotstxt"
 )
@@ -960,8 +961,15 @@ func randomBoundary() string {
 func (r *Response) fixCharset() {
 	contentType := strings.ToLower(r.Headers.Get("Content-Type"))
 	if !strings.Contains(contentType, "charset") {
-		return
+		d := chardet.NewTextDetector()
+		r, err := d.DetectBest(r.Body)
+		if err != nil {
+			return
+		}
+		contentType = r.Charset
 	}
 	if strings.Contains(contentType, "utf-8") || strings.Contains(contentType, "utf8") {
 		return
As an initial solution I've added chardet, which can be enabled by setting Collector.DetectCharset to true. If you have a better solution, please open a PR, preferably with tests/benchmarks.
package main

import (
	"fmt"
)

func main() {
	c := colly.NewCollector()
}

Visiting &map[User-Agent:[colly - https://github.com/gocolly/colly]]
Visited &map[Content-Type:[text/html] Connection:[keep-alive] Vary:[Accept-Encoding] Expires:[Sat, 26 Jul 1997 05:00:00 GMT] Dpool_header:[luna139] Pragma:[no-cache] Sina-Lb:[aGEuMjAyLmcxLnloZy5sYi5zaW5hbm9kZS5jb20=] Server:[nginx/1.6.1] Date:[Thu, 28 Dec 2017 13:42:51 GMT] Cache-Control:[no-cache, must-revalidate] Sina-Ts:[N2FiMjljY2UgMCAxIDEgMiA2Cg==]]
Link found: "\xb9ر\xd5" -> javascript:history.go(-1);
Link found: "" -> javascript:;
Link found: "\xbb\xbbһ\xd5\xc5" -> javascript:;
Link found: "\xb5\xc7¼" -> javascript:;
Link found: "\xb5\xda\xc8\xfd\xb7\xbd\xd5ʺ\xc5" -> https://passport.weibo.cn/signin/other?r=http%3A%2F%2Fweibo.cn
Link found: "ע\xb2\xe1\xd5ʺ\xc5" -> http://m.weibo.cn/reg/index?&vt=4&wm=3349&wentry=&backURL=http%3A%2F%2Fweibo.cn
Link found: "\xcd\xfc\xbc\xc7\xc3\xdc\xc2\xeb" -> https://passport.weibo.cn/forgot/forgot?entry=wapsso&from=0
Link found: "ȡ\xcf\xfb" -> javascript:;
Link found: "\xd1\xe9֤\xc2\xeb\xb5\xc7¼" -> javascript:;
Link found: "\xb9ر\xd5" -> javascript:history.go(-1);
Link found: "ȷ\xc8\xcf" -> javascript:;
Link found: "ʹ\xd3\xc3\xc6\xe4\xcb\xfb\xd5ʺŵ\xc7¼" -> javascript:;