honghaoz / Ji

Ji (戟) is an XML/HTML parser for Swift
MIT License

gbk error "encoding error : input conversion failed due to input error"(swift3) #44

Open foolbear opened 7 years ago

foolbear commented 7 years ago
let data = try? Data(contentsOf: URL(string: "http://m.263xs.com/info-83441/")!)        
let gbk = CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(CFStringEncodings.GB_18030_2000.rawValue))
let string = String(data: data!, encoding: String.Encoding(rawValue: gbk)) // OK: the string decodes correctly
let doc = Ji(data: data, encoding: String.Encoding(rawValue: gbk), isXML: false) // Error: "encoding error : input conversion failed due to input error"
foolbear commented 7 years ago

This is similar to #22, which was resolved, but I still get the error with the same code.

let doc = Ji(htmlString: string!/*, encoding: String.Encoding(rawValue: gbk)*/)

The code above also fails, both with and without the encoding argument (initializing without the GBK encoding should make sense here, I think, since the string is already decoded).

foolbear commented 7 years ago

Full test code and console error output follow:

let doc = Ji(htmlURL: URL(string: "http://m.263xs.com/info-83441/")!)
// encoding error : input conversion failed due to input error, bytes 0x2D 0xD6 0xEC 0xD6
// parser error : Internal error, xmlCopyCharMultiByte 0x1F7D7B out of bound

let data = try? Data(contentsOf: URL(string: "http://m.263xs.com/info-83441/")!)
let gbk = CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(CFStringEncodings.GB_18030_2000.rawValue))
let string = String(data: data!, encoding: String.Encoding(rawValue: gbk))
// it's OK

let doc1 = Ji(htmlString: string!)
// encoding error : input conversion failed due to input error, bytes 0x2C 0xE5 0x9E 0x8B

let doc2 = Ji(htmlString: string!, encoding: String.Encoding(rawValue: gbk))
// encoding error : input conversion failed due to input error, bytes 0x2D 0xD6 0xEC 0xD6
// parser error : Internal error, xmlCopyCharMultiByte 0x1F7D7B out of bound

let doc3 = Ji(data: data, encoding: String.Encoding(rawValue: gbk), isXML: false)
// encoding error : input conversion failed due to input error, bytes 0x2D 0xD6 0xEC 0xD6
// parser error : Internal error, xmlCopyCharMultiByte 0x1F7D7B out of bound
foolbear commented 7 years ago

@honghaoz @zixun

honghaoz commented 7 years ago

@foolbear Hi, I tried the code below:

let data = try? Data(contentsOf: URL(string: "http://m.263xs.com/info-83441/")!)

// Encodings
let gbk = CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(CFStringEncodings.GB_18030_2000.rawValue))
let gbkEncoding = String.Encoding(rawValue: gbk)
let utf8Encoding = String.Encoding.utf8

// Try with GBK encoded string
let gbkString = String(data: data!, encoding: gbkEncoding)
let gbkDoc = Ji(htmlString: gbkString!, encoding: gbkEncoding)!

print(gbkDoc.xPath("//head/title")!) // Fails

// Try with UTF8 encoded data
let utf8Data = gbkString?.data(using: utf8Encoding)
let utf8Doc = Ji(data: utf8Data, encoding: utf8Encoding, isXML: false)!

print(utf8Doc.xPath("//head/title")!) // Succeeds with warnings, which can be ignored

The output is

encoding error : input conversion failed due to input error, bytes 0x2D 0xD6 0xEC 0xD6
parser error : Internal error, xmlCopyCharMultiByte 0x1F7D7B out of bound
[nil]
encoding error : input conversion failed due to input error, bytes 0x2C 0xE5 0x9E 0x8B
[<title>型月幻想乡的超越者-朱之月-科幻小说-263小说-免费小说阅读网手机阅读</title>]

So the second print produces the expected result.

This comment is helpful.

The error comes from libxml2, which Ji uses. libxml2 works well with UTF-8 input, so my idea is to decode the GBK data into a String first and then re-encode that String as UTF-8 Data, which Ji can parse.

Even though there is still a warning, you can ignore it and continue parsing what you need.
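
A minimal sketch of the conversion step described above, assuming an Apple platform where Foundation exposes the CoreFoundation encoding constants. The names `gb18030` and `utf8Data(fromGBK:)` are hypothetical and only for illustration; they are not part of Ji.

```swift
import Foundation

// GB 18030 as a String.Encoding, built the same way as in the snippets above.
let gb18030 = String.Encoding(rawValue:
    CFStringConvertEncodingToNSStringEncoding(
        CFStringEncoding(CFStringEncodings.GB_18030_2000.rawValue)))

/// Hypothetical helper (not part of Ji): decodes GBK / GB 18030 bytes into a
/// String and re-encodes that String as UTF-8 Data.
/// Returns nil if the bytes cannot be decoded as GB 18030.
func utf8Data(fromGBK data: Data) -> Data? {
    guard let text = String(data: data, encoding: gb18030) else { return nil }
    return text.data(using: .utf8)
}
```

The returned Data can then be passed to `Ji(data:encoding:isXML:)` with `.utf8`, as in the `utf8Doc` line above.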

huhuegg commented 6 years ago

This approach works best: drop the entire head section. Otherwise some pages still produce warnings and some still cannot be parsed.

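// 0x0632 is the CFStringEncoding value for GB 18030 (kCFStringEncodingGB_18030_2000)
let enc = CFStringConvertEncodingToNSStringEncoding(0x0632)
// Decode the GBK data, then keep only the text after the closing head tag
guard let htmlStr = String(data: data, encoding: String.Encoding(rawValue: enc)),
      let html = htmlStr.components(separatedBy: "</head>").last else {
    return nil
}

// The decoded String no longer carries the head, so parse it as UTF-8
guard let doc = Ji(htmlString: html, encoding: .utf8) else {
    return nil
}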
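
The head-stripping step can also be sketched as a standalone function. This is an illustrative variant, not the code from the comment above: `stripHead(from:)` is a hypothetical name, it matches the closing tag case-insensitively, and it cuts at the first `</head>` rather than the last.

```swift
import Foundation

/// Hypothetical helper (not part of Ji): returns the HTML that follows the
/// first closing head tag, or the input unchanged if no such tag is found.
func stripHead(from html: String) -> String {
    guard let end = html.range(of: "</head>", options: .caseInsensitive) else {
        return html
    }
    return String(html[end.upperBound...])
}
```

Note that anything inside the head, such as the title queried earlier with `//head/title`, is no longer available after this step.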