Open foolbear opened 7 years ago
Similar to #22, which was resolved, but I still get an error with the same code.
let doc = Ji(htmlString: string!/*, encoding: String.Encoding(rawValue: gbk)*/)
The code above also fails, both with and without the encoding argument (I think initializing without the gbk encoding should make sense, since the string is already decoded).
The full test code and console error output follow:
let doc = Ji(htmlURL: URL(string: "http://m.263xs.com/info-83441/")!)
// encoding error : input conversion failed due to input error, bytes 0x2D 0xD6 0xEC 0xD6
// parser error : Internal error, xmlCopyCharMultiByte 0x1F7D7B out of bound
let data = try? Data(contentsOf: URL(string: "http://m.263xs.com/info-83441/")!)
let gbk = CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(CFStringEncodings.GB_18030_2000.rawValue))
let string = String(data: data!, encoding: String.Encoding(rawValue: gbk))
// it's OK
let doc1 = Ji(htmlString: string!)
// encoding error : input conversion failed due to input error, bytes 0x2C 0xE5 0x9E 0x8B
let doc2 = Ji(htmlString: string!, encoding: String.Encoding(rawValue: gbk))
// encoding error : input conversion failed due to input error, bytes 0x2D 0xD6 0xEC 0xD6
// parser error : Internal error, xmlCopyCharMultiByte 0x1F7D7B out of bound
let doc3 = Ji(data: data, encoding: String.Encoding(rawValue: gbk), isXML: false)
// encoding error : input conversion failed due to input error, bytes 0x2D 0xD6 0xEC 0xD6
// parser error : Internal error, xmlCopyCharMultiByte 0x1F7D7B out of bound
@honghaoz @zixun
@foolbear Hi, I tried the code below:
let data = try? Data(contentsOf: URL(string: "http://m.263xs.com/info-83441/")!)
// Encodings
let gbk = CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(CFStringEncodings.GB_18030_2000.rawValue))
let gbkEncoding = String.Encoding(rawValue: gbk)
let utf8Encoding = String.Encoding.utf8
// Try with GBK encoded string
let gbkString = String(data: data!, encoding: gbkEncoding)
let gbkDoc = Ji(htmlString: gbkString!, encoding: gbkEncoding)!
print(gbkDoc.xPath("//head/title")!) // Fails
// Try with UTF8 encoded data
let utf8Data = gbkString?.data(using: utf8Encoding)
let utf8Doc = Ji(data: utf8Data, encoding: utf8Encoding, isXML: false)!
print(utf8Doc.xPath("//head/title")!) // Succeeds with warnings, which can be ignored
The output is
encoding error : input conversion failed due to input error, bytes 0x2D 0xD6 0xEC 0xD6
parser error : Internal error, xmlCopyCharMultiByte 0x1F7D7B out of bound
[nil]
encoding error : input conversion failed due to input error, bytes 0x2C 0xE5 0x9E 0x8B
[<title>型月幻想乡的超越者-朱之月-科幻小说-263小说-免费小说阅读网手机阅读</title>]
So the second print prints the expected result.
This comment is helpful.
The error comes from libxml2, which Ji uses. libxml2 works well with UTF-8. So my idea is to decode the GBK data into a string and then re-encode that string as UTF-8 Data, so that Ji can parse it.
There is still a warning, but you can ignore it and continue parsing what you need.
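The decode-then-re-encode step described above can be sketched as a small helper. This is only a minimal sketch that assumes Foundation on an Apple platform; the function name transcodeGB18030ToUTF8 is hypothetical and is not part of Ji.

```swift
import Foundation

/// Hypothetical helper (not part of Ji): decodes GB 18030 bytes into a String
/// and re-encodes that String as UTF-8 data.
/// Returns nil if the input is not valid GB 18030.
func transcodeGB18030ToUTF8(_ data: Data) -> Data? {
    // Same encoding lookup as in the snippets above.
    let cfEncoding = CFStringEncoding(CFStringEncodings.GB_18030_2000.rawValue)
    let gbk = String.Encoding(rawValue: CFStringConvertEncodingToNSStringEncoding(cfEncoding))

    // Decode the raw bytes first; this is the step that can fail.
    guard let text = String(data: data, encoding: gbk) else { return nil }

    // Re-encode as UTF-8, the encoding libxml2 handles well.
    return text.data(using: .utf8)
}
```

The result can then be passed to Ji(data:encoding:isXML:) with .utf8, as in the utf8Doc example above.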
This approach works best: strip out the entire head. Otherwise some pages produce warnings and some still cannot be processed.
// 0x0632 is the CFStringEncoding value for GB_18030_2000
let enc = CFStringConvertEncodingToNSStringEncoding(0x0632)
guard let htmlStr = String(data: data, encoding: String.Encoding(rawValue: enc)), let html = htmlStr.components(separatedBy: "</head>").last else {
return nil
}
guard let doc = Ji(htmlString: html, encoding: .utf8) else {
return nil
}
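The head-stripping step can also be isolated into a standalone function, sketched below. The name stripHead is hypothetical, and the case-insensitive search is an addition that the original snippet does not have. One plausible reason this helps (an assumption, not confirmed in this thread) is that the head still declares a GBK-family charset in a meta tag, which conflicts with the UTF-8 bytes handed to the parser.

```swift
import Foundation

/// Hypothetical helper: returns everything after the last "</head>" tag,
/// or the whole string unchanged when no such tag is present.
func stripHead(from html: String) -> String {
    // Search backwards so the behavior matches taking the last component
    // of components(separatedBy: "</head>"), but ignore letter case.
    guard let end = html.range(of: "</head>", options: [.caseInsensitive, .backwards]) else {
        return html
    }
    return String(html[end.upperBound...])
}
```

Note that after stripping, XPath queries against the head (such as //head/title) will no longer find the original elements, so this suits cases where only the body content is needed.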