tid-kijyun / Kanna

Kanna(鉋) is an XML/HTML parser for Swift.
MIT License
2.42k stars, 220 forks

Not getting back HTML from certain website, is it a parsing issue? #257

Closed. jaysonng closed this issue 2 years ago.

jaysonng commented 2 years ago

Description:

I'm building an app that parses Open Graph data from HTML. For this particular news site, parsing its articles fails with the error:

The operation couldn’t be completed. (Kanna.ParseError error 1.)

I'm hoping this is something we can fix in the Kanna XML parser, or is it a website issue? (I'm no XML expert, so I can't get further than knowing that I don't get back an HTML document.)

The link is

this article

If it helps, here are the response headers for that URL.

Retrieved data of size 188933, response = <NSHTTPURLResponse: 0x6000034ec3c0> { URL: https://www.philstar.com/headlines/2021/11/26/2143993/philippines-loosens-borders-coronavirus-cases-continue-drop } { Status Code: 200, Headers {
    "Accept-Ranges" = ( bytes );
    Age = ( 0 );
    "Cache-Control" = ( "no-store, no-cache, must-revalidate, post-check=0, pre-check=0" );
    Connection = ( "keep-alive" );
    "Content-Encoding" = ( gzip );
    "Content-Type" = ( "text/html; charset=utf8" );
    Date = ( "Fri, 26 Nov 2021 14:53:30 GMT" );
    Expires = ( "Thu, 19 Nov 1981 08:52:00 GMT" );
    "Keep-Alive" = ( "timeout=2" );
    P3P = ( "CP=\"IDC DSP COR CURa ADMa OUR IND PHY ONL COM STA\"" );
    Pragma = ( "no-cache" );
    "Referrer-Policy" = ( "strict-origin-when-cross-origin", "strict-origin-when-cross-origin" );
    Server = ( nginx );
    "Set-Cookie" = ( "PHPSESSID=dcbs3uih5l9fdgl9qqcpdu4g27; path=/; HTTPOnly; Secure; secure; HttpOnly", "visitor=y; expires=Sat, 26-Nov-2022 14:53:35 GMT; path=/; HTTPOnly; Secure" );
    "Transfer-Encoding" = ( Identity );
    Vary = ( "Accept-Encoding" );
    "X-Cache" = ( MISS );
    "X-Cache-Hits" = ( 0 );
} }

thanks,

Installation method:

Kanna version (or commit hash):

5.2.7

swift --version

Swift 5.5

Xcode version (optional):

13.1

jaysonng commented 2 years ago

After a bit more debugging, I found that the problem is the encoding.

Decoding the data with .utf8 doesn't work, but .ascii does.
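The behaviour described above can be reproduced in isolation. The following is a minimal sketch using only Foundation, not the actual page data: the sample bytes are made up, and .isoLatin1 stands in for the fallback encoding because it maps every byte value to a character (the sketch does not rely on how .ascii treats bytes above 0x7F).

```swift
import Foundation

// Hypothetical sample: "caf" in ASCII followed by 0xE9, which is "é" in
// Latin-1 but an incomplete multi-byte sequence in UTF-8.
let sampleBytes = Data([0x63, 0x61, 0x66, 0xE9])

// Strict UTF-8 decoding rejects the whole buffer because of one bad byte.
let asUTF8 = String(data: sampleBytes, encoding: .utf8)

// Latin-1 assigns a character to every byte value, so decoding succeeds.
let asLatin1 = String(data: sampleBytes, encoding: .isoLatin1)

print(asUTF8 ?? "nil")
print(asLatin1 ?? "nil")
```

A single invalid byte anywhere in a large page is enough to make the strict UTF-8 decoding return nil, which would be consistent with the error reported here.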

I fixed the problem by changing the function

HTML(html: Data, url: String? = nil, encoding: String.Encoding, option: ParseOption = kDefaultHtmlParseOption)

to

// Data overload: decode the bytes to a String, then delegate to the String overload.
public func HTML(html: Data, url: String? = nil, encoding: String.Encoding, option: ParseOption = kDefaultHtmlParseOption) throws -> HTMLDocument {
    if let htmlStr = String(data: html, encoding: encoding) {
        // The caller-supplied encoding decoded the data successfully.
        return try HTML(html: htmlStr, url: url, encoding: encoding, option: option)
    } else if let htmlStr = String(data: html, encoding: .ascii) {
        // Fallback added by this change: retry the decoding as ASCII.
        return try HTML(html: htmlStr, url: url, encoding: encoding, option: option)
    } else {
        // Neither decoding worked.
        throw ParseError.EncodingMismatch
    }
}

I created a PR for this issue.

tid-kijyun commented 2 years ago

It seems that the website at that URL uses a charset other than UTF-8. (The response header states that the charset is UTF-8, but that does not appear to be correct.)

I think you need to handle this case in your own codebase, not in the library (Kanna).
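One way to handle this on the caller's side might look like the sketch below. The helper name decodeHTML and its default list of encodings are hypothetical and not part of Kanna; the only Kanna call referenced is the String overload of HTML(html:url:encoding:option:) that appears earlier in this thread, shown in a comment because it needs the Kanna module.

```swift
import Foundation

/// Hypothetical helper (not part of Kanna): tries each encoding in order and
/// returns the first successful decoding together with the encoding used.
func decodeHTML(
    _ data: Data,
    tryingEncodings encodings: [String.Encoding] = [.utf8, .isoLatin1]
) -> (text: String, encoding: String.Encoding)? {
    for encoding in encodings {
        if let text = String(data: data, encoding: encoding) {
            return (text, encoding)
        }
    }
    return nil
}

// Made-up inputs: one valid UTF-8 buffer and one with a stray Latin-1 byte.
let validPage = Data("<p>café</p>".utf8)
let mislabeledPage = Data([0x3C, 0x70, 0x3E, 0xE9])

let first = decodeHTML(validPage)
let second = decodeHTML(mislabeledPage)
print(first?.encoding == .utf8, second?.encoding == .isoLatin1)

// With Kanna available, the decoded string could then be parsed, e.g.:
// if let decoded = decodeHTML(responseData) {
//     let doc = try HTML(html: decoded.text, encoding: .utf8)
// }
```

Which fallback encodings make sense depends on the sites being parsed; the list here is only a placeholder.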


You can confirm that the charset is not UTF-8 with the following steps:

  1. Download the page source:
     $ curl -L https://www.philstar.com/headlines/2021/11/26/2144018/philippines-intently-monitoring-new-covid-19-variant-detected-south-africa > ./source.html
  2. Check the charset:
     $ file --mime ./source.html
     ./source.html: text/html; charset=unknown-8bit

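A similar check can be sketched in Swift with the standard library's UTF-8 decoder. The function name countInvalidUTF8Sequences is hypothetical, and the sample bytes are made up; in practice the bytes could be the contents of the downloaded file.

```swift
import Foundation

/// Hypothetical helper: counts the decoding errors reported by the standard
/// library's UTF-8 decoder while walking the bytes.
func countInvalidUTF8Sequences<S: Sequence>(in bytes: S) -> Int where S.Element == UInt8 {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var errorCount = 0
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue:
            break                 // a valid scalar, keep going
        case .error:
            errorCount += 1       // an ill-formed sequence was found
        case .emptyInput:
            return errorCount     // reached the end of the input
        }
    }
}

// Made-up samples: valid UTF-8 versus a buffer ending in a stray 0xE9 byte.
let validBytes = Array("café".utf8)
let invalidBytes: [UInt8] = [0x63, 0x61, 0x66, 0xE9]

print(countInvalidUTF8Sequences(in: validBytes))
print(countInvalidUTF8Sequences(in: invalidBytes))
```

A result of zero means the bytes are well-formed UTF-8; anything above zero means the declared charset cannot be trusted for that payload.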
jaysonng commented 2 years ago

Got it.

Thanks for checking.

So the fix I opened a PR for won't be merged? Do I need to move the logic into my own library?

tid-kijyun commented 2 years ago

Yes, please move the logic into your code.

Thanks