postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Invalid decoded text #425

Open farmaan-appachhi opened 5 years ago

farmaan-appachhi commented 5 years ago

Expected Behavior

Parsed HTML should be properly encoded as per the original text

Current Behavior

The parsed HTML contains invalid text. This might be caused by a decoding issue.

Steps to Reproduce

Some Links: https://www.newyorker.com/culture/the-new-yorker-interview/daenerys-tells-all-game-of-thrones-finale-emilia-clarke-beyonce

Detailed Description

I want to fetch the HTML myself and pass it to the parser, instead of having the parser fetch it.

Possible Solution

After looking at the code, it seems the encoding is only checked against the HTML file in the browser case, i.e. only when the HTML is provided from the browser. Ideally the parser should decode the text correctly whether or not it is running in a browser.

farmaan-appachhi commented 5 years ago

Screenshot of the parsed HTML: [image]

farmaan-appachhi commented 5 years ago

Fixed it by passing the HTML as a UTF-8 Buffer instead of as a string (the README describes passing a string).

grigoriy-didorenko commented 5 years ago

Hi @farmaan-appachhi Could you please provide an example?

I am facing pretty much the same issue.

FarmaanElahi commented 5 years ago

For me, the problem occurred when I passed the local HTML as a string. Using a Buffer fixed the issue:

// `html` holds the page source as a string; wrap it in a Buffer before passing it in
Mercury.parse(url, {
    html: Buffer.from(html, "utf-8"),
    headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) " +
            "Chrome/60.0.3112.113 Safari/537.36"
    },
})

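
For HTML stored in a local file, the same workaround can skip the string step entirely by reading the file as raw bytes. The following is a minimal sketch, not part of the library: `parseLocalHtml` is a hypothetical helper, and `parser` is assumed to be any object exposing a Mercury-style `parse(url, options)` method that accepts an `html` option, as in the snippet above.

```javascript
const fs = require("fs");

// Hypothetical helper (not part of Mercury): reads a local HTML file as raw
// bytes and hands them to a Mercury-style parser.
// `parser` is assumed to expose parse(url, options), as in the snippet above.
function parseLocalHtml(parser, url, filePath) {
    // readFileSync without an encoding argument returns a Buffer, not a string
    const html = fs.readFileSync(filePath);
    // Return whatever the parser returns (a Promise, in Mercury's case)
    return parser.parse(url, { html });
}

// Possible usage, assuming Mercury has been loaded elsewhere:
// parseLocalHtml(Mercury, "https://example.com/article", "./article.html")
//     .then(result => console.log(result.title));
```

Reading the file without an encoding keeps the original bytes intact, so the decoding decision is left to the parser.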
grigoriy-didorenko commented 5 years ago

@farmaan-appachhi That saved my day (at least half), thank you

Could you provide the link where you found that?

FarmaanElahi commented 5 years ago

I tried debugging the code. It took me 5-6 hours to figure out the issue. If you look at the source code, it uses a Buffer when fetching the HTML, but locally supplied HTML is used as a plain string. That's how I figured it out.
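
Why a string might come out garbled where a Buffer does not can be illustrated with Node's `Buffer` alone. This is only a sketch of one plausible mechanism, and the thread does not confirm it is what happens inside the parser: if a decoder expects bytes and a string is coerced to bytes one code unit at a time (the "latin1"/"binary" interpretation), any character outside Latin-1 is truncated, and decoding the result as UTF-8 no longer yields the original text. The sample string is made up for the demonstration.

```javascript
// Sample text with non-ASCII characters (made up for this demonstration)
const original = "Daenerys\u2019s caf\u00e9";

// Path 1: encode the string as UTF-8 bytes, then decode as UTF-8.
const utf8Bytes = Buffer.from(original, "utf-8");
const roundTripped = utf8Bytes.toString("utf-8");

// Path 2: coerce the string to bytes one code unit per byte ("latin1"),
// then decode those bytes as UTF-8. Characters above U+00FF lose their
// high bits, and lone high bytes are not valid UTF-8 sequences.
const truncatedBytes = Buffer.from(original, "latin1");
const garbled = truncatedBytes.toString("utf-8");

console.log(roundTripped === original); // the UTF-8 path preserves the text
console.log(garbled === original);      // the byte-per-code-unit path does not
```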

csotiriou commented 5 years ago

I can verify that using Buffer works. This should have been mentioned in the README.

ttimasdf commented 3 years ago

+1, Buffer works. It seems that a string should not be passed in under any circumstances, or Mercury should detect the type and handle it specially.
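
The type detection suggested here could be done on the caller's side until the library handles it. A minimal sketch, assuming the input is either a string or a Buffer and that strings should be encoded as UTF-8; `toHtmlBuffer` is a hypothetical name, not a Mercury API.

```javascript
// Hypothetical helper (not part of Mercury): normalizes HTML input to a Buffer.
// Assumes string input should be encoded as UTF-8 unless told otherwise.
function toHtmlBuffer(html, encoding = "utf-8") {
    // Already bytes: pass through untouched
    if (Buffer.isBuffer(html)) return html;
    // String: encode explicitly so the bytes are well defined
    if (typeof html === "string") return Buffer.from(html, encoding);
    throw new TypeError("html must be a string or a Buffer");
}

// Possible usage with the call shown earlier in this thread:
// Mercury.parse(url, { html: toHtmlBuffer(html) });
```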