j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Setting prefetched content breaks after utf8 conversion #335

Open kolaente opened 1 year ago

kolaente commented 1 year ago

Prefetching content and then passing it to Graby with setContentAsPrefetched corrupts that content once it is converted to UTF-8. I suspect this is because no response headers are available for prefetched content.

I observed this when parsing LinkedIn posts. For example, this one results in:

🔒� Apple has joined the chorus of voices warning about the potential risks posed by the #OnlineSafetyBill to end-to-end encryption. 💡Protecting our privacy and security is crucial.

(partial response for clarity)

The same part in the response I prefetched looks like this:

🔒️ Apple has joined the chorus of voices warning about the potential risks posed by the #OnlineSafetyBill to end-to-end encryption.\n\n💡Protecting our privacy and security is crucial.

Not prefetching the content and letting Graby fetch it instead does not mangle the emojis. Unfortunately, I need to use prefetched content because that lets me test my code (I use Laravel's Http::get facade, which is mockable, whereas Graby's internal response is not).

j0k3r commented 1 year ago

So, maybe we shouldn't convert the response to UTF-8 if it comes from prefetched content?

kolaente commented 1 year ago

Or provide a flag to either pass response headers to Graby or tell it not to convert the response.
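
To make the two proposals concrete, here is a minimal, language-neutral sketch in Python. It is not Graby's code or API: the names to_utf8, declared_charset, and skip_conversion are hypothetical, and the ISO-8859-1 fallback is an assumption chosen only for illustration. The idea is that the caller can either hand over the response headers, so the declared charset is used, or ask for the body to be left untouched.

```python
import re
from typing import Mapping, Optional

# Hypothetical helper names; none of this is part of Graby.
_CHARSET_RE = re.compile(r"charset=[\"']?([\w.:-]+)", re.IGNORECASE)


def declared_charset(headers: Optional[Mapping[str, str]]) -> Optional[str]:
    """Return the charset named in a Content-Type header, if any."""
    if not headers:
        return None
    for name, value in headers.items():
        if name.lower() == "content-type":
            match = _CHARSET_RE.search(value)
            if match:
                return match.group(1)
    return None


def to_utf8(
    body: bytes,
    headers: Optional[Mapping[str, str]] = None,
    skip_conversion: bool = False,
) -> bytes:
    """Convert body to UTF-8 unless the caller opts out."""
    if skip_conversion:
        # Proposal 1: trust the prefetched bytes as they are.
        return body

    charset = declared_charset(headers)
    if charset is None:
        # No headers supplied: keep the bytes if they already are valid UTF-8.
        try:
            body.decode("utf-8")
            return body
        except UnicodeDecodeError:
            charset = "iso-8859-1"  # assumed fallback, for illustration only

    # Proposal 2: headers were passed along, so honour the declared charset.
    return body.decode(charset, errors="replace").encode("utf-8")
```

With this shape, a prefetched UTF-8 body containing emojis passes through unchanged whether or not headers are supplied, while a body with a declared legacy charset is still converted.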