microlinkhq / open

Parsing articles from The Guardian is broken #36

Closed jamessharp closed 4 years ago

jamessharp commented 4 years ago

Bug Report

Current Behavior

Use microlink to parse an article from The Guardian, e.g. https://www.theguardian.com/education/2020/jun/05/tell-us-about-your-young-childs-experiences-of-going-back-to-school

The title parses as "Thttps://www.theguardian.com/education/2020/jun/05/ell us about your young child’s https://www.theguardian.com/education/2020/jun/05/exphttps://www.theguardian.com/education/2020/jun/05/erihttps://www.theguardian.com/education/2020/jun/05/enchttps://www.theguardian.com/education/2020/jun/05/es of going back to school". The description and publisher are similarly disfigured. This happens with the Node.js SDK and also when you paste the article URL into the microlink web demo.
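A quick way to spot this symptom in any parsed field is to look for the directory part of the page URL inside the text. The sketch below is only an illustration: `hasInjectedUrl` is a hypothetical helper, not part of the microlink SDK, and the broken title is shortened to its first few words.

```typescript
// Hypothetical helper (not part of the microlink SDK).
// Returns true when a parsed metadata field contains the directory
// part of the page URL, which is the symptom described in this report.
function hasInjectedUrl(field: string, pageUrl: string): boolean {
  // Everything up to and including the last "/" of the page URL,
  // e.g. "https://www.theguardian.com/education/2020/jun/05/".
  const base = pageUrl.slice(0, pageUrl.lastIndexOf("/") + 1);
  return base.length > 0 && field.indexOf(base) !== -1;
}

const reportedUrl =
  "https://www.theguardian.com/education/2020/jun/05/tell-us-about-your-young-childs-experiences-of-going-back-to-school";

// Start of the broken title from this report (shortened for readability).
const brokenTitle =
  "Thttps://www.theguardian.com/education/2020/jun/05/ell us about your young child’s";

// The title with the injected fragments removed.
const cleanTitle =
  "Tell us about your young child’s experiences of going back to school";

console.log(hasInjectedUrl(brokenTitle, reportedUrl)); // true
console.log(hasInjectedUrl(cleanTitle, reportedUrl)); // false
```

The same check can be run over the description and publisher fields, since they were affected in the same way.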

Expected behavior/code

The parsed fields should not have fragments of the page URL injected into them.

Anything else

This started some time on 2nd June, I think.

Kikobeats commented 4 years ago

Thanks for reporting, the bug should be resolved 🙂

https://api.microlink.io/?url=https://www.theguardian.com/education/2020/jun/05/tell-us-about-your-young-childs-experiences-of-going-back-to-school
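To re-check the fix from a script, the same request URL can be built programmatically. This is a minimal sketch that assumes only the endpoint and `url` query parameter visible in the link above; `microlinkRequestUrl` is a hypothetical helper name. Percent-encoding the target is the safer choice when the article URL carries its own query string.

```typescript
// Hypothetical helper: builds a request URL for the endpoint shown above.
// Only the "url" query parameter from that link is assumed here.
function microlinkRequestUrl(target: string): string {
  // Percent-encode the target so that any "?" or "&" inside it is not
  // mistaken for part of the API's own query string.
  return "https://api.microlink.io/?url=" + encodeURIComponent(target);
}

const targetUrl =
  "https://www.theguardian.com/education/2020/jun/05/tell-us-about-your-young-childs-experiences-of-going-back-to-school";

const requestUrl = microlinkRequestUrl(targetUrl);
console.log(requestUrl);
```

The resulting string can be passed to any HTTP client, and the title, description, and publisher in the response can then be compared against the article.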

Specifically, the problem was in how the HTML markup was sanitized before being processed.

This is the commit that introduces the fix: https://github.com/microlinkhq/html-get/commit/a9aecd6d5ef7e204b2fd19821dd25b20ff1a20a7

jamessharp commented 4 years ago

👍 Thanks for the speedy resolution!