keepcosmos / readability

Readability is an Elixir library for extracting and curating articles.
Apache License 2.0

Bug when extracting article from HTML #28

Closed edevil closed 10 months ago

edevil commented 7 years ago

Example URL: "http://www.techhive.com/article/3158435/home-tech/43-off-tp-link-smart-led-wi-fi-light-bulb-dimmable-and-alexa-compatible-deal-alert.html#tk.rss_smartappliance"

Can summarize it:

iex(6)> Readability.summarize("http://www.techhive.com/article/3158435/home-tech/43-off-tp-link-smart-led-wi-fi-light-bulb-dimmable-and-alexa-compatible-deal-alert.html#tk.rss_smartappliance")
%Readability.Summary{article_html: "<div><div id=\"drr-container\"><p>TP-Link has discounted its 50W smart bulb <a href=\"https://www.amazon.com/TP-Link-Dimmable-Equivalent-Amazon-LB100/dp/B01HXM8XF6?psc=1&SubscriptionId=AKIAIRZJHSP2SKQIWVZA&tag=techconnect00-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B01HXM8XF6\" rel=\"nofollow\">43% to just $19.99</a>. Use the Kasa app to turn on/off or dim from anywhere in the world. Set up a schedule, set the mood, and even control with your voice via an Alexa-enabled device such as Echo or Dot. Reviewers rate 4 out of 5 stars (see reviews) on Amazon, where you can get yourself one (or more) for just $20, a good deal considering it typically lists north of $20 and sometimes $30 with various online retailers. <a href=\"https://www.amazon.com/TP-Link-Dimmable-Equivalent-Amazon-LB100/dp/B01HXM8XF6?psc=1&SubscriptionId=AKIAIRZJHSP2SKQIWVZA&tag=techconnect00-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B01HXM8XF6\" rel=\"nofollow\">See the discounted TP-Link smart LED bulb on Amazon</a>.</p><p>This story, \"43% off TP-Link Smart LED Wi-Fi Light Bulb, 50W Dimmable and Alexa Compatible - Deal Alert\" was originally published by\n<span><span><a href=\"http://www.techconnect.com/\" rel=\"nofollow\">TechConnect</a></span></span>.</p><div><div id=\"\">To comment on this article and other TechHive content, visit our <a href=\"https://www.facebook.com/techhivemedia/\">Facebook</a> page or our <a href=\"https://twitter.com/techhive\">Twitter</a> feed. </div></div></div></div>",
 article_text: "TP-Link has discounted its 50W smart bulb 43% to just $19.99. Use the Kasa app to turn on/off or dim from anywhere in the world. Set up a schedule, set the mood, and even control with your voice via an Alexa-enabled device such as Echo or Dot. Reviewers rate 4 out of 5 stars (see reviews) on Amazon, where you can get yourself one (or more) for just $20, a good deal considering it typically lists north of $20 and sometimes $30 with various online retailers. See the discounted TP-Link smart LED bulb on Amazon.\nThis story, \"43% off TP-Link Smart LED Wi-Fi Light Bulb, 50W Dimmable and Alexa Compatible - Deal Alert\" was originally published by\nTechConnect.\nTo comment on this article and other TechHive content, visit our Facebook page or our Twitter feed.",
 authors: ["DealPost Team"],
 title: "43% off TP-Link Smart LED Wi-Fi Light Bulb, 50W Dimmable and Alexa Compatible - Deal Alert"}

Extracting article directly from HTML fails:

iex(6)> Readability.article(HTTPoison.get!("http://www.techhive.com/article/3158435/home-tech/43-off-tp-link-smart-led-wi-fi-light-bulb-dimmable-and-alexa-compatible-deal-alert.html#tk.rss_smartappliance").body)
** (FunctionClauseError) no function clause matching in Floki.HTMLTree.build/1
          (floki) lib/floki/html_tree.ex:14: Floki.HTMLTree.build(nil)
          (floki) lib/floki/finder.ex:48: Floki.Finder.find_selectors/2
          (floki) lib/floki/filter_out.ex:17: Floki.FilterOut.filter_out/2
          (floki) lib/floki.ex:210: Floki.text/2
    (readability) lib/readability/helper.ex:75: Readability.Helper.text_length/1
    (readability) lib/readability/article_builder.ex:32: Readability.ArticleBuilder.build/2

This happens because, in the second case, the "clean_conditionally: true" option is passed to the Sanitizer. It is a bit surprising that the two calls use different options; nevertheless, neither should crash.
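
For illustration, the crash in the trace above comes from a nil reaching the text-length calculation. The following is a minimal sketch of a nil-tolerant text length over a Floki-style node tree. The module name SafeTextLength is hypothetical, it is not part of Readability or Floki, and it is not necessarily how the library fixed the problem; it only assumes nodes shaped like {tag, attributes, children} tuples.

```elixir
defmodule SafeTextLength do
  # Hypothetical helper for illustration only; not Readability's or Floki's API.
  # Assumes element nodes are {tag, attributes, children} tuples and text nodes are binaries.

  # A missing node contributes no text instead of raising FunctionClauseError.
  def text_length(nil), do: 0

  # Text node: count characters after trimming surrounding whitespace.
  def text_length(text) when is_binary(text) do
    text |> String.trim() |> String.length()
  end

  # List of nodes: sum the lengths of all members.
  def text_length(nodes) when is_list(nodes) do
    nodes |> Enum.map(&text_length/1) |> Enum.sum()
  end

  # Element node: recurse into its children.
  def text_length({_tag, _attrs, children}), do: text_length(children)

  # Anything else (for example a comment node) is ignored.
  def text_length(_other), do: 0
end
```

With this shape, SafeTextLength.text_length(nil) returns 0 rather than crashing, so a caller that conditionally cleans nodes can still compare lengths.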

iver commented 2 years ago

The link is no longer available. Do you have another example?

iex> Readability.summarize("http://www.techhive.com/article/3158435/home-tech/43-off-tp-link-smart-led-wi-fi-light-bulb-dimmable-and-alexa-compatible-deal-alert.html#tk.rss_smartappliance")
%Readability.Summary{
  title: "301 Moved Permanently",
  authors: nil,
  article_html: "<div></div>",
  article_text: ""
}

Valian commented 10 months ago

It's because the options are different: Readability.article uses the default options, while Readability.summarize does not. I think it's worth unifying them.
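
One way the unification could look, as a rough sketch: both entry points would pass their options through the same merge step so that caller-supplied values override a single set of defaults. The module OptionsSketch and its default list are assumptions made for illustration; only clean_conditionally is named in this thread, and the library's real defaults may differ.

```elixir
defmodule OptionsSketch do
  # Hypothetical defaults; only :clean_conditionally is mentioned in this issue.
  @default_options [clean_conditionally: true]

  # Merge caller options over the shared defaults.
  # Keyword.merge/2 gives precedence to the second list.
  def with_defaults(opts \\ []) do
    Keyword.merge(@default_options, opts)
  end
end
```

If both article extraction and summarization called a helper like with_defaults/1 before building the article, the same input would be processed with the same options unless the caller overrode them explicitly.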

Nevertheless, the original bug shouldn't happen anymore, because we now use a different helper for calculating text_length (#53).

I'll close this issue in favor of #57