keepcosmos / readability

Readability is an Elixir library for extracting and curating articles.
Apache License 2.0
252 stars, 58 forks

Comments section with many URLs is mistaken for the article text #63

Open vkryukov opened 4 days ago

vkryukov commented 4 days ago
url = "https://raw.githubusercontent.com/mozilla/readability/refs/heads/main/test/test-pages/archive-of-our-own/source.html"
{:ok, %HTTPoison.Response{status_code: 200, body: html}} = HTTPoison.get(url)
html |> Readability.article() |> Readability.readable_html()

returns

"<div><div id=\"kudos\"><p><a href=\"/users/KuroKitsuneNoYoko\">KuroKitsuneNoYoko</a>, <a href=\"/users/Senpakuxkira\">Senpakuxkira</a>, <a href=\"/users/jennyjenka\">jennyjenka</a>, <a href=\"/users/nicarose24\">nicarose24</a>, <a href=\"/users/Devilspawn2002\">Devilspawn2002</a>, <a href=\"/users/fixatro\">fixatro</a>, <a href=\"/users/FeltLikeWritingAndHereIAm\">FeltLikeWritingAndHereIAm</a>, <a href=\"/users/ShareBearNat\">ShareBearNat</a>, <a href=\"/users/Revolvers_and_violets\">Revolvers_and_violets</a>, <a href=\"/users/girl_with_a_sword\">girl_with_a_sword</a>, <a href=\"/users/PerrinOfApples\">PerrinOfApples</a>, <a href=\"/users/Sky_King\">Sky_King</a>, <a href=\"/users/gREat_unreST\">gREat_unreST</a>, <a href=\"/users/Sayriel\">Sayriel</a>, <a href=\"/users/SpaceJonah\">SpaceJonah</a>, <a href=\"/users/Sofiacasdfg\">Sofiacasdfg</a>, <a href=\"/users/thelackvoid\">thelackvoid</a>, <a href=\"/users/WormhuskCrown\">WormhuskCrown</a>, <a href=\"/users/Juiji\">Juiji</a>, <a href=\"/users/Graykip\">Graykip</a>, <a href=\"/users/Lunatsuki\">Lunatsuki</a>, <a href=\"/users/KaitoKitsune\">KaitoKitsune</a>, <a href=\"/users/ShadowYonni\">ShadowYonni</a>, <a href=\"/users/A_regrettable_choice_of_words\">A_regrettable_choice_of_words</a>, <a href=\"/users/Atriel\">Atriel</a>, <a href=\"/users/Kiki_Inu_Page\">Kiki_Inu_Page</a>, <a href=\"/users/Athi816\">Athi816</a>, <a href=\"/users/Andrea_Victoria\">Andrea_Victoria</a>, <a href=\"/users/LeafoftheFox\">LeafoftheFox</a>, <a href=\"/users/Iateyourcookies\">Iateyourcookies</a>, <a href=\"/users/AirQuotes\">AirQuotes</a>, <a href=\"/users/octaviaxanadu\">octaviaxanadu</a>, <a href=\"/users/DemiwitchWinchester\">DemiwitchWinchester</a>, <a href=\"/users/November_Clouds\">November_Clouds</a>, <a href=\"/users/LittleNovaStar\">LittleNovaStar</a>, <a href=\"/users/Dolema\">Dolema</a>, <a href=\"/users/yuki_27\">yuki_27</a>, <a href=\"/users/TS4Life\">TS4Life</a>, <a href=\"/users/prince_doomed\">prince_doomed</a>, <a 
href=\"/users/MemeticWarfare\">MemeticWarfare</a>, <a href=\"/users/Lethe_Rem\">Lethe_Rem</a>, <a href=\"/users/improbablyamartian\">improbablyamartian</a>, <a href=\"/users/charm13insomnia\">charm13insomnia</a>, <a href=\"/users/Whoevenisshe\">Whoevenisshe</a>, <a href=\"/users/Scaledraws\">Scaledraws</a>, <a href=\"/users/Dawwnni\">Dawwnni</a>, <a href=\"/users/loganesque\">loganesque</a>, <a href=\"/users/lil_uno\">lil_uno</a>, <a href=\"/users/kusibi21\">kusibi21</a>, <a href=\"/users/derbelisca\">derbelisca</a>, <a id=\"kudos_summary\" href=\"/works/11808918/kudos\">and 3711 more users</a><span><a href=\"/users/Aquamarine_Ocean\">Aquamarine_Ocean</a>, <a href=\"/users/Noone_is_Perfect_Im_noone\">Noone_is_Perfect_Im_noone</a>, <a href=\"/users/FireInYourDreams\">FireInYourDreams</a>, <a href=\"/users/JaceTheAce34\">JaceTheAce34</a>, <a href=\"/users/StallingforGreatness\">StallingforGreatness</a>, <a href=\"/users/flushed_flue\">flushed_flue</a>, <a href=\"/users/Kriegswolf\">Kriegswolf</a>, <a href=\"/users/blue_birb\">blue_birb</a>, <a href=\"/users/Krakenonkrack437\">Krakenonkrack437</a>, <a href=\"/users/Goblin17\">Goblin17</a>, <a href=\"/users/alchemicalApocalypse\">alchemicalApocalypse</a>, <a href=\"/users/tunafishprincess\">tunafishprincess</a>, <a href=\"/users/loving_1D_louiszaynniallharryliam\">loving_1D_louiszaynniallharryliam</a>, <a href=\"/users/Briknanana\">Briknanana</a>, <a href=\"/users/Jocelyn523\">Jocelyn523</a>, <a href=\"/users/ImaDMS\">ImaDMS</a>, <a href=\"/users/Ahriel_sinalas\">Ahriel_sinalas</a>, <a href=\"/users/Moonyshewolf\">Moonyshewolf</a>, <a href=\"/users/original_name_that_i_thought_of\">original_name_that_i_thought_of</a>, <a href=\"/users/Silkesukkermaas\">Silkesukkermaas</a>, <a href=\"/users/1existential_melon\">1existential_melon</a>, <a href=\"/users/Embrexial\">Embrexial</a>, <a href=\"/users/sophiekwat\">sophiekwat</a>, <a href=\"/users/wisdoms_daughter\">wisdoms_daughter</a>, <a 
href=\"/users/myabug01\">myabug01</a>, <a href=\"/users/TsukiNona\">TsukiNona</a>, <a href=\"/users/miraculousemily47\">miraculousemily47</a>, <a href=\"/users/TheLivingParadox\">TheLivingParadox</a>, <a href=\"/users/white_goddess\">white_goddess</a>, <a href=\"/users/heartofthetardis\">heartofthetardis</" <> ...

and not the text of the article.

Valian commented 3 days ago

This one's tricky as it impacts the internal heuristic of deciding which part of the page is an article. Solving this case might break some others... 🤔

It would be best to have a good test suite (#61) for this.
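
For context on the heuristic mentioned above, here is a minimal sketch of a link-density measure of the kind Readability.js uses to penalize link-heavy blocks. It is an illustration only: the module name `LinkDensitySketch` and its functions are hypothetical, not this library's code, and the thread does not confirm how this library scores candidates. Nodes are assumed to have the Floki-style shape `{tag, attributes, children}`, with text as plain strings.

```elixir
defmodule LinkDensitySketch do
  # Hypothetical illustration; not the Readability library's actual implementation.

  # Number of text characters under a node, a list of nodes, or a text string.
  def text_length(text) when is_binary(text), do: String.length(text)
  def text_length({_tag, _attrs, children}), do: text_length(children)
  def text_length(nodes) when is_list(nodes), do: nodes |> Enum.map(&text_length/1) |> Enum.sum()
  # Anything else (for example comment nodes) contributes no text.
  def text_length(_other), do: 0

  # Number of text characters that sit inside <a> elements.
  def link_text_length({"a", _attrs, children}), do: text_length(children)
  def link_text_length({_tag, _attrs, children}), do: link_text_length(children)
  def link_text_length(nodes) when is_list(nodes), do: nodes |> Enum.map(&link_text_length/1) |> Enum.sum()
  # Bare text and unknown nodes are not link text.
  def link_text_length(_other), do: 0

  # Fraction of a node's text that is link text; 0.0 for nodes without text.
  def link_density(node) do
    case text_length(node) do
      0 -> 0.0
      total -> link_text_length(node) / total
    end
  end
end

# A tiny stand-in for the kudos block from the report: almost all text is links.
kudos =
  {"div", [{"id", "kudos"}],
   [{"p", [], [{"a", [{"href", "/users/Juiji"}], ["Juiji"]}, ", ", {"a", [{"href", "/users/Graykip"}], ["Graykip"]}]}]}

# Ordinary prose with a single short link.
prose = {"p", [], ["Some article text with one ", {"a", [], ["link"]}, "."]}

IO.inspect(LinkDensitySketch.link_density(kudos), label: "kudos")
IO.inspect(LinkDensitySketch.link_density(prose), label: "prose")
```

A block like the kudos list scores close to 1.0 here, while prose stays near 0. A scorer that weights candidates by `1 - link_density` would therefore demote such a block, though any threshold or weighting would need to be checked against other pages, as noted above.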

vkryukov commented 3 days ago

Agreed, this is just a placeholder until I can think of a good test case for this (maybe just repurpose the Readability.js test case).