mozilla / readability

A standalone version of the readability lib

Readability's .parse() hangs when using deno-dom against the bbc homepage #716

Closed ralyodio closed 2 years ago

ralyodio commented 2 years ago

Trying to parse the body from a BBC link:

https://bbc.co.uk

 37     try {
 38       console.log('parsing article');
 39       //article = new Readability(doc.window.document).parse();
 40       article = new Readability(doc).parse();
 41       console.log('articled parsed');
 42       console.log(article);
 43     } catch(err) {
 44       console.error(err);
 45     }

It never reaches line 41 with the BBC HTML.

I'm using deno-dom in Deno.

gijsk commented 2 years ago

Off-hand that sounds like an issue with deno-dom, given that it works if you just run this in the console in a web browser after pasting in all of Readability.

This is likely going to need more detailed steps to reproduce in order to be actionable (ie from scratch, how are you setting this up; how are you fetching the document, how are you parsing it, and even where you're located might matter because the BBC makes pretty significant changes to their site depending on where you are, probably based on geoip). A minimal testcase would be even better.

chovyprognos commented 2 years ago

import Readability from "https://cdn.esm.sh/v57/moz-readability@0.2.1/es2021/moz-readability.js"
import { DOMParser } from "https://raw.githubusercontent.com/b-fuze/deno-dom/master/deno-dom-wasm.ts"

const url = 'https://www.bbc.com/news/science-environment-59210395';
console.log(url);

const res = await fetch(url, {
    headers: {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0'
    }
});
// Read the body once; calling res.text() a second time on the same response throws.
const body = await res.text();
if (!res.ok) {
    console.error(res.status, body);
}
const doc = new DOMParser().parseFromString(body, "text/html");
let article;

try {
    console.log('parsing article');
    //article = new Readability(doc.window.document).parse();
    article = new Readability(doc).parse();
    console.log('articled parsed');
    console.log(article);
} catch(err) {
    console.error(err);
}

console.log(url, body, article);

$ deno run --allow-all --unstable app.ts

ralyodio commented 2 years ago

I’m hitting it from USA

ralyodio commented 2 years ago

This page returns content, but the only thing that gets parsed is the GDPR cookie banner.

https://www.ft.com/content/2bbedfde-c9da-4bc6-b2e6-af0698240851

ralyodio commented 2 years ago

Short of handling each site individually, is there a way to just remove the top-layer div (the one with the highest z-index) from the DOM?

That might get rid of the GDPR garbage divs.

By the way, deno-dom fixed the issue on their side that was causing this library to hang, so that's fixed. I just need a way to remove overlays for GDPR and other crap.

ralyodio commented 2 years ago

I found this code in a Google Chrome extension, which basically adds custom CSS to hide cookie banners:

var html = [
        // This is causing two scrollbars on some sites. Need to remove and find another way to re-enable scrolling with full-page modals.
        "body, html { overflow: unset !important }",
        "div[id *= 'cookie'] { display: none !important }",
        "div[id *= 'Cookie'] { display: none !important }",
        "div[class *= 'Cookie'] { display: none !important }",
        "div[class *= 'cookie'] { display: none !important }",
        "div[id *= 'CCPA'] { display: none !important }",
        "div[class *= 'CCPA'] { display: none !important }",
        ".preloaded_lightbox {display: none !important}",
        ".omaha-background {display: none !important}",
        "#hs-eu-cookie-confirmation { display: none !important }",
        "#onesignal-popover-container { display: none !important }",
        ".c-banner-advert-sticky { display: none !important }",
        "w-div { display: none !important }",
        ".mfg-bg.mfp-ready { display: none !important }",
        ".mfg-wrap.mfp-ready { display: none !important }",
        ".ig_action_bar.ig_container { display: none !important }",
        "div[role='dialog'][aria-label='cookieconsent'].cc-window { display: none !important }",
        ".mkt-at-toaster { display: none !important }",
        ".subscription-toaster--hidden, .subscription-toaster { display: none !important }",
        ".bbccookies-banner.orb-banner-wrapper.bbccookies-d { display: none !important }",
        ".md-modal2.flex { display: none !important }",
        ".modal-backdrop.in { display: none !important }",
        ".modal.email-submission-modal { display: none !important }",
        "#rt-roadblock-modal.rt-roadblock-modal modal-dialog in { display: none !important }",
        "#webmdHoverOverlay, #webmdHoverWrapper { display: none !important }",
        ".alert.ad-blocker { display: none !important }",
        "#signup-bar.fixed.bottom { display: none !important }",
        ".pushcrew-chrome-style-notification, .pushcrew-chrome-style-notification-safari { display: none !important }",
        ".HB-Slider.hb-animated { display: none !important }",
        ".mcwidget-overlay { display: none !important }",
        "section.newsletter-signup { display: none !important }",
        "div#intentPreview { display: none !important }",
        "#intentOpacityDiv { display: none !important }",
        "#wzrk_wrapper { display: none !important }",
        "#quick-signup { display: none !important }",
        ".articles-navigation { display: block !important }",
        ".klaviyo-form { display: none !important }",
        "section#welcome-modal.welcome-modal { display: none !important }",
        "cnx.cnx-video-container { display: none !important }", // autoplaying video on washingtonexaminer.com
        ".push-body.nao-assinante { display: none !important }", // https://www.estadao.com.br/
        ".ablock  { display: none !important }", // https://www.phoenixnewtimes.com/news/uber-car-in-autonomous-mode-when-it-killed-woman-police-say-10247590
        ".all-body  { filter: initial !important }", // https://www.phoenixnewtimes.com/news/uber-car-in-autonomous-mode-when-it-killed-woman-police-say-10247590
        ".a-modal-scroller.a-declarative { filter: initial !important }", // amazon full-page modal "Donate to smile charity"
        "div[data-video-player=container] { display: none !important }", // autoplaying video on cnet.com
        ".fb_lightbox-overlay  { display: none !important }",
        "#provely-widget { display: none !important }", // https://www.toonly.com/
        "div.profile-top-bar-container-searchForm { display: none !important }", // mylife.com
        "div.relatedBook { display: none !important }", // http://www.informit.com/articles/article.aspx?p=18225&seqNum=2
        "div.-gtqce-onaare { display: none !important }", //  sportingnews.com cookie banner
        "iframe[allow='autoplay; fullscreen'][x-enc='src'] { display: none !important }", //  sportingnews.com <-- autoplaying video
        "div.s2nPlayerFrame { display: none !important }", // https://smallbusiness.chron.com/open-new-tab-clicking-google-chrome-63612.html
        "iframe#um_ultimedia_wrapper_iframeUltimedia  { display: none !important }", // denofgeek.com autoplaying video
        "div.sticky-video  { display: none !important }", // https://www.independent.co.uk/news/science/biological-clock-ageing-turn-back-reverse-study-new-a9094261.html
        ".f-NJD1R8-1- { display: none !important }",
        ".tve-leads-lightbox { display: none !important }", // mouse out of page:   https://www.rubyguides.com/2019/07/ruby-string-concatenation/
        ".mailmunch-popover, mailmunch-overlay { display: none !important }", // mouse out of page:  http://inmyownterms.com/take-note-languages-codes-versus-country-codes/
        ".email-capture-popup-wrapper { display: none !important }", // email signup after 1 minute:  https://www.hunker.com/13400974/how-much-do-bath-fitters-cost
        ".jwplayer__container--sticky { display: none !important }", // hunker.com -- self-playing video
        "#emailSignUpDialog { display: none !important }", // lowes.com newsletter signup
        ".pum-overlay { display: none !important }", // https://daily.jstor.org/the-rise-and-fall-of-the-pet-bird/
        "div[class *= 'sp_message_'], div[id *= 'sp_message_'] { display: none !important }", // denofgeek.com/uk
        "div[class *= 'sp_veil'] { display: none !important }", // denofgeek.com/uk
        "div#iubenda-cs-banner { display: none !important }", // http://www.finsmes.com/2019/10/edify-labs-secures-10m-in-seed-funding.html
        ".boxzilla-container { display: none !important }", // https://thebaffler.com/latest/corrupted-headspace-semley
        "._evidon_banner { display: none !important }", // https://www.1843magazine.com/food-drink/ikejime-a-humane-way-to-kill-fish-that-makes-them-tastier
        "div#paywall-banner { display: none !important }", // https://www.bloomberg.com/quote/APPL:US
        ".sa-soft-rb-v1 { display: none !important }", // https://seekingalpha.com/news/3505376-pg-and-e-rejects-san-franciscos-2_5b-asset-purchase-offer
        "#CybotCookiebotDialog { display: none !important }", // https://www.fanatical.com/en/
        ".a-declarative { display: none !important }", // amazon popup
        ".login-modal-div { display: none !important }", // https://www.geeksforgeeks.org/css-attribute-selector/
        ".ModalContainer_widget_bannerVisible { display: none !important }", // https://www.teslarati.com/spacex-prepared-first-dual-falcon-9-fairing-catch/
        "div#email-capture { display: none !important }", // https://www.techwalla.com/articles/how-to-set-up-wireless-printing-with-a-samsung-clp-315w
        "div[data-hypernova-key='popup_block']  { display: none !important }", // wayfair.com
        "div.exit_popup { display: none !important }", // when mouse leaves page, newsletter signup popup:  https://severalnines.com/blog/how-perform-online-schema-changes-mysql-using-gh-ost
        "div#consent_blackbar { display: none !important }", // https://www.redhat.com/en/about/press-releases/ibm-closes-landmark-acquisition-red-hat-34-billion-defines-open-hybrid-cloud-future
        "div#sitewide-lightbox-container { display: none !important }", // newsletter signup overstock.com
        "div.Campaign.CampaignType--popup { display: none !important }", // mouseout popup https://www.guru99.com/oltp-vs-olap.html
        "div#adtoniq-msgr-bar { display: none !important }", // https://dzone.com/articles/what-is-viewencapsulation-in-angular
        "div.fancybox-overlay { display: none !important }", // macworld.com fullscreen popup
        ".fancybox-lock { overflow: auto !important }", // overwrite CSS that turns off scrolling for FancyBox
        ".met-flyout  { display: none !important }", // latimes.com
        "div[id *= 'simplemodal'], div.simplemodal-container { display: none !important }", // developers.meethue.com newsletter signup modal
        ".mpp-container { display: none !important }", // roadtovr.com newsletter signup
        "div#WhitelistOverlayModalBackground { display: none !important }", // kbb.com ad blocker modal
        "div.wrap div.mh-message-bar { display: none !important }", // spotify.com/it cookie banner
        "div.klaviyo-form.needsclick { display: none !important }", // zocalofoods.com/cart newsletter popup
        "div.dailymotion-cpe { display: none !important }", // video autoplayer on chichester.co.uk/news
        "iframe.syndicated-modal { display: none !important }", // https://www.propublica.org/article/the-irs-decided-to-get-tough-against-microsoft-microsoft-got-tougher
        ".nytc---modal-window---windowContainer { display: none !important }", // https://cooking.nytimes.com/recipes/9465-the-250-cookie-recipe
        "#_evidon-barrier-wrapper { display: none !important }", // cookie banner https://www.cosmopolitan.com/uk/reports/a32050083/rubber-gloves-coronavirus-spreading-tiktok/
        ".tp-modal, .tp-backdrop.tp-active { display: none !important }", // account signup thedailybest.com
        ".tp-modal-open { overflow: auto !important }", // turn on the page scrolling for TP-Modal pages (thedailybest.com)
        ".jumpstart-sticky-active { position: static !important }", // video autoplayer https://www.myrecipes.com/recipe/baked-italian-style-cauliflower
        ".cc_banner.cc_container { display: none !important }", // cookiie banner https://thelittlepotcompany.co.uk/blogs/pottery/making-your-own-pottery-glaze
        ".fc-ab-root { display: none !important }", // disable ad blocker http://www.foxnews.com
        ".sok-browser-consent-modal { display: none !important }", // cookie banner https://yhteishyva.fi/reseptit/suklaa-mascarponekakku/recipe-5482
        "#onetrust-consent-sdk { display: none !important }", // privacy popup modal https://www.cnn.com
        "#ins-frameless-overlay, .ins-preview-wrapper { display: none !important }", // full page modal to turn on notifications? super stupid  https://english.alarabiya.net/en/features/2019/10/02/Harassment-and-imprisonment-Life-as-persecuted-Christian-in-Iran.html
        ".soundest-form-image-left-overlay, .soundest-form-image-left { display: none !important }", // newsletter signup https://promarinesupplies.com/blogs/blog/tips-for-tinting-epoxy-resin
        "#consent-banner { display: none !important } ", // cookie banner https://www.meillakotona.fi/reseptit/meheva-suklaakakku
        ".modal-overlay.nlPopup.active { display: none !important }", // delayed newsletter signup timesofisrael.com
        "#abd-banner { display: none !important }", // disable ad blocker macrotrends.net
        ".leadinModal { display: none !important }", // mouseout banner https://blog.hubspot.com/marketing/best-email-subject-lines-list
        ".seo_lightbox.entered.user_dismiss { display: none !important }", // mouseout banner scribd.com
        ".periodic-modal-popup { display: none !important }", // newsletter signup adoptapet.com
        "#mailing-list-popup { display: none !important }", // newsletter signup https://jacobinmag.com/2020/04/ecuador-lenin-moreno-coronavirus-rafael-correa-health-care
        "#mailchimp-top-bar { display: none !important }", //  newsletter signup http://www.finsmes.com/2019/10/edify-labs-secures-10m-in-seed-funding.html
        "div.mfp-bg { display: none !important }", // https://www.zyciepabianic.pl/styl-zycia/jak-dlugo-idzie-przesylka-z-aliexpress.html
        "aside#cookie-warn { display: none !important }", // https://www.zyciepabianic.pl/styl-zycia/jak-dlugo-idzie-przesylka-z-aliexpress.html
        "div#tc_priv_CustomOverlay { display: none !important }", // cookie banner https://ovh.co.uk
        ".ctct-popup-wrapper { display: none !important }", // newsletter signup http://www.lyndhurst-oh.com/departments.html
        ".c-article-meter-banner { display: none !important }", // signup banner https://thespec.com
        "#webmdHoverOverlay { display: none !important }", // newsletter signup https://webmd.com
        "ul.notices.notices--bottom_fixer.js-notices { display: none !important }", // cookie banner https://forums.terraria.org/index.php
        ".cookie-bar { display: none !important }", // cookie banner https://www.flashpoint-intel.com/blog/iraq-threat-update-june-2020/
        ".fbs-auth__adblock { display: none !important }", // adblock banner https://www.forbes.com/sites/victoriaforster/2020/05/12/wearing-a-mask-to-reduce-the-spread-of-coronavirus-will-not-give-you-carbon-dioxide-poisoning/#ee133e517f56
        "#cookiePopup + div.modal-backdrop { display: none !important }", // get rid of dark background behind cookie window https://www.t-mobile.nl/
        "section#ccc[slider-optin=''] { display: none !important }", // cookie notification  https://www.wien.info/de/sightseeing/architektur-design
        ".svs-popup-root { display: none !important }", // newsletter signup https://www.iwillteachyoualanguage.com/learn/french/french-tips/french-punctuation
        ".sumome-react-wysiwyg-component { display: none !important }", // newsletter signup https://www.iwillteachyoualanguage.com/learn/french/french-tips/french-punctuation
        "section#ensModalWrapper  { display: none !important }", // cookie banner on https://www.sherwin-williams.com
        "#ccpaCookieBanner { display: none !important }", // cookie banner https://www.paypal.com
        ".formkit-overlay { display: none !important }", // newsletter signup https://alldayidreamaboutfood.com
        ".ab-in-app-message { display: none !important }", // https://www.kcra.com/article/gov-newsom-coronavirus-update-nov-16/34690298
        "#PopupSignupForm_0 .mc-modal { display: none !important }", // signup form https://mobiledevmemo.com/apple-arcade-one-year-on-no-killer-games-cant-compete-with-free/
        "div#view-offer.view-offer, div.piano-fixed-footer-two  { display: none !important }", // signup form https://www.washingtontimes.com/news/2020/dec/1/unraveling-another-democratic-fantasy/
        "div[id *= 'sp_message_container'] { display: none !important }", // a number of sites use this for cookie banners
        ".sp-message-open { position: initial !important; overflow: initial !important }", // goes with sp_message
        "div.navi-push-notification-prompt { display: none !important }", // notification banner https://www.bloomberg.com/quote/IAG:LN
        "div.cmpboxBG { display: none !important }", // cookie banner https://www.werstreamt.es/
        "div[id*='gdpr'] { display: none !important }",
        "div.truste_overlay, div.truste_box_overlay  { display: none !important }", // cookie banner https://www.eurofinsgenomics.eu/en/ecom/checkout/login-register/
        ".gTMtLb > div[class=''], .m114nf.aID8W, #cnsw { display: none !important }", // "before you continue" popup
        "#usercentrics-root { display: none !important }", // cookie banner https://www.ka-news.de/region/karlsruhe/Karlsruhe~/spinnenlaeufer-kalikokrebs-gottesanbeterin-diese-exoten-leben-in-karlsruhe;art6066,2587181
        "div#didomi-custom-host { display: none !important }", // cookie banner https://www.hilti.si/
        "#udtCookiebox, #udtDark, #udtWhite { display: none !important }", // cookie banner https://www.wolterskluwer.com/nl-nl/solutions/basecone
        "cmp-banner { display: none !important }", // cookie banner https://www.sat1.de/
        "div[id*='sp_veil'] { display: none !important }", // sp_veil cookie banners
        "div[id*='c-dialog'] { display: none !important }", // notification banner https://welt.de
        "div[class*='style__modal___'] { display: none !important }", // newsletter https://store.dji.com
        "#locked-screen-notifications { display: none !important }", // https://www.24symbols.com/author/derek-prince?id=19499&locale=es
        ".widget-overlay-mask { display: none !important }", // background overlay https://hotels.com
        ".email-overlay, .overlay-wrapper { display: none !important }", // mouseout banner https://www.howtogeek.com/437624/how-to-enable-google-chromes-new-extensions-menu/
        "div[data-tracking-opt-in-overlay='true'] { display: none !important }", // cookie banner https://pathofexile.gamepedia.com/Path_of_Exile_Wiki
        ".web-snackbar__surface { display: none !important }", // cookie banner https://web.dev
        "#didomi-host::before { display: none !important }", // cookie banner background https://www.filmweb.pl
        "body.preventScroll { position: initial !important }", // https://www.filmweb.pl
        "#consent { display: none !important }", // cookie baner https://www.nperf.com
        ".cookie-notice { display: none !important }", // cookie banner https://tesa.com
        ".-locked body:before { position: initial !important } .-locked body:after { opacity: 1 !important; position: initial !important; background-color: initial !important }", // cookie banner https://tesa.com
        "#read-overlay-container { display: none !important }", // cookie banner background https://www.saechsische.de/
        "#rgpd-banner { display: none !important }", // cookie banner https://www.doktorwhatson.tv/
        "div[data-testid='cookie-policy-banner'] { display: none !important }", // cookie banner https://developers.facebook.com/docs/instagram-basic-display-api/
        "#_evh-ric { display: none !important }", // cookie banner https://www.la7.it/rivedila7
        "div[aria-label *= 'Cookie'], div[aria-label *= 'cookie'] { display: none !important }", // https://www.metoffice.gov.uk/about-us/press-office/news/weather-and-climate/2020/storm-bella-has-been-named
        "#piano_wrapper_unten { display: none !important }", // popup banner https://www.augsburger-allgemeine.de/panorama/Wann-ist-Ostern-2021-Termine-fuer-Ostersonntag-Ostermontag-und-Osterferien-id53722456.html
        ".privy-popup-content-wrap { display: none !important }", // newsletter signup https://oceansalive.co.uk/
        ".modal-backdrop { display: none !important }", // background overlay https://www.mantel.com
        "#tsj-bottom-ad { display: none !important }", // https://www.thesleepjudge.com/how-much-is-a-sleep-number-bed/
    ];

He's basically just adding a CSS file to the DOM with these rules.

I used CRX reader to view the source of the Chrome extension "remove cookie banners".
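
Readability is handed a parsed document here, not a rendered page, so a stylesheet of `display: none` rules probably won't change what it extracts. A minimal sketch of one way to reuse such a rule list, assuming the goal is to delete the matching elements before calling `parse()`: pull the selector out of each hiding rule and remove whatever it matches. The names `hidingSelectors`, `removeHidden`, `QueryRoot` and `RemovableNode` are hypothetical and not part of Readability or deno-dom. Whether a given DOM library accepts every selector in the list is untested, so selectors it cannot parse are skipped.

```typescript
// Hypothetical structural types so the sketch works with any DOM-like object
// (a deno-dom document, a browser document, or a test double).
interface RemovableNode {
  remove(): void;
}
interface QueryRoot {
  querySelectorAll(selector: string): ArrayLike<RemovableNode>;
}

// Return the selector part of every rule whose declarations hide the element.
// Rules that only tweak overflow, position, etc. are ignored.
function hidingSelectors(rules: string[]): string[] {
  const selectors: string[] = [];
  for (const rule of rules) {
    const open = rule.indexOf("{");
    if (open === -1) continue;
    const declarations = rule.slice(open + 1);
    if (/display\s*:\s*none/.test(declarations)) {
      selectors.push(rule.slice(0, open).trim());
    }
  }
  return selectors;
}

// Remove every element matched by a hiding rule; returns how many were removed.
function removeHidden(root: QueryRoot, rules: string[]): number {
  let removed = 0;
  for (const selector of hidingSelectors(rules)) {
    try {
      // Copy the matches first so removal cannot disturb the iteration.
      for (const node of Array.from(root.querySelectorAll(selector))) {
        node.remove();
        removed++;
      }
    } catch {
      // Assumption: a selector the DOM library rejects is skipped, not fatal.
    }
  }
  return removed;
}
```

With the array above, this would be called as `removeHidden(doc, html)` before `new Readability(doc).parse()`, assuming `doc` exposes `querySelectorAll`.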

ralyodio commented 2 years ago

So I found something easier... maybe add a flag or something, like removeCookieBanners: true

All we need to do is run this:

        // forEach rather than map: we only want the side effect, not a new array.
        document.querySelectorAll('[class*="cookie" i]').forEach(n => n.remove());
        document.querySelectorAll('[id*="cookie" i]').forEach(n => n.remove());

that's going to delete almost everything that guy has in his extension.
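
The `i` flag in those selectors comes from Selectors Level 4, and nothing in this thread confirms that deno-dom's selector engine accepts it. A rough sketch of the same idea that avoids the flag by comparing lowercased attribute values in code; `removeCookieBanners`, `AttrRoot` and `AttrNode` are hypothetical names, and Readability has no such option today.

```typescript
// Hypothetical structural types; any DOM-like object with these members will do.
interface AttrNode {
  getAttribute(name: string): string | null;
  remove(): void;
}
interface AttrRoot {
  querySelectorAll(selector: string): ArrayLike<AttrNode>;
}

// Remove every element whose class or id contains `needle`, ignoring case.
// Returns the number of elements removed.
function removeCookieBanners(root: AttrRoot, needle = "cookie"): number {
  const wanted = needle.toLowerCase();
  let removed = 0;
  // Snapshot the candidates before mutating the tree.
  for (const node of Array.from(root.querySelectorAll("[class], [id]"))) {
    const cls = node.getAttribute("class") || "";
    const id = node.getAttribute("id") || "";
    if ((cls + " " + id).toLowerCase().indexOf(wanted) !== -1) {
      node.remove();
      removed++;
    }
  }
  return removed;
}
```

A substring match like this is blunt: it would also remove real content whose class or id happens to contain "cookie" (a recipe page, say), so it is probably best kept opt-in.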

b-fuze commented 2 years ago

I think that you're using a very old version of Readability? Not entirely sure, but you can import version v0.4.1 with the following

import { Readability } from "https://esm.sh/@mozilla/readability@0.4.1?no-check";

vs what seems to be a much older build of v0.2.1?

import Readability from "https://cdn.esm.sh/v57/moz-readability@0.2.1/es2021/moz-readability.js";

Which was last updated 2 years ago on NPM, vs 10 months ago for v0.4.1?

ralyodio commented 2 years ago

still hangs on a lot of doms from deno-dom parser.

b-fuze commented 2 years ago

@ralyodio run this exact code and tell me if you still get the same problem

import { Readability } from "https://esm.sh/@mozilla/readability@0.4.1?no-check";
import { DOMParser } from "https://deno.land/x/deno_dom@v0.1.18-alpha/deno-dom-wasm.ts";

const url =
  "https://www.bloomberg.com/news/articles/2021-11-13/apple-to-pay-30-million-over-security-checks-for-store-workers";
console.log(url);

const res = await fetch(url, {
  headers: {
    "User-Agent":
      "Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0",
  },
});

// Read the body once; calling res.text() a second time on the same response throws.
const body = await res.text();

if (!res.ok) {
  console.error(res.status, body);
}
const doc = new DOMParser().parseFromString(body, "text/html");

console.log("parsing article");
const article = new Readability(doc).parse();
console.log("articled parsed");
console.log(article);

gijsk commented 2 years ago

still hangs on a lot of doms from deno-dom parser.

At this point I think this is a deno-dom issue, and shouldn't be triaged further here, given the earlier comment where the deno-dom folks fixed something themselves.

chovyprognos commented 2 years ago

Yeah that code works now.

ralyodio commented 2 years ago

Still hangs with Puppeteer page content, though:

http://sprunge.us/5usCXb

This body causes the parser to hang.

ralyodio commented 2 years ago

import { Readability } from "https://esm.sh/@mozilla/readability@0.4.1?no-check";
import { DOMParser } from "https://deno.land/x/deno_dom@v0.1.18-alpha/deno-dom-wasm.ts";

const res = await fetch('http://sprunge.us/lS60yJ');
const body = await res.text();
const doc = new DOMParser().parseFromString(body, 'text/html');
console.log('parsing article');
const article = new Readability(doc).parse();
console.log('articled parsed!');
console.log(article);

gijsk commented 2 years ago

still hangs with puppeteer page content though:

http://sprunge.us/5usCXb

this body causes parser to hang

This is running into the issue fixed in https://github.com/mozilla/readability/pull/694 , because apparently deno-dom also doesn't think that Node.childNodes should be a live array. It gets stuck trying to move all the children out of a DOM node because the DOM implementation lies and says the children are still there. That's really a bug in the DOM library, but anyway - if you use current git tip the problem goes away. We haven't done a point release with this fix yet.
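
To make the failure mode concrete, here is a toy model. It is not deno-dom's or Readability's actual code, and `SketchNode`, `moveChildrenNaive` and `moveChildrenSafe` are made-up names. It shows why a "move the first child until none are left" loop never ends when `childNodes` is a stale snapshot, and one defensive pattern (iterate over a copy) that terminates either way. The `maxSteps` guard exists only so the sketch can report the hang instead of spinning.

```typescript
// Toy node: `live` controls whether childNodes reflects later mutations.
class SketchNode {
  private kids: SketchNode[] = [];
  private stale: SketchNode[] | null = null;
  private parent: SketchNode | null = null;
  private live: boolean;

  constructor(live: boolean) {
    this.live = live;
  }

  get childNodes(): readonly SketchNode[] {
    if (this.live) return this.kids;
    // Non-live: freeze the list at first access and keep returning it.
    if (this.stale === null) this.stale = this.kids.slice();
    return this.stale;
  }

  // Ground truth, independent of what childNodes reports.
  get actualChildCount(): number {
    return this.kids.length;
  }

  appendChild(child: SketchNode): void {
    const old = child.parent;
    if (old !== null) old.kids.splice(old.kids.indexOf(child), 1);
    child.parent = this;
    this.kids.push(child);
  }
}

// Moves children one at a time, trusting childNodes to shrink.
// Returns false if it gave up after maxSteps (i.e. it would have hung).
function moveChildrenNaive(from: SketchNode, to: SketchNode, maxSteps: number): boolean {
  let steps = 0;
  while (from.childNodes.length > 0) {
    if (++steps > maxSteps) return false;
    to.appendChild(from.childNodes[0]);
  }
  return true;
}

// Copies the list once and iterates the copy, so it ends after one pass
// whether or not childNodes is live.
function moveChildrenSafe(from: SketchNode, to: SketchNode): void {
  for (const child of Array.from(from.childNodes)) {
    to.appendChild(child);
  }
}

function buildParent(live: boolean, count: number): SketchNode {
  const parentNode = new SketchNode(live);
  for (let i = 0; i < count; i++) parentNode.appendChild(new SketchNode(live));
  return parentNode;
}
```

In this model `moveChildrenNaive(buildParent(true, 3), new SketchNode(true), 100)` finishes, while the same call with `false` runs out of steps because the frozen list never gets shorter.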

If you still have more issues, please do more debugging - given it hangs in a single while loop, if you use --inspect with deno and then use the devtools in a chromium browser and attach, you can just break execution and see what's happening.