microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License

Uncaught exceptions from third party scripts #613

Closed: kylealwyn closed this issue 1 year ago

kylealwyn commented 1 year ago

Hi, thanks for the amazing lib. I have a question about exceptions like the following, which show up after passing HTML through a metascraper pipeline:

crawler:dev: error: uncaughtException: TextEncoder is not defined
crawler:dev: ReferenceError: TextEncoder is not defined
crawler:dev:     at https://ui-seo.crackedcdn.com/js/app.js?v=f0adbe077c9500ef01abee9e871f6045:2:18679
crawler:dev:     at s (https://ui-seo.crackedcdn.com/js/app.js?v=f0adbe077c9500ef01abee9e871f6045:2:2194)
crawler:dev:     at Generator._invoke (https://ui-seo.crackedcdn.com/js/app.js?v=f0adbe077c9500ef01abee9e871f6045

I also see a variety of others involving IntersectionObserver or matchMedia.

Is there a way to disable running third-party scripts?
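
One possible workaround, which nobody in this thread has confirmed, is to remove script elements from the fetched HTML before handing it to the pipeline, so there is nothing for a DOM implementation to load or execute. The following is a minimal sketch under that assumption. `stripScripts` is a hypothetical helper, not part of metascraper, and the regex is a rough approximation rather than a real HTML parser. It keeps JSON-LD blocks, since those carry metadata rather than executable code.

```javascript
// Hypothetical helper (not a metascraper API): drop <script> elements from an
// HTML string, but keep JSON-LD blocks because they hold metadata, not code.
// Regex-based, so it is an approximation and can be fooled by unusual markup.
const SCRIPT_RE = /<script\b([^>]*)>[\s\S]*?<\/script\s*>/gi
const JSON_LD_RE = /type\s*=\s*["']?application\/ld\+json/i

function stripScripts (html) {
  return html.replace(SCRIPT_RE, (match, attrs) =>
    JSON_LD_RE.test(attrs) ? match : ''
  )
}
```

The cleaned string would then be passed as the html value in place of the raw response body. Whether this actually stops the errors depends on where the scripts are being executed, so it is worth testing against a page known to fail.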

Kikobeats commented 1 year ago

Hello, can you provide a way to reproduce this?

kylealwyn commented 1 year ago

I'm not sure I'll be able to get one tonight, but I'll try as soon as I can. My hunch is that it's coming from https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-helpers/index.js#L430 which is used in the audio rule. I don't see any other reason why scripts from a third-party CDN would be loaded and executed: I'm only fetching HTML through a proxy tool and running it through the pipeline. I'll also try removing the audio rule to see if I can isolate the problem.

One other note as I'm heading out the door: that theory might line up, since I seem to encounter far more errors on podcast syndication sites.

Kikobeats commented 1 year ago

Unfortunately, I can't do anything about that here. I'm experiencing the same behavior, but with fetch: https://github.com/jsdom/jsdom/issues/3413

Just track that issue, and as soon as it's fixed we can land the behavior here too.
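
Until the upstream issue is resolved, one stopgap is to recognize these errors at the process level instead of letting them take down the crawler. The sketch below assumes that the offending stack frames point at a remote URL, as they do in the log at the top of this issue. `isRemoteScriptError` is a hypothetical name, and swallowing uncaught exceptions is generally risky, so anything that does not match is still treated as fatal.

```javascript
// Matches stack frames that point at a remote URL, for example
// "at https://host/app.js:2:18679" or "at s (https://host/app.js:2:2194)".
const REMOTE_FRAME_RE = /^\s*at\s+(?:.*\()?https?:\/\//m

// Hypothetical helper: true when any frame of the stack comes from a
// script that was loaded over http(s) rather than from local code.
function isRemoteScriptError (err) {
  return Boolean(
    err && typeof err.stack === 'string' && REMOTE_FRAME_RE.test(err.stack)
  )
}

process.on('uncaughtException', err => {
  if (isRemoteScriptError(err)) {
    // Assumption: errors thrown by page scripts are safe to ignore here.
    console.warn('Ignoring error from a page script:', err.message)
    return
  }
  // Anything else is a real bug in our own code: log it and exit.
  console.error(err)
  process.exit(1)
})
```

If a logger is already installed as the uncaught exception handler, the same predicate could be used there to downgrade these entries rather than registering a second handler.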