Open mishushakov opened 4 months ago
Looking into this, it's HN's CSP that blocks loading scripts from Skypack. The console shows:
Refused to load the script 'https://cdn.skypack.dev/@mozilla/readability' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/". Note that 'script-src-elem' was not explicitly set, so 'script-src' is used as a fallback.
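To make the error above concrete, here is a rough sketch of how a `script-src` allow-list ends up rejecting the Skypack URL. It is a simplified illustration, not the browser's actual matching algorithm: `parseCsp` and `isScriptAllowed` are hypothetical helper names, and wildcard hosts, scheme-only sources, nonces, and hashes are treated as non-matching.

```typescript
// Parse a CSP string into a map of directive name -> list of sources.
function parseCsp(policy: string): Map<string, string[]> {
  const directives = new Map<string, string[]>()
  for (const part of policy.split(';')) {
    const [name, ...sources] = part.trim().split(/\s+/)
    if (name) directives.set(name.toLowerCase(), sources)
  }
  return directives
}

// Simplified check (assumption: only 'self' and full URL sources are handled).
function isScriptAllowed(policy: string, scriptUrl: string, pageOrigin: string): boolean {
  const directives = parseCsp(policy)
  // script-src-elem falls back to script-src, which falls back to default-src.
  const sources =
    directives.get('script-src-elem') ??
    directives.get('script-src') ??
    directives.get('default-src')
  if (sources === undefined) return true // nothing restricts scripts

  const url = new URL(scriptUrl)
  return sources.some((source) => {
    if (source === "'self'") return url.origin === pageOrigin
    if (source.startsWith("'")) return false // keywords such as 'unsafe-inline' never match a URL
    try {
      const allowed = new URL(source)
      // Same origin, and the script path sits under the allowed path prefix.
      return allowed.origin === url.origin && url.pathname.startsWith(allowed.pathname)
    } catch {
      return false // host-only or wildcard sources are not handled in this sketch
    }
  })
}

const hnPolicy =
  "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ " +
  'https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/'

console.log(
  isScriptAllowed(hnPolicy, 'https://cdn.skypack.dev/@mozilla/readability', 'https://news.ycombinator.com')
)
```

Since `cdn.skypack.dev` matches none of the listed sources, the check fails, which mirrors the "Refused to load the script" message.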
Creating the page with the option { bypassCSP: true } completely resolves the issue (https://playwright.dev/docs/api/class-browser#browser-new-page-option-bypass-csp). For example:
// Open new page
const page = await browser.newPage({ bypassCSP: true })
await page.goto('https://news.ycombinator.com')

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: 'text',
})

// Show the result from LLM
console.log(data.top)
Example output I get right now:
[
  {
    title: "Crowdstrike Update: Windows Bluescreen and Boot Loops",
    points: 2126,
    by: "BLKNSLVR",
    commentsURL: "https://reddit.com",
  },
  {
    title: "FCC votes unanimously to dramatically limit prison telecom charges",
    points: 293,
    by: "Avshalom",
    commentsURL: "https://worthrises.org",
  },
  {
    title: "Foliate: Read e-books in style, navigate with ease",
    points: 330,
    by: "ingve",
    commentsURL: "https://johnfactotum.github.io",
  },
  {
    title: "Want to spot a deepfake? Look for the stars in their eyes",
    points: 65,
    by: "jonbaer",
    commentsURL: "https://ras.ac.uk",
  },
  {
    title: "Startups building balloons to hoist tourists 100k feet into the stratosphere",
    points: 15,
    by: "amichail",
    commentsURL: "https://cnbc.com",
  }
]
So I suppose it would be sufficient to document this behavior as a known constraint on some sites when using text mode with Readability.js?
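If this does get documented, one way to show it might be a small helper that only turns on `bypassCSP` for text mode. This is just a sketch under assumptions: `pageOptionsFor` is a hypothetical name, the `'html'` format value is assumed for illustration (only `'text'` appears in this thread), and it assumes text mode is the only mode that injects the remote script.

```typescript
// Hypothetical format names; only 'text' is confirmed by the snippet above.
type Format = 'text' | 'html'

// Options to pass to browser.newPage(). bypassCSP is a real Playwright option;
// enabling it only for text mode is an assumption made for this sketch.
function pageOptionsFor(format: Format): { bypassCSP: boolean } {
  return { bypassCSP: format === 'text' }
}

console.log(pageOptionsFor('text'))
```

Usage would then look like `browser.newPage(pageOptionsFor('text'))`, keeping the site's CSP intact for modes that do not need the bypass.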
We should get rid of Readability.js in favour of html2text.