mishushakov / llm-scraper

Turn any webpage into structured data using LLMs
MIT License

Fix Readable.js on certain pages (like HN) #8

Open mishushakov opened 4 months ago

blanky0230 commented 1 month ago

Looking into this, it's HN's Content Security Policy (CSP) that blocks loading scripts from Skypack. The browser console shows:

Refused to load the script 'https://cdn.skypack.dev/@mozilla/readability' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/". Note that 'script-src-elem' was not explicitly set, so 'script-src' is used as a fallback.
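
To make the error above concrete, here is a rough sketch of how a script-src allowlist gets matched against a script URL. It is a deliberate simplification, not the full matching algorithm from the CSP spec: wildcards, scheme-only sources, nonces, and hashes are ignored, and the function name isScriptAllowed is made up for illustration.

```typescript
// Simplified sketch of CSP script-src matching (hypothetical helper, not a browser API).
// Only handles 'self' and plain https://host/path sources.
function isScriptAllowed(scriptUrl: string, pageUrl: string, policy: string): boolean {
  const [directive, ...sources] = policy.trim().split(/\s+/)
  if (directive !== 'script-src') throw new Error('expected a script-src directive')

  const script = new URL(scriptUrl)
  const page = new URL(pageUrl)

  return sources.some((source) => {
    // 'self' allows scripts served from the page's own origin
    if (source === "'self'") return script.origin === page.origin
    // Other keywords ('unsafe-inline', ...) never allow an external script URL
    if (source.startsWith("'")) return false

    const allowed = new URL(source)
    if (allowed.origin !== script.origin) return false
    // A source path ending in "/" is a prefix match; otherwise the path must match exactly
    return allowed.pathname.endsWith('/')
      ? script.pathname.startsWith(allowed.pathname)
      : script.pathname === allowed.pathname
  })
}

// The directive quoted in the console error above
const hnPolicy =
  "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/"

const readabilityAllowed = isScriptAllowed(
  'https://cdn.skypack.dev/@mozilla/readability',
  'https://news.ycombinator.com/',
  hnPolicy
)
console.log(readabilityAllowed) // false: cdn.skypack.dev is not in the allowlist
```

None of the listed sources share an origin with cdn.skypack.dev, which is why the injected script is refused.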

Creating the page with the Playwright option { bypassCSP: true } completely resolves the issue (docs: https://playwright.dev/docs/api/class-browser#browser-new-page-option-bypass-csp). For example:

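// Assumes `browser` (Playwright), `scraper` (llm-scraper) and `z` (zod) are already set up
// Open a new page with CSP checks bypassed so the injected script can load
const page = await browser.newPage({ bypassCSP: true })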
await page.goto('https://news.ycombinator.com')

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: 'text',
})

// Show the result from LLM
console.log(data.top)

Example output I currently get:

[
  {
    title: "Crowdstrike Update: Windows Bluescreen and Boot Loops",
    points: 2126,
    by: "BLKNSLVR",
    commentsURL: "https://reddit.com",
  }, {
    title: "FCC votes unanimously to dramatically limit prison telecom charges",
    points: 293,
    by: "Avshalom",
    commentsURL: "https://worthrises.org",
  }, {
    title: "Foliate: Read e-books in style, navigate with ease",
    points: 330,
    by: "ingve",
    commentsURL: "https://johnfactotum.github.io",
  }, {
    title: "Want to spot a deepfake? Look for the stars in their eyes",
    points: 65,
    by: "jonbaer",
    commentsURL: "https://ras.ac.uk",
  }, {
    title: "Startups building balloons to hoist tourists 100k feet into the stratosphere",
    points: 15,
    by: "amichail",
    commentsURL: "https://cnbc.com",
  }
]

blanky0230 commented 1 month ago

So I suppose it would be sufficient to document this behavior as a known constraint on some sites when using text mode with Readable.js?

mishushakov commented 1 month ago

We should get rid of Readable.js in favour of html2text