mishushakov / llm-scraper

Turn any webpage into structured data using LLMs
MIT License

The example code doesn't work (Hacker News articles) #34

Open moda20 opened 1 month ago

moda20 commented 1 month ago

Hi, I tried the example code to see if the scraper works, but it always returns a validation error for the attribute top, which is supposed to be an array.

Here is my example code, slightly tweaked to use a local Ollama instance:

import { chromium } from 'playwright'
import { z } from 'zod'
import LLMScraper from 'llm-scraper'
import { ollama } from 'ollama-ai-provider'

// Launch a browser instance
const browser = await chromium.launch()

// Initialize LLM provider
const llm = ollama('llama3', {

})
llm.config.baseURL = 'http://localhost:7869/api'

// Create a new LLMScraper
const scraper = new LLMScraper(llm)

// Open new page
const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')

// Define schema to extract contents into
const schema = z.object({
    top: z
        .array(
            z.object({
                title: z.string(),
                points: z.number(),
                by: z.string(),
                commentsURL: z.string(),
            })
        )
        .length(5)
        .describe('Top 5 stories on Hacker News'),
})

// Run the scraper
const { data } = await scraper.run(page, schema, {
    format: 'html',
})

// Show the result from LLM
console.log(data.top)

await page.close()
await browser.close()

The error log:

      error: new TypeValidationError({
             ^

TypeValidationError [AI_TypeValidationError]: Type validation failed: Value: {"title":"Ask HN: Where to find the cheapest proxies for web scraping?","url":"item?id=41023251","points":8,"user":"aw123","time_ago":"3 hours ago","comments_count":2}.
Error message: [
  {
    "code": "invalid_type",
    "expected": "array",
    "received": "undefined",
    "path": [
      "top"
    ],
    "message": "Required"
  }
]
    at safeValidateTypes (file:///Users/medmansour/Documents/personalProjects/llm-scrapper/node_modules/@ai-sdk/provider-utils/dist/index.mjs:205:14)
    at safeParseJSON (file:///Users/medmansour/Documents/personalProjects/llm-scrapper/node_modules/@ai-sdk/provider-utils/dist/index.mjs:248:12)
    ... 3 lines matching cause stack trace ...
    at async file:///Users/medmansour/Documents/personalProjects/llm-scrapper/index.js:39:18 {
  cause: ZodError: [
    {
      "code": "invalid_type",
      "expected": "array",
      "received": "undefined",
      "path": [
        "top"
      ],
      "message": "Required"
    }
  ]
      at get error [as error] (file:///Users/medmansour/Documents/personalProjects/llm-scrapper/node_modules/zod/lib/index.mjs:587:31)
      at safeValidateTypes (file:///Users/medmansour/Documents/personalProjects/llm-scrapper/node_modules/@ai-sdk/provider-utils/dist/index.mjs:207:33)
      at safeParseJSON (file:///Users/medmansour/Documents/personalProjects/llm-scrapper/node_modules/@ai-sdk/provider-utils/dist/index.mjs:248:12)
      at generateObject (file:///Users/medmansour/Documents/personalProjects/llm-scrapper/node_modules/ai/dist/index.mjs:689:23)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async generateAISDKCompletions (file:///Users/medmansour/Documents/personalProjects/llm-scrapper/node_modules/llm-scraper/dist/models.js:20:20)
      at async file:///Users/medmansour/Documents/personalProjects/llm-scrapper/index.js:39:18 {
    issues: [
      {
        code: 'invalid_type',
        expected: 'array',
        received: 'undefined',
        path: [ 'top' ],
        message: 'Required'
      }
    ],
    addIssue: [Function (anonymous)],
    addIssues: [Function (anonymous)],
    errors: [
      {
        code: 'invalid_type',
        expected: 'array',
        received: 'undefined',
        path: [ 'top' ],
        message: 'Required'
      }
    ]
  },
  value: {
    title: 'Ask HN: Where to find the cheapest proxies for web scraping?',
    url: 'item?id=41023251',
    points: 8,
    user: 'aw123',
    time_ago: '3 hours ago',
    comments_count: 2
  }
}

A value does seem to be returned, but as a single story object rather than as an array under the top key.
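To make the mismatch concrete, here is a minimal sketch in plain JavaScript. It assumes nothing beyond the schema and the `value` shown in the log above; `checkTop` is a hypothetical hand-written stand-in for the schema check, not the zod or AI SDK implementation.

```javascript
// The object the model returned, copied from the `value` field of the error log.
const returned = {
  title: 'Ask HN: Where to find the cheapest proxies for web scraping?',
  url: 'item?id=41023251',
  points: 8,
  user: 'aw123',
  time_ago: '3 hours ago',
  comments_count: 2,
}

// Field types each story is expected to have, taken from the schema above.
const storyFields = {
  title: 'string',
  points: 'number',
  by: 'string',
  commentsURL: 'string',
}

// Hypothetical helper: roughly mirrors what the schema requires
// (an array `top` of exactly 5 story objects) and lists what is missing.
function checkTop(value) {
  const problems = []
  if (!Array.isArray(value.top)) {
    problems.push({ path: ['top'], expected: 'array', received: typeof value.top })
    return problems
  }
  if (value.top.length !== 5) {
    problems.push({ path: ['top'], expected: 'length 5', received: `length ${value.top.length}` })
  }
  value.top.forEach((story, i) => {
    for (const [key, type] of Object.entries(storyFields)) {
      const actual = typeof story?.[key]
      if (actual !== type) {
        problems.push({ path: ['top', i, key], expected: type, received: actual })
      }
    }
  })
  return problems
}

console.log(checkTop(returned))
```

Run against the logged value, this reports a single problem at path top with received 'undefined', the same shape as the issue in the log. The model answered with one flat story (and with field names such as user and url that the schema does not define) instead of wrapping five stories in top.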

mishushakov commented 1 month ago

Please report this issue on Vercel AI SDK: https://github.com/vercel/ai

moda20 commented 1 month ago

@mishushakov Why? Is the zod package using the Vercel AI SDK?