mishushakov / llm-scraper

Turn any webpage into structured data using LLMs

Zod TypeValidationError #29

Closed RayRama closed 4 months ago

RayRama commented 4 months ago

When I try to use llama3 locally, I get an error like this:

[screenshot: Zod TypeValidationError]

I used the bare example from the readme and only changed the LLM provider to llama3:

import { chromium } from "playwright";
import { z } from "zod";

import { ollama } from "ollama-ai-provider";
import LLMScraper from "llm-scraper";

// Launch a headless browser and point the scraper at the local llama3 model
const browser = await chromium.launch();

const llm = ollama("llama3");

const scraper = new LLMScraper(llm);

// Open the target page
const page = await browser.newPage();
await page.goto("https://news.ycombinator.com");

// Define the expected output structure with Zod
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe("Top 5 stories on Hacker News"),
});

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: "html",
});

// Show the result from the LLM
console.log(data.top);

await page.close();
await browser.close();
Dependencies (package.json):

{
  "dependencies": {
    "llm-scraper": "^1.2.1",
    "ollama-ai-provider": "^0.10.0",
    "playwright": "^1.45.1",
    "zod": "^3.23.8"
  }
}
mishushakov commented 4 months ago

I believe the model in question could not generate the correct output structure, which is why validation failed. My suggestion would be to try a different model or a different structure mode ('auto' | 'json' | 'tool' | 'grammar').
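
For example, something like this (a sketch, not verified here, assuming the structure mode is passed as a `mode` field in the run options alongside `format`):

const { data } = await scraper.run(page, schema, {
  format: "html",
  // assumption: the option is named `mode` and accepts these values
  mode: "json", // or 'auto' | 'tool' | 'grammar'
});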

sam-roberts commented 3 months ago

As someone new to the repo, this ends up being a bit confusing. The examples and documentation don't immediately suggest that popular models like ollama -> llama3 would run into problems.

Of course it's understandable not to keep track of every single model out there, but for self-hosted options it would help to recommend models that have been tested, at least as a starting point.

For example, I tried all four structure modes and none of them worked with llama3:8b. If that is a dead end and this model doesn't work with the tool, how do we figure out which models do work, apart from trial and error? A rough sketch of the trial and error I mean is below.
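
A sketch of that trial and error, assuming the structure mode is passed as a `mode` field in the run options (not confirmed):

const modes = ["auto", "json", "tool", "grammar"] as const;

for (const mode of modes) {
  try {
    // assumption: `mode` selects the structured-output strategy
    const { data } = await scraper.run(page, schema, { format: "html", mode });
    console.log(`mode '${mode}' worked:`, data.top);
    break; // stop at the first mode whose output validates
  } catch (err) {
    console.warn(`mode '${mode}' failed:`, err);
  }
}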

mishushakov commented 3 months ago

Try the new llama3.1; it has built-in tool use and, in my experience, generates correct JSON most of the time.
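
A minimal sketch of switching the original example over, assuming the model is pulled locally under the "llama3.1" tag and that the run options above apply:

const llm = ollama("llama3.1"); // assumption: llama3.1 is available in the local Ollama install
const scraper = new LLMScraper(llm);

const { data } = await scraper.run(page, schema, {
  format: "html",
  mode: "tool", // lean on the model's built-in tool use for structured output
});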