mishushakov / llm-scraper

Turn any webpage into structured data using LLMs
MIT License
2.42k stars 147 forks

Bug: `tool_calls` is sometimes undefined #20

Open Ademsk1 opened 6 months ago

Ademsk1 commented 6 months ago

When scraping content that might be blocked, I'd like to handle the failure safely. Instead, I get the following error, which crashes my server:

...ab/src/node_modules/llm-scraper/dist/models.js:41
  const c = completion.choices[0].message.tool_calls[0].function.arguments;
                                                    ^
TypeError: Cannot read properties of undefined (reading '0')
    at generateOpenAICompletions

Digging into `completion.choices` in the response, we see something like:

[
  {
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The content you provided shows that access to the requested webpage has been blocked due to security measures implemented by Cloudflare, likely triggered by specific actions or commands deemed suspicious. This type of response is commonly served when automated systems (like web scrapers) or aggressive browsing behaviors are detected. There is no job-related content or other typical webpage elements displayed in the provided HTML. Instead, it provides information about why the access was denied, suggesting methods to resolve the issue such as contacting the site owner."
    },
    "logprobs": null,
    "finish_reason": "stop"
  }
]
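
The message above carries only `content` and no `tool_calls` field, so indexing `tool_calls[0]` throws. A minimal sketch of a guard, assuming the chat completion shape shown above; `extractToolArguments` is a hypothetical helper for illustration and not part of llm-scraper:

```javascript
// Hypothetical helper (not part of llm-scraper): reads the first tool call's
// arguments if present, otherwise reports the model's plain-text reply.
function extractToolArguments(completion) {
  const message = completion?.choices?.[0]?.message;
  const args = message?.tool_calls?.[0]?.function?.arguments;

  if (typeof args !== "string") {
    // The model answered in prose (e.g. a blocked page) instead of calling the tool.
    return { data: null, error: message?.content ?? "No tool call in response" };
  }

  // Tool-call arguments arrive as a JSON-encoded string.
  return { data: JSON.parse(args), error: null };
}
```

With a guard like this, a blocked page would surface as an `error` string the caller can inspect rather than as a `TypeError`.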

My schema description contains this at the end:

If the content is inaccessible, e.g. behind a paywall, or has been blocked, the scraper will describe the error in the error field, and the appropriate status code (e.g. 401: Unauthorized, or 403: Forbidden).

Could my schema be affecting the completion content? Here's the code I use. Wrapping it in a try block doesn't seem to help.

try {
    const openai = initialise()
    const browser = await chromium.launch();
    const scraper = new LLMScraper(browser, openai);
    const pages = await scraper.run(url, {
      model: "gpt-4-turbo",
      schema,
      mode: "html",
      closeOnFinish: true,
    })
    const stream = []
    for await (const page of pages) {
      stream.push(page)
    }
    console.log(stream[0].data)
    return stream[0].data