jamesturk / scrapeghost

👻 Experimental library for scraping websites using OpenAI's GPT API.
https://jamesturk.github.io/scrapeghost/

HallucinationChecker error #31

Closed ryandorward closed 1 year ago

ryandorward commented 1 year ago

This is a very promising and simple tool, thanks for sharing it!

I was playing with it, but I'm getting the following error:

scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.

full trace:

  File "[...]/scraper.py", line 17, in <module>
    response = scraper(url)
               ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/scrapers.py", line 142, in scrape
    return self._apply_postprocessors(  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/apicall.py", line 207, in _apply_postprocessors
    response = pp(response, self)
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapeghost/postprocessors.py", line 102, in __call__
    raise PostprocessingError(
scrapeghost.errors.PostprocessingError: HallucinationChecker expecting a dict, ensure JSONPostprocessor or equivalent is used first.

My scraper.py code is close to the tutorial example, but maybe I'm doing it wrong:

from scrapeghost import SchemaScraper, CSS
from pprint import pprint

url = "https://www.boredpanda.com/bruce-lee-quotes"
schema = {
    "index": "int",
    "quote": "str",
}

scraper = SchemaScraper(
    schema,
    extra_preprocessors=[CSS(".open-list-items div.open-list-item:nth-child(-n+10) .bordered-description")],
)

response = scraper(url)
pprint(response.data)

I tried to disable the HallucinationChecker by overriding the postprocessors, but it wasn't clear to me how to do that properly.

Thanks again for your work on this, it's very cool and exciting!

jamesturk commented 1 year ago

Ah yeah, it was naively assuming the response was a dictionary, not a list. I've just fixed this, and the fix will be in 0.4.2.
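
For readers hitting the same error, here is a minimal, self-contained sketch of the general idea: a check that accepts either a dictionary or a list of records, rather than assuming a dictionary. It is not scrapeghost's actual implementation. The function name `find_unsupported_values`, the sample `page` string, and the "string value must appear in the page text" rule are all assumptions made for illustration.

```python
from typing import Any


def find_unsupported_values(data: Any, source_text: str) -> list[str]:
    """Return string values in `data` that never appear in `source_text`.

    Hypothetical helper: walks dicts and lists alike, so a top-level
    list of records is handled the same way as a single record.
    """
    if isinstance(data, dict):
        items = list(data.values())
    elif isinstance(data, list):
        items = data
    else:
        # Scalars: only strings can be compared against the page text.
        if isinstance(data, str) and data not in source_text:
            return [data]
        return []

    missing: list[str] = []
    for item in items:
        missing.extend(find_unsupported_values(item, source_text))
    return missing


# Made-up page text and records, shaped like the schema in this issue.
page = "<div>Be water, my friend.</div><div>Knowing is not enough.</div>"
records = [
    {"index": 1, "quote": "Be water, my friend."},
    {"index": 2, "quote": "A quote that is not on the page."},
]
unsupported = find_unsupported_values(records, page)
print(unsupported)
```

Branching on the container type first means a list of records never reaches dictionary-only code, which is the kind of assumption described above.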

ryandorward commented 1 year ago

Thanks for fixing it 👍