stanford-oval / storm

An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.
http://storm.genie.stanford.edu
MIT License

[BUG] Outputting numerous 403 errors when running Co-STORM #193

Open ColtonBehannon opened 3 weeks ago

ColtonBehannon commented 3 weeks ago

Describe the bug

When running the Co-STORM example, I get numerous 403 errors in the terminal output. These are followed by some trafilatura errors and errors complaining that 'The API deployment for this resource does not exist'.

Despite all this, the final report appears to be generated correctly. The only issue is that the terminal output is impossible to follow because of the errors.

This issue is similar to #133, where I also commented because I had seen similar results with STORM in the past. I have tried multiple networks, and this made no difference.

Are the 403 errors caused by sites that do not allow scraping, and are those sites therefore left out of the final report?

To Reproduce

  1. Set up the environment according to run_costorm_gpt.py.
  2. Run it.

Screenshots

Error while requesting URL 403 (screenshot)

followed by

Trafilatura errors and 'An error occurred for text: root, ' with 404 code (screenshot)

Environment:

shaoyijia commented 1 week ago

This happens because WebPageHelper fails to process some URLs, either because it cannot fetch the URL content or because it fails to parse that content.

If you don't want to see the errors, you can add the following code to the script you run:

import logging

logging.basicConfig(level=logging.CRITICAL)

However, it is generally better to keep logging errors and warnings. If you don't want to see them in the console output, you can write them to a file instead. See https://docs.python.org/3/library/logging.html for more info.
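A minimal sketch of the write-to-file option, using only the standard `logging` module. It assumes the 403 and trafilatura messages are emitted through Python logging (which the suggestion above implies) and that this runs before any other logging configuration in the script. The function name `route_logs_to_file` and the file name `costorm_errors.log` are hypothetical, chosen for illustration.

```python
import logging


def route_logs_to_file(log_path="costorm_errors.log"):
    """Send WARNING and above to a file; show only CRITICAL on the console.

    Hypothetical helper: the name and default path are illustrative and are
    not part of STORM.
    """
    root = logging.getLogger()
    root.setLevel(logging.WARNING)

    # The file handler keeps the full record of warnings and errors.
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.WARNING)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )

    # The console handler stays quiet unless something is critical.
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.CRITICAL)

    root.addHandler(file_handler)
    root.addHandler(console_handler)
    return file_handler
```

Calling `route_logs_to_file()` near the top of the script, before the pipeline starts, should keep the terminal readable while the fetch and parse failures remain available in the log file for later inspection.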