spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License
1.16k stars, 100 forks

JSONDecodeError when crawling website with specific parameters #227

Closed (tarujg closed 1 week ago)

tarujg commented 1 week ago

Description

A JSONDecodeError is raised when crawling a website with spider through the llama_index integration. The error occurs while the response to the API POST request is being processed, specifically when the response body is decoded as JSON.

Steps to Reproduce

  1. Using spider with llama_index's web reader
  2. Attempting to crawl with the following parameters:
    params = {
        'limit': 2000,
        'return_format': 'markdown',
        'readability': True,
        'metadata': True,
        'stealth': True,
        'cache': False,
        'depth': 10,
        'request_timeout': 10,
        'request': 'smart',
        'respect_robots': True
    }

Error Details

requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The error propagates through the following chain:

  1. Initial error in requests/models.py when attempting to decode JSON
  2. Caught and re-raised in spider.py during _handle_error
  3. Finally surfaces in spider_web.py during the crawl operation
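
The message "Expecting value: line 1 column 1 (char 0)" is what Python's JSON decoder reports when the input has no JSON value at position 0, for example an empty body or an HTML error page. The following is a minimal sketch of a defensive decode step, not the actual code in spider.py or llama_index. The helper name `parse_json_body` and its arguments are hypothetical and exist only for illustration. It uses the standard library `json` module so it runs without network access.

```python
import json


def parse_json_body(status_code: int, body: str):
    """Hypothetical helper: decode a response body, or fail with a descriptive error."""
    # An empty (or whitespace-only) body can never be valid JSON.
    if not body.strip():
        raise ValueError(f"HTTP {status_code}: empty response body, nothing to decode")
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Include the start of the body so a non-JSON reply (e.g. an HTML error page) is visible.
        preview = body[:200]
        raise ValueError(
            f"HTTP {status_code}: response is not valid JSON ({exc.msg}); "
            f"body starts with {preview!r}"
        ) from exc


# Decoding an empty string reproduces the message from the traceback.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    empty_body_message = str(exc)
```

Checking the HTTP status code and the raw body text before decoding makes it clear whether the API returned an empty response or a non-JSON error page.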

Environment

Additional Context

The error suggests the API response is empty or not valid JSON format. This could indicate either:

j-mendez commented 1 week ago

Hi, this repo is the spider engine, so this issue is unrelated to it.

tarujg commented 1 week ago

@j-mendez should I create it in the llama-index-readers?

tarujg commented 1 week ago

https://github.com/run-llama/llama_index/issues/16946