typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Writing non-200 messages to the console #14

Open lanegoolsby opened 2 years ago

lanegoolsby commented 2 years ago

I am trying to crawl a site that intermittently returns an error response code. I suspect the site is returning an HTTP 403 response with an empty payload because it has request rate throttling enabled. However, I can't confirm that, because the crawler does not log enough detail about the failed response to confirm or deny it.

Is there a way to get more verbose messages? If not, could this be added?

I am running the crawler as a Docker image on Mac.

Here's the error message I receive.

DEBUG:typesense.api_call:our.internal.typesense.server:443 is healthy. Status code: 400
ERROR:scrapy.core.scraper:Spider error processing <GET https://our.internal.site/some/path/> (referer: None)
Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/twisted/internet/defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/src/documentation_spider.py", line 177, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/root/src/documentation_spider.py", line 149, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/root/src/typesense_helper.py", line 63, in add_records
    transformed_records[i:i + 50])
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/documents.py", line 56, in import_
    api_response = self.api_call.post(self._endpoint_path('import'), docs_import, params, as_json=False)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/api_call.py", line 145, in post
    timeout=self.config.connection_timeout_seconds)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/api_call.py", line 113, in make_request
    error_message = r.json().get('message', 'API error.')
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/requests/models.py", line 900, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
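
For reference, the traceback ends in the `r.json()` call in `typesense/api_call.py`, and "Expecting value: line 1 column 1 (char 0)" is what Python's JSON decoder raises when the body is empty or does not start with valid JSON. Below is a minimal sketch of the kind of message I would like to see on the console. It is not the library's actual code: the function name `describe_error`, the logger name, and the 200-character truncation are all hypothetical choices for illustration.

```python
import json
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("crawler.sketch")


def describe_error(status_code, body_text):
    """Build a readable message for a non-2xx response without assuming JSON."""
    try:
        payload = json.loads(body_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        payload = None

    if isinstance(payload, dict) and "message" in payload:
        # JSON error body with a "message" field
        detail = payload["message"]
    elif body_text.strip():
        # Non-JSON body (for example an HTML error page): show a short excerpt
        detail = body_text.strip()[:200]
    else:
        detail = "<empty response body>"

    message = "HTTP {}: {}".format(status_code, detail)
    logger.error(message)
    return message


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    describe_error(403, "")
    describe_error(400, '{"message": "Bad request"}')
```

With something along these lines, an empty 403 would be reported with its status code instead of surfacing as a `JSONDecodeError`.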