I am trying to crawl a site that intermittently returns an error response code. I suspect the site is returning an HTTP 403 response with an empty payload because it has request rate throttling enabled. However, I can't confirm this because the crawler's output does not include enough detail.
Is there a way to get more verbose messages? If not, could this be added?
I am running the crawler as a Docker image on Mac.
Here's the error message I receive.
DEBUG:typesense.api_call:our.internal.typesense.server:443 is healthy. Status code: 400
ERROR:scrapy.core.scraper:Spider error processing <GET https://our.internal.site/some/path/> (referer: None)
Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/twisted/internet/defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/src/documentation_spider.py", line 177, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/root/src/documentation_spider.py", line 149, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/root/src/typesense_helper.py", line 63, in add_records
    transformed_records[i:i + 50])
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/documents.py", line 56, in import_
    api_response = self.api_call.post(self._endpoint_path('import'), docs_import, params, as_json=False)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/api_call.py", line 145, in post
    timeout=self.config.connection_timeout_seconds)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/api_call.py", line 113, in make_request
    error_message = r.json().get('message', 'API error.')
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/requests/models.py", line 900, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
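Judging from the traceback, the failure seems to happen where the error response is parsed with r.json(), which cannot decode an empty body, so the real status code never gets reported. Below is a rough sketch of the kind of reporting I have in mind. It assumes a response object shaped like requests.Response (status_code, text, json()); describe_error_response and FakeResponse are hypothetical names I made up for illustration and are not part of the typesense client or the crawler.

```python
import json


def describe_error_response(response):
    """Build a readable error message without assuming the body is JSON.

    Assumption: `response` behaves like requests.Response, i.e. it has
    `status_code`, `text`, and a `json()` method that raises ValueError
    (json.JSONDecodeError is a subclass) when the body is not valid JSON.
    """
    try:
        body = response.json()
    except ValueError:
        body = None

    if isinstance(body, dict) and "message" in body:
        detail = body["message"]
    else:
        # Empty or non-JSON body: report the raw text instead of crashing.
        raw = (response.text or "").strip()
        detail = raw[:200] if raw else "<empty body>"

    return "HTTP {}: {}".format(response.status_code, detail)


class FakeResponse:
    """Hypothetical stand-in for requests.Response, used only for the demo."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def json(self):
        return json.loads(self.text)


if __name__ == "__main__":
    print(describe_error_response(FakeResponse(403, "")))
    print(describe_error_response(FakeResponse(400, '{"message": "Bad JSON."}')))
```

With something along these lines, an empty 403 would show up as a status code in the log rather than as a JSONDecodeError.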