typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

After re-running docsearch-scraper, deleting the previous collection fails with status code 404 #26

Closed: jasiek-net closed this issue 6 months ago

jasiek-net commented 1 year ago

Description

When I try to rescrape the documentation, I get a 404 error at the end.

Steps to reproduce

Scrape documentation in docusaurus once, then rerun scraper

docker run -p 8080:8080 -it --env-file=./.env -e "CONFIG=$(cat ./typesense.json | jq -r tostring)" typesense/docsearch-scraper

Expected Behavior

The existing docs should be replaced with the new ones.

Actual Behavior

The scraper errors out while deleting the previous collection:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/seleuser/src/index.py", line 116, in <module>
    run_config(environ['CONFIG'])
  File "/home/seleuser/src/index.py", line 104, in run_config
    typesense_helper.commit_tmp_collection()
  File "/home/seleuser/src/typesense_helper.py", line 90, in commit_tmp_collection
    self.typesense_client.collections[old_collection_name].delete()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/typesense/collection.py", line 22, in delete
    return self.api_call.delete(self._endpoint_path())
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/typesense/api_call.py", line 158, in delete
    return self.make_request(requests.delete, endpoint, True,
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/typesense/api_call.py", line 115, in make_request
    raise ApiCall.get_exception(r.status_code)(r.status_code, error_message)
typesense.exceptions.ObjectNotFound: [Errno 404] No collection with name `xxx` found.

Metadata

Typesense Version: 0.24

OS: macOS

jasonbosco commented 1 year ago

Looks like the previous collection might have been deleted manually outside of the scraper...

To fix this, delete the alias and the collection created by the scraper via the API, and then re-run the scraper.
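The suggested cleanup can be sketched as a small helper around the official typesense Python client. This is illustrative, not part of the scraper; the alias and collection names are placeholders you would replace with whatever your scraper config created:

```python
def cleanup_scraper_state(client, alias_name, collection_name):
    """Delete the scraper-created alias, then the timestamped collection it
    pointed to, so the next scraper run starts from a clean slate.

    `client` is a typesense.Client instance; both names are placeholders
    for whatever your scraper config produced (e.g. alias 'docs',
    collection 'docs_1689120603').
    """
    # Delete the alias first so nothing points at the collection...
    client.aliases[alias_name].delete()
    # ...then delete the timestamped collection itself.
    client.collections[collection_name].delete()
```

With a real client this would be called as `cleanup_scraper_state(typesense.Client({...}), 'docs', 'docs_1689120603')`, then the scraper re-run.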

nascode commented 1 year ago

@jasonbosco I'm experiencing this issue too.

I am using Typesense Cloud, and scraping only completes successfully after I delete all collections and aliases.

The next scrape attempt then always fails ("delete collection" not found).

jasonbosco commented 1 year ago

@nascode The next time this happens, could you look at the Alias section in your Typesense Cloud dashboard and let me know if the collection that the alias is pointing to exists?

nascode commented 1 year ago

@jasonbosco

Before running the scraper, here are my collections list and alias:

[screenshots: collections list and alias]

Then I tried to run the scraper and got this error

typesense.exceptions.ObjectNotFound: [Errno 404] No collection with name `service-bridge-index_1689120603` found.

jasonbosco commented 1 year ago

Could you show me similar screenshots of the alias screen and the collection selector after you run the scraper?

Also could you double-check that there is only one instance of the scraper running?

timolagus commented 6 months ago

We had a similar issue with Typesense DocSearch scraper 0.8.0 and our locally hosted Typesense server (0.25.1). Here's a log example matching the one reported by @jasiek-net.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/seleuser/src/index.py", line 138, in <module>
    run_config(environ['CONFIG'])
  File "/home/seleuser/src/index.py", line 126, in run_config
    typesense_helper.commit_tmp_collection()
  File "/home/seleuser/src/typesense_helper.py", line 105, in commit_tmp_collection
    self.typesense_client.collections[old_collection_name].delete()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/typesense/collection.py", line 22, in delete
    return self.api_call.delete(self._endpoint_path())
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/typesense/api_call.py", line 158, in delete
    return self.make_request(requests.delete, endpoint, True,
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/typesense/api_call.py", line 115, in make_request
    raise ApiCall.get_exception(r.status_code)(r.status_code, error_message)
typesense.exceptions.ObjectNotFound: [Errno 404] No collection with name `main_1706279577` found.

The interesting part happened just before that traceback, when the scraper asked Typesense to delete the previous collection (in this case, main_1706279577) behind the current alias (main):

INFO:scrapy.core.engine:Spider closed (finished)
DEBUG:typesense.api_call:Making get /aliases/main
DEBUG:typesense.api_call:Try 1 to node 192.168.100.115:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 192.168.100.115:8108
DEBUG:urllib3.connectionpool:http://192.168.100.115:8108/ "GET /aliases/main HTTP/1.1" 200 None
DEBUG:typesense.api_call:192.168.100.115:8108 is healthy. Status code: 200
DEBUG:typesense.api_call:Making put /aliases/main
DEBUG:typesense.api_call:Try 1 to node 192.168.100.115:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 192.168.100.115:8108
DEBUG:urllib3.connectionpool:http://192.168.100.115:8108/ "PUT /aliases/main HTTP/1.1" 200 None
DEBUG:typesense.api_call:192.168.100.115:8108 is healthy. Status code: 200
DEBUG:typesense.api_call:Making delete /collections/main_1706279577
DEBUG:typesense.api_call:Try 1 to node 192.168.100.115:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 192.168.100.115:8108
DEBUG:typesense.api_call:Request to 192.168.100.115:8108 failed because of HTTPConnectionPool(host='192.168.100.115', port=8108): Read timed out. (read timeout=3.0)
DEBUG:typesense.api_call:Sleeping for 1.0 and retrying...
DEBUG:typesense.api_call:No healthy nodes were found. Returning the next node.
DEBUG:typesense.api_call:Try 2 to node 192.168.100.115:8108 -- healthy? False
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 192.168.100.115:8108
DEBUG:urllib3.connectionpool:http://192.168.100.115:8108/ "DELETE /collections/main_1706279577 HTTP/1.1" 404 None
DEBUG:typesense.api_call:192.168.100.115:8108 is healthy. Status code: 404
Traceback (most recent call last):
...
typesense.exceptions.ObjectNotFound: [Errno 404] No collection with name `main_1706279577` found.

The previous collection (main_1706279577) was actually successfully deleted on try 1 (we checked in Typesense), but the scraper timed out before the delete request completed. When the scraper then retried the delete request, the collection was already gone and try 2 terminated in the typesense.exceptions.ObjectNotFound error.
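Given this failure mode (the delete succeeded on try 1, and the retry hit a 404), one defensive pattern is to treat "collection not found" during deletion as success. The sketch below is just that idea, not the scraper's actual code; with the real client, `not_found_exc` would be `typesense.exceptions.ObjectNotFound`:

```python
def delete_collection_tolerantly(client, name, not_found_exc):
    """Delete a collection, treating 'collection not found' as already done.

    If an earlier attempt timed out client-side after the server had already
    performed the delete, the retry comes back 404; swallow that instead of
    failing the whole scrape.
    """
    try:
        client.collections[name].delete()
        return True   # deleted on this attempt
    except not_found_exc:
        return False  # already gone (e.g. removed by a timed-out earlier try)
```

The return value only distinguishes "deleted now" from "was already gone"; either way the desired end state (no collection) is reached.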

We just now switched to scraper version 0.9.1, which includes the timeout increase, and everything has been working fine so far.

PS. The scraper only started timing out last week (after our Typesense server had been running for two months or so). We have a relatively small database with a few dozen smallish collections, and our server resources are perfectly sufficient for the workload, so we have no idea why Typesense suddenly started taking so long to delete a single collection.

jasonbosco commented 6 months ago

@timolagus Thank you for those details. I increased the timeout for all write operations in v0.9.1 of the scraper to solve for a different use-case, but that would definitely help with deletes timing out as well.
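For anyone pinned to an older scraper version, the same mitigation can be applied at the client level; in the typesense Python client the relevant knob is `connection_timeout_seconds`. The host, API key, and value below are illustrative placeholders:

```python
# Illustrative client configuration: raise the timeout well above the 3 s
# read timeout that failed in the log above. Host and API key are placeholders.
TYPESENSE_CLIENT_CONFIG = {
    'nodes': [{'host': 'localhost', 'port': '8108', 'protocol': 'http'}],
    'api_key': 'xyz',
    'connection_timeout_seconds': 60,
}
# With the typesense package installed:
#   import typesense
#   client = typesense.Client(TYPESENSE_CLIENT_CONFIG)
```

A long timeout here only bounds how long the client waits; it does not change how long the server takes to perform the delete.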

We've also made some improvements to the collection delete performance in v0.25.2 of Typesense Server, which should also help prevent this issue.