typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Keep previous versions available in Dockerhub #28

Closed · lanegoolsby closed this issue 1 year ago

lanegoolsby commented 1 year ago

Several of our deployment pipelines are currently hung because of #27. However, we can't roll back to the previous version because only the latest version of the scraper is published to Docker Hub.

Please adjust the deployment process so that historical versions are persisted. That way consumers can roll back in the event of a problem.
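
Pinned tags would let a pipeline reference a specific version explicitly. A sketch of what that could look like in a CircleCI executor (the tag below is illustrative, assuming versioned images existed):

executors:
  typesense:
    docker:
      - image: typesense/docsearch-scraper:0.3.4   # hypothetical pinned tag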

jasonbosco commented 1 year ago

Yes definitely. I had this on my todo list to tackle later, but later never came... until now unfortunately.

I've now updated the release process to release tagged images going forward.

Meanwhile, I tried building the 0.3.4 code base, but it looks like there have been some deprecations, so I had to make some small tweaks to get it to build again. I've pushed this out as typesense/docsearch-scraper:0.3.5. Could you give that a shot and let me know if it works as expected?
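
For context, a tagged release flow generally amounts to pushing a version tag alongside latest, along these lines (an illustrative sketch, not the project's actual release script):

docker build -t typesense/docsearch-scraper:0.3.5 .
docker tag typesense/docsearch-scraper:0.3.5 typesense/docsearch-scraper:latest
docker push typesense/docsearch-scraper:0.3.5
docker push typesense/docsearch-scraper:latest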

lanegoolsby commented 1 year ago

Now I am getting HTTP timeouts. I verified our cluster is up and healthy.

Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='our.typensense.url.com', port=443): Read timed out. (read timeout=3.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/src/index.py", line 116, in <module>
    run_config(environ['CONFIG'])
  File "/root/src/index.py", line 43, in run_config
    typesense_helper.create_tmp_collection()
  File "/root/src/typesense_helper.py", line 30, in create_tmp_collection
    self.typesense_client.collections[self.collection_name_tmp].delete()
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/collection.py", line 22, in delete
    return self.api_call.delete(self._endpoint_path())
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/api_call.py", line 159, in delete
    params=params, timeout=self.config.connection_timeout_seconds)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/api_call.py", line 129, in make_request
    raise last_exception
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/typesense/api_call.py", line 103, in make_request
    r = fn(url, headers={ApiCall.API_KEY_HEADER_NAME: self.config.api_key}, **kwargs)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/requests/api.py", line 161, in delete
    return request('delete', url, **kwargs)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='our.typensense.url.com', port=443): Read timed out. (read timeout=3.0)

Exited with code exit status 1
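
The read timeout=3.0 in the traceback comes from the typesense-python client's connection_timeout_seconds setting (visible in the api_call.py frames above). A minimal sketch of how such a client is configured, with a placeholder API key and the redacted host from the logs:

import typesense

# Sketch of a typesense-python client configuration; the scraper builds a
# similar client internally. The 3-second read timeout in the traceback
# corresponds to connection_timeout_seconds.
client = typesense.Client({
    'api_key': 'REPLACE_ME',                 # placeholder
    'nodes': [{
        'host': 'our.typensense.url.com',    # redacted host from the logs
        'port': '443',
        'protocol': 'https',
    }],
    'connection_timeout_seconds': 3,         # raise this if the server responds slowly
})
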
jasonbosco commented 1 year ago

Could you exec bash into the container and try curling your Typesense host's health endpoint? This feels like some sort of Docker networking setup issue...
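
(Typesense exposes an unauthenticated /health endpoint, so a check from inside the container would look something like the following, using the redacted host from the logs:)

# Run from inside the scraper container; /health requires no API key.
curl -s https://our.typensense.url.com/health
# A healthy node responds with: {"ok":true}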

lanegoolsby commented 1 year ago

I get a 301 if I curl from the CircleCI host.

(note: tsUrl is just a temporary env var I created that is equal to our.typensense.url.com above)

[screenshot: curl output from the CircleCI host returning a 301]

I am able to hit the health endpoint from my local machine. [screenshot: successful health check from the local machine]

Everything was working yesterday so this too is almost certainly related to the image.

jasonbosco commented 1 year ago

That's so strange! Wonder where the 301 is coming from.

Could you share the output of curl -svo /dev/null $tsUrl?

lanegoolsby commented 1 year ago

lol, forgot https://, mea culpa.

I can curl from the CircleCI pod. [screenshot: successful curl from the CircleCI pod]
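
(For anyone hitting the same 301: curl assumes http:// when no scheme is given, and HTTP-to-HTTPS redirects typically answer that with a 301. An illustrative comparison, using the $tsUrl variable from above:)

curl -I $tsUrl/health            # no scheme, so curl uses http:// and hits the 301 redirect
curl -I https://$tsUrl/health    # with the scheme, the health check succeeds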

jasonbosco commented 1 year ago

Would you be able to share a minimal CircleCI configuration that replicates the issue? I'm curious to see how the scraper image gets referenced, whether it's using a machine executor or a docker executor, how the networking is set up, etc.

Also, do you use the scraper image directly, or do you build another image based off of the scraper image and use that?

(I still haven't ruled out an issue with the image - I just want to be able to replicate it consistently on my side so I can understand the root cause.)

lanegoolsby commented 1 year ago

We have the crawler built into a CircleCI orb so I'm translating it to a 'normal' job as best I can. It may require some tweaking.

executors:
  typesense: 
    docker:
      - image: $hubUrl/typesense/docsearch-scraper #:0.3.5

commands:
  crawl:
    parameters:
      apiKey:
        type: env_var_name
        description: Environment variable for the Typesense key
        default: TYPESENSE_API_KEY
      config:
        type: string
        description: Path to a JSON file that tells the crawler how to parse a site's structure.
      env:
        type: enum
        default: np
        enum: ["np", "prod"]
        description: Typesense environment, defaults to "np" (non-prod)
    steps:
      - run:
          name: Install dependencies
          command: |
            apt-get update && apt-get install -y git openssh-client
      - run:
          name: Crawl the site
          command: |
            export TYPESENSE_HOST=$([[ << parameters.env >> = "prod" ]] && echo "prodUrl" || echo "nonProdUrl")
            export TYPESENSE_API_KEY="$<< parameters.apiKey >>"
            export CONFIG=$(cat << parameters.config >>)
            cd /root
            pipenv run python -m src.index

jobs:
  crawl_nonprod:
    executor: typesense
    steps: 
      - crawl:
          apiKey: superSecure
          config: /path/to/yadda.json
          env: np
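
(For comparison, outside CircleCI the scraper is typically run directly with docker run and the same environment variables, per the docs linked at the top; the tag, host, and config path below are placeholders taken from this thread:)

docker run -it \
  -e TYPESENSE_API_KEY=$TYPESENSE_API_KEY \
  -e TYPESENSE_HOST=nonProdUrl \
  -e TYPESENSE_PORT=443 \
  -e TYPESENSE_PROTOCOL=https \
  -e CONFIG="$(cat /path/to/yadda.json)" \
  typesense/docsearch-scraper:0.3.5
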
lanegoolsby commented 1 year ago

Okay, after a bit of futzing and cussing, I was able to get things working on my end with the 0.3.5 rollback!

I think Circle was holding on to a previous attempt where I tried to use sudo for the apt install or something.

The issue in #27 still persists but I can at least unblock my pipelines now!

jasonbosco commented 1 year ago

Phew glad to hear that! 😅

I'll keep you posted on the other one.