typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

RuntimeError("cannot join thread before it is started") #25

Open dtlhlbs opened 1 year ago

dtlhlbs commented 1 year ago

Description

New and old CI jobs running Docker image typesense/docsearch-scraper are failing with RuntimeError("cannot join thread before it is started")

This also breaks old jobs that previously ran fine, so I think the cause is the implicit use of typesense/docsearch-scraper:latest, which now points to the 0.4.0 image that was just published. Since there's no tag for the previous release, can we get a typesense/docsearch-scraper:0.3.4 tag so I can pin to that?

Steps to reproduce

I am running the following command in GitLab CI, from a container based on Alpine:

docker run --network container:tmpdocs-$CI_JOB_ID --name scraper-$CI_JOB_ID -d --env-file=deploy/typesense.env -e "CONFIG=$(cat deploy/DocSearch.config.json | jq -r tostring)" typesense/docsearch-scraper:0.3.4
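
If a mutable tag still feels risky, the same job could also pin to an immutable image digest. A minimal sketch, assuming a 0.3.4 tag is published as requested above (the <digest> value is a placeholder to be filled in from the inspect output):

docker pull typesense/docsearch-scraper:0.3.4
# Look up the immutable digest for the pulled image
docker inspect --format '{{index .RepoDigests 0}}' typesense/docsearch-scraper:0.3.4
# Prints something like typesense/docsearch-scraper@sha256:<digest>; use that reference below
docker run --network container:tmpdocs-$CI_JOB_ID --name scraper-$CI_JOB_ID -d \
  --env-file=deploy/typesense.env \
  -e "CONFIG=$(cat deploy/DocSearch.config.json | jq -r tostring)" \
  typesense/docsearch-scraper@sha256:<digest>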

Expected Behavior

The scraper indexes the site running on the same host.

Actual Behavior

The scraper errors out; the last error is RuntimeError("cannot join thread before it is started").

Metadata

Typesense Version: 'latest'

OS: Alpine Linux v3.12

Attachments: docsearch-scraper.log, DocSearch.config.json.txt

jasonbosco commented 1 year ago

@dtlhlbs I can't seem to replicate this; the scraper runs successfully on the Typesense docs site, for example.

Here's the docker tag for the previous version of the scraper that you should be able to pin to, until we get to the bottom of this issue: ~da0868e8ca7e2232abdd748d3fd808e2d338e33ab39229acee2990569489fa97~ See this

jasonbosco commented 1 year ago

I can't seem to replicate this. I spun up a brand new VM (Intel CPU, Amazon Linux), installed Docker on it, and then ran the scraper on the Typesense docs site, and it worked fine for me:

$ docker run -it --env-file=.env -e CONFIG="{\"index_name\":\"typesense_docs\",\"start_urls\":[{\"url\":\"https://typesense.org/docs/(?P<version>.*?)/\",\"variables\":{\"version\":[\"0.24.0\",\"0.23.1\",\"0.23.0\",\"0.22.2\",\"0.22.1\",\"0.22.0\",\"0.21.0\",\"0.20.0\",\"0.19.0\",\"0.18.0\",\"0.17.0\",\"0.16.1\",\"0.16.0\",\"0.15.0\",\"0.14.0\",\"0.13.0\",\"0.12.0\",\"0.11.2\"]}},{\"url\":\"https://typesense.org/docs/overview/\"},{\"url\":\"https://typesense.org/docs/guide/\"}],\"selectors\":{\"default\":{\"lvl0\":\".content__default h1\",\"lvl1\":\".content__default h2\",\"lvl2\":\".content__default h3\",\"lvl3\":\".content__default h4\",\"lvl4\":\".content__default h5\",\"text\":\".content__default p, .content__default ul li, .content__default table tbody tr\"}},\"scrape_start_urls\":false,\"strip_chars\":\" .,;:#\"}" typesense/docsearch-scraper
Unable to find image 'typesense/docsearch-scraper:latest' locally
latest: Pulling from typesense/docsearch-scraper
677076032cca: Pull complete
3026efbcce37: Pull complete
b83c999f3ae6: Pull complete
4f4fb700ef54: Pull complete
4d02e570415e: Pull complete
fe9dd39ad932: Pull complete
40bdd8cbcb60: Pull complete
330e95c637fc: Pull complete
1c4235bc81bd: Pull complete
f636e29df4a6: Pull complete
2ee46e1d6efd: Pull complete
f2a90558593e: Pull complete
f7cb19d7ba62: Pull complete
b51fd8a46836: Pull complete
72e3879aa441: Pull complete
b656e2665916: Pull complete
95462c1394e2: Pull complete
0a6c9231c464: Pull complete
02b4a1743fdf: Pull complete
fcb6abf81668: Pull complete
066a7661e7fb: Pull complete
b1349c66a67d: Pull complete
cb04953d313a: Pull complete
83cfbae1faa8: Pull complete
4aa2727acdc6: Pull complete
Digest: sha256:ffce60fae1358cfe8ba8a59a50b24dfd835610e543b5fbadba5a84541f7e8b2f
Status: Downloaded newer image for typesense/docsearch-scraper:latest
INFO:scrapy.utils.log:Scrapy 2.8.0 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Linux-5.10.165-143.735.amzn2.x86_64-x86_64-with-glibc2.35
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter',
 'LOG_ENABLED': '1',
 'LOG_LEVEL': 'ERROR',
 'TELNETCONSOLE_ENABLED': False,
 'USER_AGENT': 'Algolia DocSearch Crawler'}
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:89: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden '__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method.
  warn(

WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:53: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  self.fingerprinter = fingerprinter or RequestFingerprinter()

INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.24.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.21.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.2/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.18.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.19.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.17.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.15.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.14.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.13.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.12.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/guide/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.11.2/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/overview/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.20.0/> (referer: None)
DEBUG:scrapy.dupefilters:Filtered duplicate request: <GET https://typesense.org/docs/0.24.0/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.0/api/> (referer: https://typesense.org/docs/0.23.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.1/api/> (referer: https://typesense.org/docs/0.22.1/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.18.0/api/> (referer: https://typesense.org/docs/0.18.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.2/api/> (referer: https://typesense.org/docs/0.22.2/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.21.0/api/> (referer: https://typesense.org/docs/0.21.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.1/api/> (referer: https://typesense.org/docs/0.23.1/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.24.0/api/> (referer: https://typesense.org/docs/0.24.0/)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.23.0/api/ 54 records)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.12.0/api/> (referer: https://typesense.org/docs/0.12.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.19.0/api/> (referer: https://typesense.org/docs/0.19.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.13.0/api/> (referer: https://typesense.org/docs/0.13.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.14.0/api/> (referer: https://typesense.org/docs/0.14.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.15.0/api/> (referer: https://typesense.org/docs/0.15.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.0/api/> (referer: https://typesense.org/docs/0.16.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.0/api/> (referer: https://typesense.org/docs/0.22.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.1/api/> (referer: https://typesense.org/docs/0.16.1/)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.22.1/api/ 51 records)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.18.0/api/ 6 records)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.22.2/api/ 55 records)
.
.
.
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/request_bytes': 77769,
 'downloader/request_count': 277,
 'downloader/request_method_count/GET': 277,
 'downloader/response_bytes': 2140857,
 'downloader/response_count': 277,
 'downloader/response_status_count/200': 276,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 10089,
 'elapsed_time_seconds': 453.931499,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 2, 22, 16, 54, 6, 607301),
 'httpcompression/response_bytes': 11532228,
 'httpcompression/response_count': 277,
 'memusage/max': 120295424,
 'memusage/startup': 68550656,
 'request_depth_max': 3,
 'response_received_count': 277,
 'scheduler/dequeued': 277,
 'scheduler/dequeued/memory': 277,
 'scheduler/enqueued': 277,
 'scheduler/enqueued/memory': 277,
 'start_time': datetime.datetime(2023, 2, 22, 16, 46, 32, 675802)}
INFO:scrapy.core.engine:Spider closed (finished)

DEBUG:typesense.api_call:Making get /aliases/typesense_docs
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "GET /aliases/typesense_docs HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making put /aliases/typesense_docs
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "PUT /aliases/typesense_docs HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making delete /collections/typesense_docs_1677081767
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "DELETE /collections/typesense_docs_1677081767 HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
Nb hits: 9097
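
For reference, the inline CONFIG in the command above is the following config file written out; it can be passed to the container the same way the original report does with jq. This is just a readability sketch, and docsearch.config.json is an illustrative filename:

cat > docsearch.config.json <<'EOF'
{
  "index_name": "typesense_docs",
  "start_urls": [
    {
      "url": "https://typesense.org/docs/(?P<version>.*?)/",
      "variables": {
        "version": ["0.24.0", "0.23.1", "0.23.0", "0.22.2", "0.22.1", "0.22.0",
                    "0.21.0", "0.20.0", "0.19.0", "0.18.0", "0.17.0", "0.16.1",
                    "0.16.0", "0.15.0", "0.14.0", "0.13.0", "0.12.0", "0.11.2"]
      }
    },
    { "url": "https://typesense.org/docs/overview/" },
    { "url": "https://typesense.org/docs/guide/" }
  ],
  "selectors": {
    "default": {
      "lvl0": ".content__default h1",
      "lvl1": ".content__default h2",
      "lvl2": ".content__default h3",
      "lvl3": ".content__default h4",
      "lvl4": ".content__default h5",
      "text": ".content__default p, .content__default ul li, .content__default table tbody tr"
    }
  },
  "scrape_start_urls": false,
  "strip_chars": " .,;:#"
}
EOF

# Equivalent to the inline CONFIG above: compact the JSON with jq and pass it as an env var
docker run -it --env-file=.env \
  -e "CONFIG=$(cat docsearch.config.json | jq -r tostring)" \
  typesense/docsearch-scraper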
dtlhlbs commented 1 year ago

@jasonbosco Thanks for looking at this. I have managed to build a 0.3.4 image and push it to our own registry, and that has got us up and running again. I couldn't pull the image reference you mentioned; it failed with manifest not found / manifest unknown.

I think it's likely something related to the CI environment. It runs Docker-in-Docker: it builds a Docusaurus container that includes Typesense, runs that container, then spins up the scraper to scrape the Docusaurus site, before imaging the results and deploying the Docusaurus container with the updated index.

I'd have to recreate this environment and the problem, maybe via docker compose (the rough flow is sketched below), and send that through. Until then, maybe we wait and see if anyone else hits this error.

Increasing the CPU and memory capacity of the host didn't help.
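
To make the failing environment easier to reproduce, here is a rough shell sketch of the pipeline described above. The image names, registry, and the docker commit step are assumptions about how the "imaging" is done, not the actual job definition:

# Build and run the Docusaurus + Typesense container (names are illustrative)
docker build -t tmpdocs-image deploy/
docker run -d --name tmpdocs-$CI_JOB_ID tmpdocs-image

# Scrape the docs site over the shared container network, as in the original command
docker run --network container:tmpdocs-$CI_JOB_ID --name scraper-$CI_JOB_ID -d \
  --env-file=deploy/typesense.env \
  -e "CONFIG=$(cat deploy/DocSearch.config.json | jq -r tostring)" \
  typesense/docsearch-scraper:0.3.4
docker wait scraper-$CI_JOB_ID   # block until the scraper container exits

# Capture the container with the updated index and ship it (assumed "imaging" step)
docker commit tmpdocs-$CI_JOB_ID registry.example.com/docs-with-index:latest
docker push registry.example.com/docs-with-index:latest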

jasonbosco commented 1 year ago

I see, thank you for that additional context.

Could you check if this build works: https://github.com/typesense/typesense-docsearch-scraper/issues/28#issuecomment-1440908978?

dtlhlbs commented 1 year ago

@jasonbosco yes, typesense/docsearch-scraper:0.3.5 is working for me thanks :)
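
For anyone else hitting this, that makes the command from the original report, with the tag bumped to the working build:

docker run --network container:tmpdocs-$CI_JOB_ID --name scraper-$CI_JOB_ID -d --env-file=deploy/typesense.env -e "CONFIG=$(cat deploy/DocSearch.config.json | jq -r tostring)" typesense/docsearch-scraper:0.3.5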

jasonbosco commented 1 year ago

May I know what version of Docker engine you're using?

dtlhlbs commented 1 year ago

Docker version 20.10.9, build c2ea9bc

Markeli commented 1 year ago

I've got the same error. After upgrading Docker to the latest version, the error went away.

Docker version before update: 18.09.1, build 4c52b90
Docker version after update: 23.0.2, build 569dd73
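
Since both reports above point at the Docker engine version, a quick check to run on the CI host before the scraper (standard Docker CLI):

docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'
# The error was reported above on engine 18.09.1 and 20.10.9, and one reporter
# saw it disappear after upgrading the engine to 23.0.2.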