typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Error when running docsearch-scraper #30

Closed: arrondev closed this issue 1 year ago

arrondev commented 1 year ago

Description

I am trying to run the docsearch-scraper for Docusaurus, following the provided instructions. I am running into the error below and can't tell what is wrong. Can someone point me to what might be causing it?

docker run -it --env-file=./.env -e "CONFIG=$(cat ./config.json | jq -r tostring)" typesense/docsearch-scraper

INFO:scrapy.utils.log:Scrapy 2.8.0 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.35
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter',
 'LOG_ENABLED': '1',
 'LOG_LEVEL': 'ERROR',
 'TELNETCONSOLE_ENABLED': False,
 'USER_AGENT': 'Algolia DocSearch Crawler'}
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:89: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden '__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method.
  warn(

WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:53: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  self.fingerprinter = fingerprinter or RequestFingerprinter()

INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
ERROR:test:Failure without response can't start new thread
2023-03-27 17:13:48 [test] ERROR: Failure without response can't start new thread
ERROR:test:Failure without response can't start new thread
2023-03-27 17:13:48 [test] ERROR: Failure without response can't start new thread
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/builtins.RuntimeError': 2,
 'downloader/request_bytes': 437,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'elapsed_time_seconds': 0.218713,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 3, 27, 17, 13, 48, 339798),
 'log_count/ERROR': 2,
 'memusage/max': 56217600,
 'memusage/startup': 56217600,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2023, 3, 27, 17, 13, 48, 121085)}
INFO:scrapy.core.engine:Spider closed (finished)
Unhandled Error
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 501, in fireEvent
    DeferredList(beforeResults).addCallback(self._continueFiring)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 532, in addCallback
    return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 512, in addCallbacks
    self._runCallbacks()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
--- <exception caught here> ---
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 513, in _continueFiring
    callable(*args, **kwargs)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 1082, in _stopThreadPool
    self.threadpool.stop()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/python/threadpool.py", line 275, in stop
    thread.join()
  File "/usr/lib/python3.10/threading.py", line 1091, in join
    raise RuntimeError("cannot join thread before it is started")
builtins.RuntimeError: cannot join thread before it is started

CRITICAL:twisted:Unhandled Error
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 501, in fireEvent
    DeferredList(beforeResults).addCallback(self._continueFiring)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 532, in addCallback
    return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 512, in addCallbacks
    self._runCallbacks()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
--- <exception caught here> ---
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 513, in _continueFiring
    callable(*args, **kwargs)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 1082, in _stopThreadPool
    self.threadpool.stop()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/python/threadpool.py", line 275, in stop
    thread.join()
  File "/usr/lib/python3.10/threading.py", line 1091, in join
    raise RuntimeError("cannot join thread before it is started")
builtins.RuntimeError: cannot join thread before it is started

2023-03-27 17:13:48 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 501, in fireEvent
    DeferredList(beforeResults).addCallback(self._continueFiring)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 532, in addCallback
    return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 512, in addCallbacks
    self._runCallbacks()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
--- <exception caught here> ---
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 513, in _continueFiring
    callable(*args, **kwargs)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/base.py", line 1082, in _stopThreadPool
    self.threadpool.stop()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/python/threadpool.py", line 275, in stop
    thread.join()
  File "/usr/lib/python3.10/threading.py", line 1091, in join
    raise RuntimeError("cannot join thread before it is started")
builtins.RuntimeError: cannot join thread before it is started

Metadata

Typesense Version: latest -- installed yesterday

OS: CentOS 8

jasonbosco commented 1 year ago

Are you running on an ARM architecture CPU by any chance? If so, could you try on an AMD/Intel CPU?

arrondev commented 1 year ago

Hi, I am running it on x86_64 so definitely not ARM architecture.

jasonbosco commented 1 year ago

Could you share the contents of the scraper config file (config.json)?
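
For context, the `docker run` command in the original report passes the config as a single line by piping `config.json` through `jq -r tostring`. Before sharing or using the file, it can help to confirm that it parses at all. The following is a minimal sketch of a roughly equivalent check in Python; `compact_json` is a hypothetical helper, the sample values are made up for illustration, and the output may differ from jq's in small details such as number formatting.

```python
import json


def compact_json(text: str) -> str:
    """Parse JSON text and return it re-serialized on a single line.

    Raises json.JSONDecodeError (a ValueError subclass) if the text is not valid JSON.
    """
    data = json.loads(text)
    # No spaces after separators; keep non-ASCII characters unescaped.
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical sample; replace with the contents of your own config.json.
    sample = '{\n  "index_name": "docs",\n  "start_urls": ["https://example.com/"]\n}'
    print(compact_json(sample))
```

If parsing fails here, the config file itself is malformed and the scraper would not receive a usable `CONFIG` value.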

arrondev commented 1 year ago

I updated Docker Engine and the scraper now runs fine. Sorry, I didn't think Docker Engine could have been the problem.
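
For anyone who hits the same `can't start new thread` error: the thread does not confirm the root cause, but one commonly reported explanation is that an older Docker Engine's default seccomp profile blocks a system call that newer glibc versions (the log shows glibc 2.35) use to create threads. Under that assumption, a quick way to tell whether the container environment rather than the scraper is at fault is to try starting a single thread. The sketch below uses only the Python standard library; `can_start_thread` is a hypothetical helper name.

```python
import threading


def can_start_thread() -> bool:
    """Return True if this process is able to start and run a new thread."""
    ran = threading.Event()
    worker = threading.Thread(target=ran.set)
    try:
        worker.start()
    except RuntimeError as exc:
        # CPython raises RuntimeError("can't start new thread") when
        # the operating system refuses to create the thread.
        print(f"thread creation failed: {exc}")
        return False
    worker.join(timeout=5)
    return ran.is_set()


if __name__ == "__main__":
    print("threads OK" if can_start_thread() else "threads blocked")
```

If this reports a failure inside the container but succeeds on the host, the container runtime is the likely culprit, which would be consistent with the Docker Engine update resolving the issue.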

jasonbosco commented 1 year ago

Good to know!