typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Connection was refused by other side running scraper via docker #49

Closed noghartt closed 9 months ago

noghartt commented 9 months ago

Description

I'm trying to run the Typesense DocSearch Scraper against a local Docusaurus build (http://localhost:3000), but I'm hitting an issue that seems related to Scrapy:

DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://localhost:3000> (failed 1 times): Connection was refused by other side: 111: Connection refused.

Steps to reproduce

Run the command below.

Expected Behavior

Running the command should start the scraper and index the pages into the DB:

docker run -it --env-file=.env -e "CONFIG=$(cat config.json | jq -r tostring)" typesense/docsearch-scraper:0.9.1

The env file is:

TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=host.docker.internal
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
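
Side note: one way to sanity-check the networking here — assuming a Docker Desktop or OrbStack setup where host.docker.internal resolves to the host — is to curl Typesense's health endpoint from a throwaway container:

docker run --rm curlimages/curl -s http://host.docker.internal:8108/health
# prints {"ok":true} when Typesense on the host is reachable from inside a container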

Actual Behavior

Facing this error message:

INFO:scrapy.utils.log:Scrapy 2.9.0 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Linux-6.5.7-orbstack-00109-gd8500ae6683d-x86_64-with-glibc2.35
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter',
 'LOG_ENABLED': '1',
 'LOG_LEVEL': 'ERROR',
 'TELNETCONSOLE_ENABLED': False,
 'USER_AGENT': 'Typesense DocSearch Scraper (Bot; '
               'https://typesense.org/docs/guide/docsearch.html)'}
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:89: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden '__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method.
  warn(

WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:53: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  self.fingerprinter = fingerprinter or RequestFingerprinter()

INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/spidermiddlewares/offsite.py:80: PortWarning: allowed_domains accepts only domains without ports. Ignoring entry localhost:3000 in allowed_domains.
  warnings.warn(message, PortWarning)

DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://localhost:3000/sitemap.xml> (failed 1 times): Connection was refused by other side: 111: Connection refused.
DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://localhost:3000> (failed 1 times): Connection was refused by other side: 111: Connection refused.
DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://localhost:3000> (failed 2 times): Connection was refused by other side: 111: Connection refused.
DEBUG:scrapy.downloadermiddlewares.retry:Retrying <GET http://localhost:3000/sitemap.xml> (failed 2 times): Connection was refused by other side: 111: Connection refused.
ERROR:scrapy.downloadermiddlewares.retry:Gave up retrying <GET http://localhost:3000/sitemap.xml> (failed 3 times): Connection was refused by other side: 111: Connection refused.
2023-10-28 01:37:05 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://localhost:3000/sitemap.xml> (failed 3 times): Connection was refused by other side: 111: Connection refused.
ERROR:scrapy.downloadermiddlewares.retry:Gave up retrying <GET http://localhost:3000> (failed 3 times): Connection was refused by other side: 111: Connection refused.
2023-10-28 01:37:05 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://localhost:3000> (failed 3 times): Connection was refused by other side: 111: Connection refused.
ERROR:woovi-devdocs-1:Failure without response Connection was refused by other side: 111: Connection refused.
2023-10-28 01:37:05 [woovi-devdocs-1] ERROR: Failure without response Connection was refused by other side: 111: Connection refused.
ERROR:woovi-devdocs-1:Failure without response Connection was refused by other side: 111: Connection refused.
2023-10-28 01:37:05 [woovi-devdocs-1] ERROR: Failure without response Connection was refused by other side: 111: Connection refused.
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 6,
 'downloader/request_bytes': 1575,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'elapsed_time_seconds': 1.088754,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 10, 28, 1, 37, 5, 168615),
 'log_count/ERROR': 4,
 'memusage/max': 83546112,
 'memusage/startup': 83546112,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 4,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2023, 10, 28, 1, 37, 4, 79861)}
INFO:scrapy.core.engine:Spider closed (finished)

Crawling issue: nbHits 0 for woovi-devdocs-1
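
For reference, the empty collection can also be confirmed directly against the Typesense API (API key and port taken from the .env file above); num_documents should be 0 after this failed run:

curl -s -H "X-TYPESENSE-API-KEY: xyz" \
  http://localhost:8108/collections/woovi-devdocs-1 | jq .num_documents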

Metadata

Docusaurus Scraper Config file

{
  "index_name": "woovi-devdocs-1",
  "start_urls": [
    "http://localhost:3000"
  ],
  "sitemap_urls": [
    "http://localhost:3000/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [
    "/tests"
  ],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1, article h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 42650
}

Typesense Version:

0.25.1

Typesense Scraper Version:

0.9.1

OS:

macOS Sonoma

wanderanimrod commented 9 months ago

@noghartt ,

I see you are using localhost in your start_urls. That doesn't work when you run the scraper inside a container: inside the container, localhost points to the container's own network, and nothing is listening on port 3000 there. That's why you are getting connection refused.

What you should do instead is use the special Docker hostname that points to the host network. On macOS, it is host.docker.internal. So your start_urls should be:

"start_urls": [
    "http://host.docker.internal:3000"
  ],
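
Note that the same substitution likely needs to happen in your sitemap_urls as well — the log above shows http://localhost:3000/sitemap.xml being refused too:

"sitemap_urls": [
    "http://host.docker.internal:3000/sitemap.xml"
  ],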
wanderanimrod commented 9 months ago

But since you are crawling on a non-standard port, the crawling functionality might not work as expected. You might need to run your site server on port 80 on localhost. See #50 for details.
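
One way to do that — just a sketch, assuming the built site sits in Docusaurus's default build/ directory — is to serve it on port 80 with nginx and drop the port from the config:

docker run --rm -p 80:80 \
  -v "$(pwd)/build:/usr/share/nginx/html:ro" \
  nginx:alpine

# config.json then becomes:
# "start_urls": ["http://host.docker.internal"]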

noghartt commented 9 months ago

Hey, @wanderanimrod! It works!

I appreciate your help, thanks!