hoarder-app / hoarder

A self-hostable bookmark-everything app (links, notes and images) with AI-based automatic tagging and full text search
https://hoarder.app
GNU Affero General Public License v3.0

[Crawler] Failed to connect to the browser instance, will retry in 5 secs #248

Closed: francisafu closed this 1 month ago

francisafu commented 3 months ago

The workers continue to output error information, and the crawler doesn't work.

1. Workers' log:

2024-06-21T17:13:05.149Z info: Workers version: 0.14.0
2024-06-21T17:13:05.164Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:05.183Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:06.905Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:13:06.906Z info: Starting crawler worker ...
2024-06-21T17:13:06.908Z info: Starting inference worker ...
2024-06-21T17:13:06.908Z info: Starting search indexing worker ...
2024-06-21T17:13:11.907Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:11.908Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:13.510Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:13:18.512Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:18.513Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:20.065Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:13:25.067Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:25.067Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:26.662Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:13:31.663Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:31.664Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:33.265Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:13:38.265Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:38.267Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:39.878Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:13:44.878Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:44.879Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:46.436Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:13:51.439Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:51.439Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:53.037Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:13:58.039Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:13:58.040Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:13:59.594Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:14:04.595Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:14:04.596Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:14:06.190Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-06-21T17:14:11.191Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-06-21T17:14:11.192Z info: [Crawler] Successfully resolved IP address, new address: http://172.20.0.3:9222/
2024-06-21T17:14:12.734Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
...(the log continues like this indefinitely)
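The "Successfully resolved IP address, new address" lines suggest the worker rewrites the hostname in the configured `BROWSER_WEB_URL` to the resolved container IP before connecting. A minimal sketch of that rewrite (a hypothetical helper for illustration, not the actual hoarder code):

```typescript
import { URL } from "node:url";

// Hypothetical helper: swap the hostname in a browser URL for a resolved
// IP, mirroring the log line "new address: http://172.20.0.3:9222/".
function withResolvedIp(browserUrl: string, ip: string): string {
  const url = new URL(browserUrl);
  url.hostname = ip; // keep scheme and port, replace only the host
  return url.toString(); // URL normalization appends a trailing "/"
}

console.log(withResolvedIp("http://chrome:9222", "172.20.0.3"));
```

Since the resolution step succeeds every time, DNS inside the compose network is working; the failure happens on the subsequent connection attempt itself.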

2. Chrome's log:

[0621/171258.016381:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0621/171258.017399:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0621/171258.021242:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0621/171258.024240:WARNING:dns_config_service_linux.cc(427)] Failed to read DnsConfig.
[0621/171258.104721:INFO:policy_logger.cc(145)] :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0621/171258.104812:INFO:policy_logger.cc(145)] :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended

DevTools listening on ws://0.0.0.0:9222/devtools/browser/17f397b8-a977-4ace-80bc-6fca6b9b4b4a
[0621/171258.112522:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
[0621/171258.168899:WARNING:sandbox_linux.cc(418)] InitializeSandbox() called with multiple threads in process gpu-process.
[0621/171258.266562:WARNING:dns_config_service_linux.cc(427)] Failed to read DnsConfig.

3. Docker compose file (identical to the default, except for the port mapping):

version: "3.8"
services:
  web:
    image: ghcr.io/hoarder-app/hoarder-web:${HOARDER_VERSION:-release}
    restart: unless-stopped
    volumes:
      - data:/data
    ports:
      - 6600:3000
    env_file:
      - .env
    environment:
      REDIS_HOST: redis
      MEILI_ADDR: http://meilisearch:7700
      DATA_DIR: /data
  redis:
    image: redis:7.2-alpine
    restart: unless-stopped
    volumes:
      - redis:/data
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
  meilisearch:
    image: getmeili/meilisearch:v1.6
    restart: unless-stopped
    env_file:
      - .env
    environment:
      MEILI_NO_ANALYTICS: "true"
    volumes:
      - meilisearch:/meili_data
  workers:
    image: ghcr.io/hoarder-app/hoarder-workers:${HOARDER_VERSION:-release}
    restart: unless-stopped
    volumes:
      - data:/data
    env_file:
      - .env
    environment:
      REDIS_HOST: redis
      MEILI_ADDR: http://meilisearch:7700
      BROWSER_WEB_URL: http://chrome:9222
      DATA_DIR: /data
      # OPENAI_API_KEY: ...
    depends_on:
      web:
        condition: service_started

volumes:
  redis:
  meilisearch:
  data:
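One compose-level detail worth noting (a sketch, not a confirmed fix — the worker retries on its own anyway): `workers` only declares `depends_on: web`, so `chrome` may still be starting when the first connection attempt fires. Adding `chrome` to `depends_on` makes the startup ordering explicit:

```yaml
  workers:
    # ...other keys unchanged from the compose file above...
    depends_on:
      web:
        condition: service_started
      chrome:
        condition: service_started
```

This only guarantees container start order, not that DevTools is already accepting connections, so it cannot explain retries that keep failing for minutes.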

4. Environment:

MohamedBassem commented 3 months ago

That's interesting, because your configuration looks good to me. Let's start with the obvious suggestions, have you attempted to turn down the stack and turn it up again? :D

francisafu commented 3 months ago

That's interesting, because your configuration looks good to me. Let's start with the obvious suggestions, have you attempted to turn down the stack and turn it up again? :D

Yeah, of course. I've toggled it on and off and run compose up and down multiple times; it doesn't work, and the same problem still occurs.

francisafu commented 3 months ago

I forgot to post the .env file. Here it is:

HOARDER_VERSION=release
NEXTAUTH_SECRET=some_random_keys
MEILI_MASTER_KEY=some_other_random_keys
NEXTAUTH_URL=http://192.168.124.2:6600
MAX_ASSET_SIZE_MB=20480
OPENAI_API_KEY=fk**************
OPENAI_BASE_URL=https://*****.net
INFERENCE_LANG=chinese
dodying commented 2 months ago

After inserting console.log(e); into crawlerWorker.ts, I found that the worker downloads the adblocker's easylist rules from GitHub, and for some reason setting a proxy via environment variables had no effect. After pinning GitHub's IP address via the hosts file, it worked normally.
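For anyone wanting to try this hosts workaround without editing /etc/hosts inside the container, compose's `extra_hosts` can express it. A sketch using the IPs observed in this thread's ping output; pinned IPs go stale, so verify them before use:

```yaml
  workers:
    extra_hosts:
      - "github.com:140.82.112.4"
      - "raw.githubusercontent.com:185.199.111.133"
```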

francisafu commented 2 months ago

After inserting console.log(e); into crawlerWorker.ts, I found that the worker downloads the adblocker's easylist rules from GitHub, and for some reason setting a proxy via environment variables had no effect. After pinning GitHub's IP address via the hosts file, it worked normally.

Well, it doesn't seem like a network problem. I tried the fix you described, but the problem still exists.

Here is the network connectivity check:

/app/apps/workers # ping github.com
PING github.com (140.82.112.4): 56 data bytes
64 bytes from 140.82.112.4: seq=0 ttl=47 time=252.627 ms
64 bytes from 140.82.112.4: seq=1 ttl=47 time=252.859 ms
64 bytes from 140.82.112.4: seq=2 ttl=47 time=252.153 ms
64 bytes from 140.82.112.4: seq=3 ttl=47 time=252.746 ms
64 bytes from 140.82.112.4: seq=4 ttl=47 time=252.870 ms
^C
--- github.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 252.153/252.651/252.870 ms
/app/apps/workers # ping raw.githubusercontent.com
PING raw.githubusercontent.com (185.199.111.133): 56 data bytes
64 bytes from 185.199.111.133: seq=0 ttl=54 time=111.197 ms
64 bytes from 185.199.111.133: seq=1 ttl=54 time=110.841 ms
64 bytes from 185.199.111.133: seq=4 ttl=54 time=112.224 ms
64 bytes from 185.199.111.133: seq=5 ttl=54 time=113.838 ms
64 bytes from 185.199.111.133: seq=6 ttl=54 time=111.442 ms
^C
--- raw.githubusercontent.com ping statistics ---
7 packets transmitted, 5 packets received, 28% packet loss
round-trip min/avg/max = 110.841/111.908/113.838 ms
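As a side note, ping's loss figure is just truncating integer arithmetic over sent and received packet counts, which matches the numbers shown above (6 sent / 5 received gives 16%, 7 sent / 5 received gives 28%):

```typescript
// Packet loss as ping reports it: floor((sent - received) / sent * 100).
function packetLossPct(sent: number, received: number): number {
  return Math.floor(((sent - received) / sent) * 100);
}

console.log(packetLossPct(6, 5)); // github.com run above
console.log(packetLossPct(7, 5)); // raw.githubusercontent.com run above
```

The reported loss here is likely just the packets in flight when Ctrl-C was pressed, so the connectivity itself looks fine.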

Afterwards, I tried to fetch a link; here's the log:

2024-07-18T09:41:07.591Z info: [Crawler][17] Will crawl "https://www.baidu.com/" for link with id "atrqsg02v8ugw7fwehlygwh6"
2024-07-18T09:41:07.591Z info: [Crawler][17] Attempting to determine the content-type for the url https://www.baidu.com/
2024-07-18T09:41:07.683Z info: [Crawler][17] Content-type for the url https://www.baidu.com/ is "text/html"
2024-07-18T09:41:07.684Z error: [Crawler][17] Crawling job failed: AssertionError [ERR_ASSERTION]: undefined == true
2024-07-18T09:41:08.736Z info: [Crawler][17] Will crawl "https://www.baidu.com/" for link with id "atrqsg02v8ugw7fwehlygwh6"
2024-07-18T09:41:08.736Z info: [Crawler][17] Attempting to determine the content-type for the url https://www.baidu.com/
2024-07-18T09:41:09.860Z info: [Crawler][17] Content-type for the url https://www.baidu.com/ is "text/html"
2024-07-18T09:41:09.861Z error: [Crawler][17] Crawling job failed: AssertionError [ERR_ASSERTION]: undefined == true
2024-07-18T09:41:11.945Z info: [Crawler][17] Will crawl "https://www.baidu.com/" for link with id "atrqsg02v8ugw7fwehlygwh6"
2024-07-18T09:41:11.945Z info: [Crawler][17] Attempting to determine the content-type for the url https://www.baidu.com/
2024-07-18T09:41:12.025Z info: [Crawler][17] Content-type for the url https://www.baidu.com/ is "text/html"
2024-07-18T09:41:12.027Z error: [Crawler][17] Crawling job failed: AssertionError [ERR_ASSERTION]: undefined == true
2024-07-18T09:41:16.058Z info: [Crawler][17] Will crawl "https://www.baidu.com/" for link with id "atrqsg02v8ugw7fwehlygwh6"
2024-07-18T09:41:16.058Z info: [Crawler][17] Attempting to determine the content-type for the url https://www.baidu.com/
2024-07-18T09:41:16.149Z info: [Crawler][17] Content-type for the url https://www.baidu.com/ is "text/html"
2024-07-18T09:41:16.151Z error: [Crawler][17] Crawling job failed: AssertionError [ERR_ASSERTION]: undefined == true
2024-07-18T09:41:24.181Z info: [Crawler][17] Will crawl "https://www.baidu.com/" for link with id "atrqsg02v8ugw7fwehlygwh6"
2024-07-18T09:41:24.181Z info: [Crawler][17] Attempting to determine the content-type for the url https://www.baidu.com/
2024-07-18T09:41:24.271Z info: [Crawler][17] Content-type for the url https://www.baidu.com/ is "text/html"
2024-07-18T09:41:24.272Z error: [Crawler][17] Crawling job failed: AssertionError [ERR_ASSERTION]: undefined == true
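The gaps between attempts above grow roughly as 1 s, 2 s, 4 s, 8 s, which looks like exponential backoff between job retries. A sketch of that schedule (my reading of the timestamps, not hoarder's actual queue configuration):

```typescript
// Sketch: exponential backoff delays consistent with the retry
// timestamps in the log above (1 s, 2 s, 4 s, 8 s). The base delay and
// attempt count are read off the log, not taken from hoarder's code.
function backoffDelaysMs(attempts: number, baseMs = 1000): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}

console.log(backoffDelaysMs(4)); // delays between the five attempts shown
```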
francisafu commented 1 month ago

Similar to #331; closing this issue.