hoarder-app / hoarder

A self-hostable bookmark-everything app (links, notes and images) with AI-based automatic tagging and full text search
https://hoarder.app
GNU Affero General Public License v3.0

Unable to crawl any site #331

[Open] djl0 opened this issue 1 month ago

djl0 commented 1 month ago

I'm creating a new issue after I originally (incorrectly) thought this was related to https://github.com/hoarder-app/hoarder/issues/327.

When adding a bookmark, either from the web UI or the CLI, nothing is fetched from the target site. Looking at the container logs, the worker container can't connect to the chrome container:

2024-07-27T21:45:35.867Z info: [Crawler][6] Attempting to determine the content-type for the url https://core.telegram.org/api/
2024-07-27T21:45:36.609Z info: [Crawler][6] Content-type for the url https://core.telegram.org/api/ is "text/html; charset=utf-8"
2024-07-27T21:45:36.666Z info: [Crawler][6] Will crawl "https://core.telegram.org/api/" for link with id "hgfhadzqyrs96i1z1l1zmv2t"
2024-07-27T21:45:36.666Z info: [Crawler][6] Attempting to determine the content-type for the url https://core.telegram.org/api/
2024-07-27T21:45:37.316Z info: [Crawler][6] Content-type for the url https://core.telegram.org/api/ is "text/html; charset=utf-8"
2024-07-27T21:45:37.317Z error: [Crawler][6] Crawling job failed: AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  logger.info(
  `[Crawler][${jobId}] Successfully navigated to "${url}". Waiting for the page to load ...`,
  )

2024-07-27T21:45:38.527Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-07-27T21:45:38.528Z info: [Crawler] Successfully resolved IP address, new address: http://172.18.0.5:9222/
2024-07-27T21:45:48.531Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs

The chrome container logs an error about dbus; I'm not sure how important that is. It's discussed as an issue in another project, but `apt` isn't available inside the container, so I couldn't try installing dbus.

[0727/213832.985728:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0727/213833.009490:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0727/213833.009634:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0727/213833.231216:WARNING:sandbox_linux.cc(420)] InitializeSandbox() called with multiple threads in process gpu-process.
[0727/213833.303223:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0727/213833.303483:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
[0727/213833.347854:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.

DevTools listening on ws://0.0.0.0:9222/devtools/browser/d3983fdb-6ddf-4b23-901b-226e2fa783ce

Any guidance would be greatly appreciated!
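A quick connectivity check can narrow down whether this is a networking problem between the two containers. This sketch assumes the default compose service names (`workers`, `chrome`) and the default debugging port 9222; adjust to match your compose file:

```shell
# Run from the host: ask Chrome's DevTools HTTP endpoint for its version
# from inside the worker container. Chrome rejects requests whose Host
# header is not localhost or an IP address, so spoof the header.
docker compose exec workers \
  curl -s -H "Host: localhost" http://chrome:9222/json/version
```

If this prints a JSON blob with a `webSocketDebuggerUrl`, the worker can reach Chrome and the problem is elsewhere; if it hangs or errors, the two containers are not on the same network or the service name doesn't resolve.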

MohamedBassem commented 1 month ago

Ok, now it's yet another error.

2024-07-27T21:45:48.531Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs

Can you share your compose file? Did you rename the chrome container name?
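For comparison, the chrome service in the project's default compose file looks roughly like the sketch below (paraphrased from memory; verify against the repo's `docker-compose.yml`). The service name must match the host in `BROWSER_WEB_URL`:

```yaml
# Assumed sketch of the default chrome service; check the repo's
# docker-compose.yml for the authoritative version.
services:
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
```

If the service was renamed, the worker's `BROWSER_WEB_URL` environment variable (e.g. `http://chrome:9222`) has to be updated to match.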

hongruilin commented 1 month ago

> [quotes @djl0's original report and logs in full; see above]

Has your issue been resolved? I encountered the same problem when deploying with Docker on Windows; even changing the chrome container name didn't solve it. Currently, I can only run Hoarder normally on Linux.

Antebios commented 1 month ago

I am having a very similar issue. I am using the Kubernetes templates provided by this project. I have my bookmarks imported, but they cannot be crawled:

2024-08-02T17:26:24.210Z info: [Crawler][8239] Will crawl "http://forum.xda-developers.com/showthread.php?t=838448" for link with id "zbqya9md0clzrahb6vls2qda"
2024-08-02T17:26:24.210Z info: [Crawler][8239] Attempting to determine the content-type for the url http://forum.xda-developers.com/showthread.php?t=838448
2024-08-02T17:26:24.727Z info: [Crawler][8239] Content-type for the url http://forum.xda-developers.com/showthread.php?t=838448 is "text/html; charset=utf-8"
2024-08-02T17:26:24.729Z error: [Crawler][8239] Crawling job failed: AssertionError [ERR_ASSERTION]: undefined == true

The chrome container logs are these:

[0802/171801.792156:INFO:policy_logger.cc(145)] :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0802/171801.792183:INFO:policy_logger.cc(145)] :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended

jacobsandersen commented 1 month ago

I am experiencing the same issue.

MohamedBassem commented 3 weeks ago

For anyone facing crawling issues on Kubernetes: the chrome Service was apparently missing from the templates, and it's now fixed in https://github.com/hoarder-app/hoarder/pull/358/files

cc. @Antebios
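The fix boils down to the worker not being able to resolve the name `chrome` in-cluster because no Service exposed the Chrome deployment. A minimal sketch of such a Service (hypothetical manifest; the `app: chrome` selector is assumed, see the linked PR for the actual one):

```yaml
# Hypothetical Kubernetes Service exposing the Chrome deployment's
# remote-debugging port under the in-cluster DNS name "chrome".
apiVersion: v1
kind: Service
metadata:
  name: chrome
spec:
  selector:
    app: chrome
  ports:
    - port: 9222
      targetPort: 9222
```

With this Service in place, `http://chrome:9222` resolves inside the cluster just as the container name does on a Docker Compose network.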