URL seems to get hoarder stuck in a crawl loop on v0.18.0

dimatx commented 1 week ago

I seem to have a URL that gets hoarder stuck in a loop where it tries to crawl, then recrawls, etc. It only stops when I delete the bookmark.

Please let me know if you need any more info than what I provided.

2024-10-14T16:50:49.322Z info: [search][890] Attempting to index bookmark with id q10606sx8ev4xhstqbjw5gaq ...
2024-10-14T16:50:49.340Z info: [inference][889] Starting an inference job for bookmark with id "q10606sx8ev4xhstqbjw5gaq"
2024-10-14T16:50:49.511Z info: [search][890] Completed successfully
2024-10-14T16:50:50.949Z info: [inference][889] Inferring tag for bookmark "q10606sx8ev4xhstqbjw5gaq" used 2936 tokens and inferred: LineageOS,Lenovo ThinkSmart View,Home Automation,Android Installation,Open Source
2024-10-14T16:50:51.001Z info: [inference][889] Completed successfully
2024-10-14T16:50:51.556Z info: [search][891] Attempting to index bookmark with id q10606sx8ev4xhstqbjw5gaq ...
2024-10-14T16:50:51.673Z info: [search][891] Completed successfully
2024-10-14T16:54:23.815Z info: [Crawler][892] Will crawl "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/" for link with id "k1l8zj5ixpgj9hugbvmibqfc"
2024-10-14T16:54:23.815Z info: [Crawler][892] Attempting to determine the content-type for the url https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/
2024-10-14T16:54:23.882Z info: [Crawler][892] Content-type for the url https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/ is "text/html; charset=UTF-8"
2024-10-14T16:54:23.907Z info: [search][893] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:54:23.975Z info: [search][893] Completed successfully
2024-10-14T16:54:26.944Z info: [Crawler][892] Successfully navigated to "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/". Waiting for the page to load ...
2024-10-14T16:54:29.602Z info: [Crawler][892] Finished waiting for the page to load.
2024-10-14T16:54:29.833Z info: [Crawler][892] Finished capturing page content and a screenshot. FullPageScreenshot: false
2024-10-14T16:54:29.842Z info: [Crawler][892] Will attempt to extract metadata from page ...
2024-10-14T16:54:30.699Z info: [Crawler][892] Will attempt to extract readable content ...
2024-10-14T16:54:31.384Z info: [Crawler][892] Done extracting readable content.
2024-10-14T16:54:31.396Z info: [Crawler][892] Stored the screenshot as assetId: 249e0a78-90b8-4e26-b051-17730c928aae
2024-10-14T16:54:31.443Z info: [Crawler][892] Done extracting metadata from the page.
2024-10-14T16:54:31.443Z info: [Crawler][892] Downloading image from "https://odsonfinance.com/wp-content/uploads/2024/01/How-to-do-a-Backdoor-Roth-IRA-1.png"
2024-10-14T16:54:31.553Z info: [Crawler][892] Downloaded image as assetId: c6acee98-fa08-4076-8f7c-8e05becff000
2024-10-14T16:54:31.612Z info: [Crawler][892] Completed successfully
2024-10-14T16:54:32.419Z info: [search][895] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:54:32.437Z info: [inference][894] Starting an inference job for bookmark with id "k1l8zj5ixpgj9hugbvmibqfc"
2024-10-14T16:54:32.554Z info: [search][895] Completed successfully
2024-10-14T16:54:33.930Z info: [inference][894] Inferring tag for bookmark "k1l8zj5ixpgj9hugbvmibqfc" used 2122 tokens and inferred: Roth IRA,Backdoor Roth,Fidelity,Personal Finance,Investing
2024-10-14T16:54:33.971Z info: [inference][894] Completed successfully
2024-10-14T16:54:34.587Z info: [search][896] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:54:34.652Z info: [search][896] Completed successfully
2024-10-14T16:55:03.684Z info: [Crawler][897] Will crawl "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/" for link with id "k1l8zj5ixpgj9hugbvmibqfc"
2024-10-14T16:55:03.684Z info: [Crawler][897] Attempting to determine the content-type for the url https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/
2024-10-14T16:55:03.757Z info: [Crawler][897] Content-type for the url https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/ is "text/html; charset=UTF-8"
2024-10-14T16:55:07.264Z info: [Crawler][897] Successfully navigated to "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/". Waiting for the page to load ...
2024-10-14T16:55:11.589Z info: [Crawler][897] Finished waiting for the page to load.
2024-10-14T16:55:11.803Z info: [Crawler][897] Finished capturing page content and a screenshot. FullPageScreenshot: false
2024-10-14T16:55:11.810Z info: [Crawler][897] Will attempt to extract metadata from page ...
2024-10-14T16:55:12.468Z info: [Crawler][897] Will attempt to extract readable content ...
2024-10-14T16:55:13.130Z info: [Crawler][897] Done extracting readable content.
2024-10-14T16:55:13.141Z info: [Crawler][897] Stored the screenshot as assetId: 9a16e763-4619-46d3-8e9e-281e2280acec
2024-10-14T16:55:13.181Z info: [Crawler][897] Done extracting metadata from the page.
2024-10-14T16:55:13.182Z info: [Crawler][897] Downloading image from "https://odsonfinance.com/wp-content/uploads/2024/01/How-to-do-a-Backdoor-Roth-IRA-1.png"
2024-10-14T16:55:13.289Z info: [Crawler][897] Downloaded image as assetId: 76a7a90b-ccab-462e-9f06-9afa6799cf93
2024-10-14T16:55:13.381Z info: [Crawler][897] Will attempt to archive page ...
2024-10-14T16:55:14.169Z info: [inference][898] Starting an inference job for bookmark with id "k1l8zj5ixpgj9hugbvmibqfc"
2024-10-14T16:55:14.186Z info: [search][899] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:55:14.249Z info: [search][899] Completed successfully
2024-10-14T16:55:16.005Z info: [inference][898] Inferring tag for bookmark "k1l8zj5ixpgj9hugbvmibqfc" used 2122 tokens and inferred: Roth IRA,Backdoor Roth,Fidelity,Investing,Personal Finance
2024-10-14T16:55:16.041Z info: [inference][898] Completed successfully
2024-10-14T16:55:16.287Z info: [search][900] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:55:16.348Z info: [search][900] Completed successfully
2024-10-14T16:56:03.719Z info: [Crawler][897] Will crawl "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/" for link with id "k1l8zj5ixpgj9hugbvmibqfc"
2024-10-14T16:56:03.719Z info: [Crawler][897] Attempting to determine the content-type for the url https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/
2024-10-14T16:56:03.771Z info: [Crawler][897] Content-type for the url https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/ is "text/html; charset=UTF-8"
2024-10-14T16:56:06.978Z info: [Crawler][897] Successfully navigated to "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/". Waiting for the page to load ...
2024-10-14T16:56:09.540Z info: [Crawler][897] Finished waiting for the page to load.
2024-10-14T16:56:09.737Z info: [Crawler][897] Finished capturing page content and a screenshot. FullPageScreenshot: false
2024-10-14T16:56:09.747Z info: [Crawler][897] Will attempt to extract metadata from page ...
2024-10-14T16:56:10.481Z info: [Crawler][897] Will attempt to extract readable content ...
2024-10-14T16:56:11.025Z info: [Crawler][897] Done extracting readable content.
2024-10-14T16:56:11.038Z info: [Crawler][897] Stored the screenshot as assetId: 72382d5e-3a19-4382-83d2-1e1cec207c1e
2024-10-14T16:56:11.086Z info: [Crawler][897] Done extracting metadata from the page.
2024-10-14T16:56:11.086Z info: [Crawler][897] Downloading image from "https://odsonfinance.com/wp-content/uploads/2024/01/How-to-do-a-Backdoor-Roth-IRA-1.png"
2024-10-14T16:56:11.215Z info: [Crawler][897] Downloaded image as assetId: 6d054521-87bd-4e25-9b9a-f12c81784706
2024-10-14T16:56:11.312Z info: [Crawler][897] Will attempt to archive page ...
2024-10-14T16:56:12.066Z info: [inference][901] Starting an inference job for bookmark with id "k1l8zj5ixpgj9hugbvmibqfc"
2024-10-14T16:56:12.082Z info: [search][902] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:56:12.146Z info: [search][902] Completed successfully
2024-10-14T16:56:15.205Z info: [inference][901] Inferring tag for bookmark "k1l8zj5ixpgj9hugbvmibqfc" used 2123 tokens and inferred: Backdoor Roth IRA,Fidelity,Retirement Planning,Personal Finance,Investing
2024-10-14T16:56:15.258Z info: [inference][901] Completed successfully
2024-10-14T16:56:16.187Z info: [search][903] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:56:16.301Z info: [search][903] Completed successfully
2024-10-14T16:57:03.761Z info: [Crawler][897] Will crawl "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/" for link with id "k1l8zj5ixpgj9hugbvmibqfc"
2024-10-14T16:57:03.761Z info: [Crawler][897] Attempting to determine the content-type for the url https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/
2024-10-14T16:57:03.813Z info: [Crawler][897] Content-type for the url https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/ is "text/html; charset=UTF-8"
2024-10-14T16:57:08.863Z info: [Crawler][897] Successfully navigated to "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/". Waiting for the page to load ...
2024-10-14T16:57:11.373Z info: [Crawler][897] Finished waiting for the page to load.
2024-10-14T16:57:11.577Z info: [Crawler][897] Finished capturing page content and a screenshot. FullPageScreenshot: false
2024-10-14T16:57:11.586Z info: [Crawler][897] Will attempt to extract metadata from page ...
2024-10-14T16:57:12.127Z info: [Crawler][897] Will attempt to extract readable content ...
2024-10-14T16:57:12.678Z info: [Crawler][897] Done extracting readable content.
2024-10-14T16:57:12.690Z info: [Crawler][897] Stored the screenshot as assetId: c9b9223a-8cfc-4061-aab7-362035ec162e
2024-10-14T16:57:12.733Z info: [Crawler][897] Done extracting metadata from the page.
2024-10-14T16:57:12.733Z info: [Crawler][897] Downloading image from "https://odsonfinance.com/wp-content/uploads/2024/01/How-to-do-a-Backdoor-Roth-IRA-1.png"
2024-10-14T16:57:12.850Z info: [Crawler][897] Downloaded image as assetId: 88122b54-c814-439b-96bf-f265809f2cbe
2024-10-14T16:57:12.945Z info: [Crawler][897] Will attempt to archive page ...
2024-10-14T16:57:13.718Z info: [inference][904] Starting an inference job for bookmark with id "k1l8zj5ixpgj9hugbvmibqfc"
2024-10-14T16:57:13.735Z info: [search][905] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:57:13.803Z info: [search][905] Completed successfully
2024-10-14T16:57:15.744Z info: [inference][904] Inferring tag for bookmark "k1l8zj5ixpgj9hugbvmibqfc" used 2124 tokens and inferred: Backdoor Roth IRA,Fidelity,Personal Finance,Retirement Planning,Investing Strategies
2024-10-14T16:57:15.780Z info: [inference][904] Completed successfully
2024-10-14T16:57:15.830Z info: [search][906] Attempting to index bookmark with id k1l8zj5ixpgj9hugbvmibqfc ...
2024-10-14T16:57:15.945Z info: [search][906] Completed successfully
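
A note on reading the log above: crawler job 892 ends with "Completed successfully", while job 897 logs "Will attempt to archive page ..." three times, never logs a completion, and starts over roughly 60 seconds after each previous start. Below is a minimal Python sketch for spotting that pattern in a saved log, assuming the line format shown above. The function names, the regex, and the shortened sample lines (URL replaced by a placeholder) are illustrative and not part of hoarder.

```python
import re
from collections import defaultdict

# Assumed line shape, based on the log excerpt in this thread:
# 2024-10-14T16:55:03.684Z info: [Crawler][897] Will crawl "..." for link with id "..."
LINE_RE = re.compile(r"^(\S+) info: \[Crawler\]\[(\d+)\] (.*)$")


def summarize_crawler_jobs(log_text):
    """Count crawl starts, archive attempts, and completions per crawler job id."""
    stats = defaultdict(lambda: {"starts": 0, "archive_attempts": 0, "completed": 0})
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a crawler line (search, inference, blank, ...)
        _, job_id, message = match.groups()
        if message.startswith("Will crawl "):
            stats[job_id]["starts"] += 1
        elif message.startswith("Will attempt to archive page"):
            stats[job_id]["archive_attempts"] += 1
        elif message.startswith("Completed successfully"):
            stats[job_id]["completed"] += 1
    return dict(stats)


def looping_jobs(stats):
    """Job ids that were started more than once and never logged a completion."""
    return [job for job, s in stats.items() if s["starts"] > 1 and s["completed"] == 0]


# Abbreviated lines in the same shape as the log above (placeholder URL and id).
SAMPLE = """\
2024-10-14T16:54:23.815Z info: [Crawler][892] Will crawl "https://example.com/" for link with id "abc"
2024-10-14T16:54:31.612Z info: [Crawler][892] Completed successfully
2024-10-14T16:55:03.684Z info: [Crawler][897] Will crawl "https://example.com/" for link with id "abc"
2024-10-14T16:55:13.381Z info: [Crawler][897] Will attempt to archive page ...
2024-10-14T16:56:03.719Z info: [Crawler][897] Will crawl "https://example.com/" for link with id "abc"
2024-10-14T16:56:11.312Z info: [Crawler][897] Will attempt to archive page ...
"""

if __name__ == "__main__":
    print(looping_jobs(summarize_crawler_jobs(SAMPLE)))
```

Run against the full log, this kind of check would flag job 897 and leave 892 alone, which narrows the problem to the archive step rather than the crawl itself.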

raviwarrier commented 1 week ago

I have a similar problem with really old bookmarks from now-defunct websites or apps. When I try to search for their URLs in hoarder, I get no results, so I have no easy way to find and delete them.

kamtschatka commented 6 days ago

I tried adding the URL "https://odsonfinance.com/chapter-2b-how-to-do-a-backdoor-roth-with-fidelity-step-by-step-instructions/" and everything works just fine. Are you on the latest version? How are you deploying hoarder?

dimatx commented 6 days ago

I'm deploying with Docker Compose and I'm on the latest version. Is there any other info I can provide for troubleshooting, assuming I can reproduce it?

kamtschatka commented 5 days ago

Do you have any environment variables set?

dimatx commented 5 days ago

Did you try downloading the full page archive? That seems to be what is causing the loop; the process never seems to finish.

Here's my docker compose and .env.

version: "3.8"
services:
  web:
    image: ghcr.io/hoarder-app/hoarder:${HOARDER_VERSION:-release}
    restart: unless-stopped
    volumes:
      - data:/data
    ports:
      - 3200:3000
    env_file:
      - .env
    environment:
      MEILI_ADDR: http://meilisearch:7700
      BROWSER_WEB_URL: http://chrome:9222
      OPENAI_API_KEY: *********************************
      DATA_DIR: /data
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
  meilisearch:
    image: getmeili/meilisearch:v1.6
    restart: unless-stopped
    env_file:
      - .env
    environment:
      MEILI_NO_ANALYTICS: "true"
    volumes:
      - meilisearch:/meili_data
volumes:
  meilisearch: null
  data: null
networks: {}

HOARDER_VERSION=release
NEXTAUTH_SECRET=*******************
MEILI_MASTER_KEY=*******************
NEXTAUTH_URL=http://*******************:3200

kamtschatka commented 2 days ago

Try increasing CRAWLER_JOB_TIMEOUT_SEC. The default is 60 seconds; if the full page archival takes too long, it might cause this behavior.

dimatx commented 2 days ago

I set it to 300 seconds and the issue persists. Isn't it strange that there is a loop despite no errors/failures in the logs? It also uses OpenAI credits over and over according to the logs, so it could run up a bill for someone without a low budget set in OpenAI.
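
To put a number on the repeated OpenAI usage, the inference lines from the log posted earlier in this thread can be tallied per bookmark. This is a rough Python sketch that assumes the "Inferring tag for bookmark ... used N tokens" line shape seen in that log. The function name is made up for illustration, and the sample lines are the four inference lines for the looping bookmark with the inferred tags trimmed to "...".

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the inference lines in the log in this thread.
INFER_RE = re.compile(r'Inferring tag for bookmark "([^"]+)" used (\d+) tokens')


def tokens_per_bookmark(log_text):
    """Return {bookmark_id: (inference_runs, total_tokens)} for a log excerpt."""
    runs = defaultdict(int)
    tokens = defaultdict(int)
    for bookmark_id, used in INFER_RE.findall(log_text):
        runs[bookmark_id] += 1
        tokens[bookmark_id] += int(used)
    return {bookmark_id: (runs[bookmark_id], tokens[bookmark_id]) for bookmark_id in runs}


# The four inference lines for the looping bookmark, tags trimmed to "...".
SAMPLE = """\
2024-10-14T16:54:33.930Z info: [inference][894] Inferring tag for bookmark "k1l8zj5ixpgj9hugbvmibqfc" used 2122 tokens and inferred: ...
2024-10-14T16:55:16.005Z info: [inference][898] Inferring tag for bookmark "k1l8zj5ixpgj9hugbvmibqfc" used 2122 tokens and inferred: ...
2024-10-14T16:56:15.205Z info: [inference][901] Inferring tag for bookmark "k1l8zj5ixpgj9hugbvmibqfc" used 2123 tokens and inferred: ...
2024-10-14T16:57:15.744Z info: [inference][904] Inferring tag for bookmark "k1l8zj5ixpgj9hugbvmibqfc" used 2124 tokens and inferred: ...
"""

if __name__ == "__main__":
    for bookmark_id, (run_count, total) in tokens_per_bookmark(SAMPLE).items():
        print(f"{bookmark_id}: {run_count} inference runs, {total} tokens")
```

For the excerpt posted here, that comes to four inference runs and 8491 tokens for a single bookmark in under three minutes, and the count keeps growing for as long as the loop runs.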

hoarder-app / hoarder

URL seems to get hoarder stuck in a crawl loop on v0.18.0 #537