hoarder-app / hoarder

A self-hostable bookmark-everything app (links, notes and images) with AI-based automatic tagging and full text search
https://hoarder.app
GNU Affero General Public License v3.0

Bug: Images Are Not Loaded Correctly - All Images Appear as "Broken" in the Webpage #491

Closed. ljq29 closed this issue 4 weeks ago

ljq29 commented 4 weeks ago

Description

When Hoarder scrapes a webpage, none of the images on it load correctly; they all appear as "broken" or missing. Hoarder seems to be failing to fetch the images from the page.

Steps to Reproduce

  1. Use Hoarder to scrape a webpage with images.
  2. Check the output for images.
  3. Notice that all images are displayed as broken or missing.

Expected Behavior

The images should be properly fetched and displayed in the output without being broken.

Actual Behavior

All images on the webpage show up as broken, indicating that the image URLs or fetching process might not be working as expected.

Additional Context

Any webpage with images will have this issue, making it impossible to scrape or view images correctly.

kamtschatka commented 4 weeks ago

So this is definitely not a general problem; it works fine for everyone else, and you are the first to report this. Considering that you had issues setting up your environment before (https://github.com/hoarder-app/hoarder/issues/487), did you check that everything is configured correctly?

ljq29 commented 4 weeks ago

Description

I found an example at this link. The original image is:

[Image: original image from Baijiahao]

And the screenshot in Hoarder is:

[Screenshot: the same image rendered in Hoarder]

Additionally, I just noticed that Hoarder uses the original image link, rather than caching the image to its own server like Cubox does. Would it be possible to improve this aspect?

kamtschatka commented 4 weeks ago

Please check out the config flags in the documentation. You can already enable full-page archives by setting CRAWLER_FULL_PAGE_ARCHIVE to true.

From what I can tell, those requests are first blocked by the browser due to Opaque Response Blocking (ORB). Even if we were to prevent that with some changes, Baidu simply does not want you to embed their images in other webpages, so this cannot work. For the preview you will have to live with that. If you configure the archiving, everything is downloaded correctly, because it is no longer constrained by the browser's rules.
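For reference, a minimal sketch of how that flag might be set in a docker-compose file (the service name and image tag here are illustrative, not taken from this thread; adjust them to your own compose file):

```yaml
services:
  hoarder:
    image: ghcr.io/hoarder-app/hoarder:release  # illustrative tag
    environment:
      # Store a full local archive of each crawled page, so images are
      # downloaded by the crawler instead of hotlinked from the source site.
      - CRAWLER_FULL_PAGE_ARCHIVE=true
```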

ljq29 commented 4 weeks ago

> Please check out the config flags in the documentation. You can already enable full-page archives by setting CRAWLER_FULL_PAGE_ARCHIVE to true.
>
> From what I can tell, those requests are first blocked by the browser due to Opaque Response Blocking (ORB). Even if we were to prevent that with some changes, Baidu simply does not want you to embed their images in other webpages, so this cannot work. For the preview you will have to live with that. If you configure the archiving, everything is downloaded correctly, because it is no longer constrained by the browser's rules.

As for the env configuration, where exactly should it be placed within the containers? I tried setting it in the worker container, but it did not work. [Screenshot]

When I placed the env configuration in the web container instead, several web pages repeatedly failed to fetch: [Screenshot]

kamtschatka commented 4 weeks ago

Yes, in the worker's environment variables. (By the way, you are using the old setup, where web and worker are separate Docker containers.) You probably did not look at the "Archive" tab in the preview, but at the same screen as above.
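To make that concrete, in the old two-container layout the flag goes on the worker service, along the lines of this sketch (service names and image tags are assumptions; match them to your actual compose file):

```yaml
services:
  web:
    image: ghcr.io/hoarder-app/hoarder-web:release  # illustrative tag
    # No crawler flags here; the web container only serves the UI.
  workers:
    image: ghcr.io/hoarder-app/hoarder-workers:release  # illustrative tag
    environment:
      # The worker runs the crawler, so crawler settings belong here.
      - CRAWLER_FULL_PAGE_ARCHIVE=true
```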