hoarder-app / hoarder

A self-hostable bookmark-everything app (links, notes and images) with AI-based automatic tagging and full text search
https://hoarder.app
GNU Affero General Public License v3.0

Bug: Images Are Not Loaded Correctly - All Images Appear as "Broken" in the Webpage #491

Closed. ljq29 closed this issue 4 weeks ago

ljq29 commented 4 weeks ago

Description

When Hoarder scrapes a webpage, none of the images on it load correctly; they all appear as "broken" or missing. Hoarder seems to be failing to fetch the images from the page.

Steps to Reproduce

  1. Use Hoarder to scrape a webpage with images.
  2. Check the output for images.
  3. Notice that all images are displayed as broken or missing.

Expected Behavior

The images should be properly fetched and displayed in the output without being broken.

Actual Behavior

All images on the webpage show up as broken, indicating that the image URLs or fetching process might not be working as expected.

Additional Context

Any webpage with images will have this issue, making it impossible to scrape or view images correctly.

kamtschatka commented 4 weeks ago

So this is definitely not a general problem; it works fine for everyone else, and you are the first to report this. Considering that you had issues setting up your environment before (https://github.com/hoarder-app/hoarder/issues/487), did you check that everything is configured correctly?

ljq29 commented 4 weeks ago

Description

I found an example at this link. The original image is:

[Image: original image from Baijiahao]

And the screenshot in Hoarder is:

[Screenshot: the same image rendered in Hoarder]

Additionally, I just noticed that Hoarder uses the original image link, rather than caching the image to its own server like Cubox does. Would it be possible to improve this aspect?

kamtschatka commented 4 weeks ago

Please check out the config flags in the documentation. You can already enable full-page archives by setting CRAWLER_FULL_PAGE_ARCHIVE to true.

From what I can tell, those requests are first blocked by the browser due to Opaque Response Blocking (ORB). Even if we were to prevent that with some changes, Baidu simply does not want you to embed their images in other webpages, so this cannot work. For the preview you will have to live with that. If you configure the archiving, everything is downloaded correctly, because it is no longer constrained by the browser's rules.
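For reference, a minimal sketch of how that flag might be set in a docker-compose file (the service name and image tag here are illustrative, not taken from this thread; adjust them to your own compose file):

```yaml
services:
  hoarder:
    image: ghcr.io/hoarder-app/hoarder:release  # illustrative tag
    environment:
      # Store a full local archive of each crawled page, so images are
      # downloaded by the crawler instead of hotlinked from the source site.
      - CRAWLER_FULL_PAGE_ARCHIVE=true
```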

ljq29 commented 4 weeks ago

> Please check out the config flags in the documentation. You can already enable full-page archives by setting CRAWLER_FULL_PAGE_ARCHIVE to true.
>
> From what I can tell, those requests are first blocked by the browser due to Opaque Response Blocking (ORB). Even if we were to prevent that with some changes, Baidu simply does not want you to embed their images in other webpages, so this cannot work. For the preview you will have to live with that. If you configure the archiving, everything is downloaded correctly, because it is no longer constrained by the browser's rules.

As for the env configuration, where exactly should it be placed within the containers? I tried setting it in the worker container, but it did not work. [Screenshot]

When I placed the env configuration in the web container instead, several web pages repeatedly failed to fetch: [Screenshot]

kamtschatka commented 4 weeks ago

Yes, in the worker's environment variables. (By the way, you are using the old setup, where web and worker are separate Docker containers.) You probably did not look at the "Archive" tab in the preview, but at the same screen as above.
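To make that concrete, in the old two-container layout the flag goes on the worker service, along the lines of this sketch (service names and image tags are assumptions; match them to your actual compose file):

```yaml
services:
  web:
    image: ghcr.io/hoarder-app/hoarder-web:release  # illustrative tag
    # No crawler flags here; the web container only serves the UI.
  workers:
    image: ghcr.io/hoarder-app/hoarder-workers:release  # illustrative tag
    environment:
      # The worker runs the crawler, so crawler settings belong here.
      - CRAWLER_FULL_PAGE_ARCHIVE=true
```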