hoarder-app / hoarder

A self-hostable bookmark-everything app (links, notes and images) with AI-based automatic tagging and full text search
https://hoarder.app
GNU Affero General Public License v3.0
6.63k stars, 240 forks

Crawler issues - lacking images and code blocks #471

Closed: Capsup closed this issue 1 month ago

Capsup commented 1 month ago

Hey yo!

I've been looking for a system like this for a long time, but they all tend to have the same problem: crawling the websites I save usually produces a low-quality local copy of the website, which makes it useless to me for combating link rot.

As an example, I am trying to locally cache this link, and the code blocks, which are the most important part of the site, do not get saved locally. Neither do the images below them.

Real site: image

Cached site: image

A similar problem can be observed when crawling Reddit links: image

In general, many of the websites I tried crawling are missing content that is important to me. In particular, code blocks seem not to be cached locally.

Is there any way for me to configure Monolith so that it caches the content I need? Or can hoarder itself do something to improve the quality of the cache?

kamtschatka commented 1 month ago

You are looking at the "Cached Content" tab, not at the "Archive" tab. Cached Content extracts the HTML and then renders it inside hoarder, which comes with a lot of limitations. If you enable monolith archiving via the environment variables, there will be an archive (see the dropdown above). How does it look if you use that?

Capsup commented 1 month ago

It seems you are right, that is my bad. I tried looking in the config for that option but must have missed it. I ended up assuming "cached" was the equivalent.

To anyone who might run into the same issue: setting CRAWLER_FULL_PAGE_ARCHIVE to true activated the archive functionality, and that works as expected.
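For reference, here is a minimal sketch of what that setting might look like, assuming hoarder is deployed with Docker Compose and reads its environment variables from a `.env` file. The file name and location are assumptions and depend on your setup; only the variable name and value come from this thread.

```shell
# .env (assumed file name; put this wherever your deployment defines environment variables)
# Enables full-page archiving via monolith, which populates the "Archive" view
# instead of relying only on the "Cached Content" rendering.
CRAWLER_FULL_PAGE_ARCHIVE=true
```

Environment variables are read at startup, so the service needs a restart before the change takes effect.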

May I suggest we add a part to the docs on the website about enabling this option, specifically in the installation section?