You are looking at the "Cached Content" tab, not at the "Archive" tab. Cached Content extracts the HTML and then renders it in Hoarder, so it has a lot of limitations. If you enable monolith archiving via the environment variables, there will be an archive (see the dropdown above). How does it look if you use that?
It seems you are right; that is my bad. I tried looking in the config for that option but must have missed it, and ended up assuming "cached" was the equivalent.
To anyone who might run into the same issue: setting `CRAWLER_FULL_PAGE_ARCHIVE` to `true` activates the archive functionality, and it works as expected.
May I suggest adding a section about enabling this option to the docs on the website, specifically in the installation section?
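For reference, a minimal sketch of what this looks like in practice, assuming a Docker-based setup where the containers read their configuration from an `.env` file (the file name and layout are an assumption about a typical install, not something specific from this thread):

```sh
# .env read by the Hoarder containers (assumed setup)
# Enables full-page archiving via Monolith, which populates the "Archive" entry in the dropdown
CRAWLER_FULL_PAGE_ARCHIVE=true
```

After changing the variable, the containers need to be restarted for it to take effect, and bookmarks saved before the change will likely need to be re-crawled before an archive appears.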
Hey yo!
I've been looking for a system like this for a long time, but they all tend to have the same problem: crawling the websites I save usually ends up with a low-quality local version of the site, which makes it useless for combating link rot.
As an example, I am trying to locally cache this link, and the code blocks, which are the most important part of the site, do not get saved locally. Neither do the images below them.
Real site:
Cached site:
A similar problem can be observed when crawling Reddit links:
In general, many of the websites I tried crawling are missing content that is important to me. Code blocks in particular seem not to be cached locally.
Is there any way for me to configure Monolith to help it cache the content I need? Or can Hoarder itself do something to improve the quality of the cache?
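For what it's worth, one way to narrow down whether the missing content is a limitation of Monolith itself (rather than how Hoarder invokes it) is to run Monolith directly against one of the problematic pages and open the result in a browser. A rough sketch, assuming Monolith is installed locally; the URL and output file name are placeholders:

```sh
# Save the page as a single self-contained HTML file (placeholder URL and output name)
monolith 'https://example.com/article-with-code-blocks' -o archived-page.html
```

If the code blocks are missing here too, the content is probably injected by JavaScript after page load, which Monolith does not execute.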