openzim / zimit

Make a ZIM file from any Web site and surf offline!

flibusta.is: some pages are not downloaded #414

Open · vitaly-zdanevich opened this issue 3 days ago

vitaly-zdanevich commented 3 days ago

Hi, I downloaded https://flibusta.is using your Docker examples from the README (around 90 GB), and I see that some links of the same type were not fetched: they have absolute URLs and open in Firefox on click (I use Kiwix).

If you try to download it, hover over the links on this page: https://flibusta.is/a/9450. For example, only the 2 links in the middle here were downloaded: [screenshot]

The end of the logs looks OK:

[screenshot of the end of the log output]

Thanks.

benoit74 commented 2 days ago

Thank you for reporting this.

I confirm this is not the expected behavior.

Please have a look at https://github.com/openzim/zimit/wiki/Frequently-Asked-Questions#some-links-are-not-pointing-inside-the-zim-but-to-the-online-website (I've just written this so that it can benefit others as well). It could be interesting to check the logs for errors on the missing pages, and to run the crawl again with only this page as --url and with --depth 1 --scopeType custom --include ".*", i.e. something like:

docker run -v $PWD/output:/output --name crawlme --rm ghcr.io/openzim/zimit:latest zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"
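A minimal sketch of one way to check the logs for errors, assuming the crawl output was captured to a file (see the redirection note later in the thread); the exact wording of zimit's error lines is an assumption here, so the pattern may need adjusting:

grep -iE "error|fail" zimit.log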

vitaly-zdanevich commented 2 days ago

docker run -v $PWD/output:/output --name crawlme --rm ghcr.io/openzim/zimit:latest zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"

Thanks, I did it. It looks like it downloaded all the links on the page...

vitaly-zdanevich commented 2 days ago

From --help:

--depth DEPTH The depth of the crawl for all seeds

What does that mean? What is a seed? I just want to download the full website...

benoit74 commented 2 days ago

The seeds are the URLs you pass with --url (I don't recall whether zimit supports multiple seeds, probably not; only Browsertrix Crawler does, if I'm not mistaken).

--depth 1 means: explore the seed (https://flibusta.is/a/9450 in our example) and all pages linked from that seed page, nothing more. This does not download the full website at all. It is useful in many circumstances; here it lets you quickly confirm that there is probably no bug in the scraper code (the page works this time) but probably something else at play. I would suspect intermittent issues on your machine or on the upstream server that caused some pages to fail to download. You should analyze the logs of the full run to find details about the pages whose links are not working.
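A short sketch of how the hop counting described above plays out, assuming --depth accepts any non-negative integer and leaving out the docker wrapper for brevity:

# seed page plus the pages it links to directly (the test run above)
zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"

# one more hop: also fetch pages linked from those linked pages
zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 2 --scopeType custom --include ".*"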

vitaly-zdanevich commented 2 days ago

You should analyze the logs of the full run

That website is big and I cannot scroll through the full log. Is it possible to save the log to a file?
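A minimal sketch of one way to do this, assuming zimit writes its progress to the container's stdout/stderr so that standard shell redirection captures it; the file name zimit.log is just an example:

docker run -v $PWD/output:/output --name crawlme --rm ghcr.io/openzim/zimit:latest zimit --url "https://flibusta.is" --name tests_ru_flibusta 2>&1 | tee zimit.log

tee both prints the output and writes it to the file. For a container that is still running, docker logs crawlme > zimit.log 2>&1 should work as well.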

vitaly-zdanevich commented 2 days ago

--depth 1

What is the default value? Depth 1 sounds like it will not download pages that are only linked from the linked pages?

vitaly-zdanevich commented 2 days ago

Two --name? [screenshot of the command showing --name twice]
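For the record, the command from earlier in the thread really does contain --name twice, but the two flags are parsed by different programs: everything before the image name (ghcr.io/openzim/zimit:latest) is a docker run option, and everything after zimit is an option of zimit itself. Annotated for clarity:

# docker's --name sets the container name; zimit's --name sets the name of the produced ZIM
docker run -v $PWD/output:/output --name crawlme --rm ghcr.io/openzim/zimit:latest \
    zimit --url "https://flibusta.is/a/9450" --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"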