vitaly-zdanevich opened this issue 3 days ago
Thank you for reporting this.
I confirm this is not the expected behavior.
Please have a look at https://github.com/openzim/zimit/wiki/Frequently-Asked-Questions#some-links-are-not-pointing-inside-the-zim-but-to-the-online-website (I've just written this so that it can benefit others as well). It could be interesting to check the logs for errors on the missing pages, and to run the crawl again with only this page as `--url` and `--depth 1 --scopeType custom --include ".*"`, i.e. something like:

```shell
docker run -v $PWD/output:/output --name crawlme --rm \
  ghcr.io/openzim/zimit:latest zimit --url "https://flibusta.is/a/9450" \
  --name tests_ru_flibusta --depth 1 --scopeType custom --include ".*"
```
Thanks, I did it. Looks like it downloaded all the links on the page...
From `--help`:

```
--depth DEPTH    The depth of the crawl for all seeds
```

What does that mean? What is a seed? I just want to download the full website...
The seeds are the URLs you pass with `--url` (I don't recall whether zimit supports multiple seeds; probably not, only Browsertrix Crawler does, if I'm not mistaken).

`--depth 1` means: explore the seed (https://flibusta.is/a/9450 in our example) and all pages linked from this seed page, nothing more. This does not download the full website at all. It is useful in many circumstances; here it lets you quickly confirm that there is probably no bug in the scraper code (the page works this time) but probably something else going on. I would suspect intermittent issues on your machine or on the upstream server, which caused some pages to fail to download. You should analyze the logs of the full run to find details about those pages whose links are not working.
> You should analyze the logs of the full run

That website is big, and I cannot scroll through the full log. Is it possible to store the log to a file?
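One standard shell-level way to do this (independent of zimit itself) is to pipe the container output through `tee`, which writes it to a file while still printing it to the terminal. Sketched here with a stand-in command; substitute your full `docker run ... zimit ...` invocation:

```shell
# Stand-in for the real crawl command. 2>&1 merges stderr into stdout
# so error lines land in the same file as regular output.
{ echo "page ok"; echo "page failed" >&2; } 2>&1 | tee crawl.log

# The full log is now also on disk for later inspection.
grep "failed" crawl.log
```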
> --depth 1

What is the default value? It looks like depth 1 will not download pages linked from linked pages?
Also, why are there two `--name` options in that command?
Hi, I downloaded https://flibusta.is using your Docker examples from the README, around 90 GB. I see that some links of the same type were not fetched: they have absolute URLs and open Firefox on click (I use Kiwix).

If you try to download it, try hovering over the links on this page: https://flibusta.is/a/9450. For example, here only the 2 links in the middle are downloaded:

The end of the logs looks OK:

Thanks.