Closed onexecute closed 5 months ago
Should I still see the excluded urls scroll through the terminal output?
Nope, it means your exclude parameter is not working
I implemented the --exclude parameter as stated in the readme by creating an exclusion list of directories from URLs.
Could you be more specific? The --exclude parameter is supposed to be one single regex which, when matched, will cause the URL to be excluded from the crawl.
Please share the exact command line you are using and example of URLs you wanna include and URLs you wanna exclude. If you cannot share the exact URL due to confidentiality issues, you can probably replace them with fake domains / paths.
Thanks. Here's the command I ran. I want to exclude everything except /projects, for example https://www.allaboutcircuits.com/pcs/
docker run -v /output:/output --shm-size=1gb ghcr.io/openzim/zimit zimit --url https://www.allaboutcircuits.com/projects --name allaboutcircuits-projects-site --workers 1 --waitUntil domcontentloaded --exclude="(pcs | feed | control | forum | eepower | maker | podcast | twitter | author | bom | electronic-components | giveaways | articles | webinars | white-papers | ip-cores | latest | news | news | partner | podcast | privacy | tech | technical | test | textbook | tools | user | video | virtual | worksheets | write | all_about_circuits |facebook | linkedin | twitter | youtube | instagram | rss | contact | about | faq | advertise | press | sitemap | terms | privacy | policy | register | login | company | mikrocontroller | pandora | youtube)"
Do I get it correctly if I say that you want https://www.allaboutcircuits.com/projects and all its subpages (i.e. https://www.allaboutcircuits.com/projects/) but nothing else of `https://www.allaboutcircuits.com/`?
If yes, then your issue could be quite simple to solve.
Your problem is that by default, the scope type is prefix, and since your URL does not contain an ending slash, the prefix is https://www.allaboutcircuits.com/.* (i.e. the whole website).
Since the URL https://www.allaboutcircuits.com/projects/ also works well (note the ending /, very important), I recommend that you simply use this URL (without any exclude rule). It will default to scope type prefix and the prefix will be https://www.allaboutcircuits.com/projects/.*
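To make the trailing-slash point concrete, here is a minimal Python sketch of the behaviour described above. It assumes the prefix is simply the seed URL cut after its last "/"; the helper names prefix_scope and in_scope are hypothetical and this is not the crawler's actual code.

```python
def prefix_scope(seed_url: str) -> str:
    """Hypothetical helper: keep the seed URL up to and including its last '/'.

    This mirrors the behaviour described in this thread, not the crawler's
    real implementation.
    """
    return seed_url[: seed_url.rfind("/") + 1]


def in_scope(seed_url: str, candidate: str) -> bool:
    """A candidate URL is in scope if it starts with the derived prefix."""
    return candidate.startswith(prefix_scope(seed_url))


no_slash = "https://www.allaboutcircuits.com/projects"
with_slash = "https://www.allaboutcircuits.com/projects/"
other_page = "https://www.allaboutcircuits.com/pcs/"

# Without the ending slash, the prefix falls back to the site root.
print(prefix_scope(no_slash))
# With the ending slash, the prefix stays inside /projects/.
print(prefix_scope(with_slash))

print(in_scope(no_slash, other_page))    # whole site is in scope
print(in_scope(with_slash, other_page))  # only /projects/ is in scope
```

Under that assumption, the seed without the slash lets /pcs/ into the crawl, while the seed with the slash keeps it out, with no exclude rule needed.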
If you've not already read it, I recommend https://github.com/openzim/zimit/wiki/Frequently-Asked-Questions
All that being said, your --exclude rule is not a correct regex (or at least it won't work as intended): you must not have spaces before/after the |.
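The effect of those spaces can be checked with a short Python sketch. It uses Python's re module as a stand-in (the crawler's own regex engine may differ, but literal spaces behave the same way), assumes the pattern is searched anywhere in the full URL, and shortens the pattern to three alternatives for readability.

```python
import re

# Shortened versions of the rule from the command above (assumption: the
# regex is searched anywhere in the full URL, not anchored).
spaced = re.compile(r"(pcs | feed | control)")  # spaces are literal characters
tight = re.compile(r"(pcs|feed|control)")       # no spaces around |

url = "https://www.allaboutcircuits.com/pcs/"

# "pcs " (with a trailing space) never occurs in the URL, so nothing matches
# and the URL is not excluded.
print(spaced.search(url) is None)

# Without the spaces, "pcs" matches and the URL would be excluded.
print(tight.search(url) is not None)

# Caveat: an unanchored alternative can also match the host name.
# "about" is a substring of "allaboutcircuits", so it matches every URL
# on the site, including the ones that should be kept.
broad = re.compile(r"(pcs|about)")
print(broad.search("https://www.allaboutcircuits.com/projects/") is not None)
```

The last check is one more reason the trailing-slash URL without any exclude rule is the simpler fix here: a long unanchored alternation is easy to get subtly wrong.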
Perfect! Thank you so much.
I'm attempting to limit the scope of a website capture to a single directory. I implemented the --exclude parameter as stated in the readme by creating an exclusion list of directories from URLs. Should I still see the excluded URLs scroll through the terminal output?