openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

--exclude question #293

Closed by onexecute 5 months ago

onexecute commented 5 months ago

I'm attempting to limit the scope of a website capture to a single directory. I implemented the --exclude parameter as stated in the readme by creating an exclusion list of directories from URLs. Should I still see the excluded URLs scroll through the terminal output?

benoit74 commented 5 months ago

> Should I still see the excluded urls scroll through the terminal output?

Nope, it means your exclude parameter is not working.

> I implemented the --exclude parameter as stated in the readme by creating an exclusion list of directories from urls.

Could you be more specific? The --exclude parameter is supposed to be one single regex; any URL it matches is excluded from the crawl.

Please share the exact command line you are using and examples of URLs you wanna include and URLs you wanna exclude. If you cannot share the exact URLs due to confidentiality issues, you can probably replace them with fake domains / paths.

onexecute commented 5 months ago

Thanks. Here's the command I ran. I want to exclude everything except /projects; for example, https://www.allaboutcircuits.com/pcs/ should be excluded.

docker run -v /output:/output --shm-size=1gb ghcr.io/openzim/zimit zimit --url https://www.allaboutcircuits.com/projects --name allaboutcircuits-projects-site --workers 1 --waitUntil domcontentloaded --exclude="(pcs | feed | control | forum | eepower | maker | podcast | twitter | author | bom | electronic-components | giveaways | articles | webinars | white-papers | ip-cores | latest | news | news | partner | podcast | privacy | tech | technical | test | textbook | tools | user | video | virtual | worksheets | write | all_about_circuits |facebook | linkedin | twitter | youtube | instagram | rss | contact | about | faq | advertise | press | sitemap | terms | privacy | policy | register | login | company | mikrocontroller | pandora | youtube)"

benoit74 commented 5 months ago

Do I get it correctly if I say that you want https://www.allaboutcircuits.com/projects and all its subpages (i.e. https://www.allaboutcircuits.com/projects/) but nothing else of `https://www.allaboutcircuits.com/`?

If yes, then your issue could be quite simple to solve.

Your problem is that by default, the scope type is prefix, and since your URL does not contain an ending slash, the prefix is https://www.allaboutcircuits.com/.* (i.e. the whole website).

Since the URL https://www.allaboutcircuits.com/projects/ also works well (note the ending /, very important), I recommend that you simply use this URL (without any exclude rule). It will default to scope type prefix and the prefix will be https://www.allaboutcircuits.com/projects/.*.
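
To make the trailing-slash point concrete, here is a minimal sketch of how a prefix scope could be derived from the seed URL. It is a simplified model of the behaviour described above, not the crawler's actual code; `prefix_scope` and `in_scope` are hypothetical helper names.

```python
from urllib.parse import urlsplit


def prefix_scope(seed_url: str) -> str:
    """Hypothetical helper: keep everything up to and including the last '/' of the path."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"
    # Drop whatever follows the last slash (e.g. "projects" in "/projects").
    directory = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_scope(seed_url: str, candidate_url: str) -> bool:
    """Hypothetical helper: a URL is in scope when it starts with the seed's prefix."""
    return candidate_url.startswith(prefix_scope(seed_url))


no_slash = "https://www.allaboutcircuits.com/projects"
with_slash = "https://www.allaboutcircuits.com/projects/"
other_page = "https://www.allaboutcircuits.com/pcs/"

print(prefix_scope(no_slash))    # prefix is the whole site
print(prefix_scope(with_slash))  # prefix is limited to /projects/
print(in_scope(no_slash, other_page), in_scope(with_slash, other_page))
```

Under this model, the seed without the ending slash puts /pcs/ in scope, while the seed with the ending slash leaves it out.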

If you've not already read it, I recommend https://github.com/openzim/zimit/wiki/Frequently-Asked-Questions

All that being said, your --exclude rule is not a correct regex (or at least it won't work as intended): you must not have spaces before/after the |.
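
A quick way to see why the spaces matter is to try a shortened version of the pattern. This sketch uses Python's `re` module, whereas the crawler evaluates the regex in its own engine; for plain alternation with literal spaces the behaviour should be the same. It also assumes the pattern is searched anywhere in the URL. `is_excluded` is a hypothetical helper.

```python
import re

# Shortened version of the pattern from the command above, spaces kept as typed.
# Each space is a literal character, so the alternatives are "pcs ", " feed ", " control".
with_spaces = re.compile(r"(pcs | feed | control)")

# Same alternatives with the spaces removed.
without_spaces = re.compile(r"(pcs|feed|control)")


def is_excluded(pattern: re.Pattern, url: str) -> bool:
    """Hypothetical helper: treat a URL as excluded when the pattern matches anywhere in it."""
    return pattern.search(url) is not None


unwanted = "https://www.allaboutcircuits.com/pcs/"
wanted = "https://www.allaboutcircuits.com/projects/"

print(is_excluded(with_spaces, unwanted))     # URLs contain no spaces, so nothing matches
print(is_excluded(without_spaces, unwanted))  # "pcs" is found in the URL
print(is_excluded(without_spaces, wanted))    # none of the three words appears here
```

Note that unanchored alternatives match anywhere in the URL string, so a short word such as "about" from the full list would also match the domain name itself; wrapping directory names in slashes (for example `/pcs/`) is one way to narrow that.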

onexecute commented 5 months ago

Perfect! Thank you so much.