openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Adding new --config argument to zimit #197

Closed f0sh closed 1 year ago

f0sh commented 1 year ago

Since zimit has fully migrated to webrecorder/browsertrix-crawler, it would be nice to have more of the crawler's options available. As a first step, adding a --config option would be awesome for all advanced use cases, so you can configure webrecorder/browsertrix-crawler to fit your needs (e.g. blockRules) via a given YAML file.

Unfortunately, any option that zimit does not support is currently rejected with the error:

zimit: error: unrecognized arguments: --blockMessage --blockRules
Error: Process completed with exit code 2.

rgaudin commented 1 year ago

zimit never migrated to browsertrix-crawler. It has been using it from the start, and the --config option has existed since that day. We chose not to use it because we're focused on our use case, where we want all options on the CLI, passing the crawler ones to the crawler and the rest to warc2zim. If we were to support options that we know we won't be using, then there's no reason not to support them all. In that case, we should refactor a bit so as not to maintain a full duplicate of the crawler's options here.
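For readers following along, the "pass the crawler options to the crawler and the rest to warc2zim" approach described above can be sketched roughly as below. This is only an illustration, not zimit's actual code: the option names, the `CRAWLER_OPTS` set and the `split_args` helper are assumptions made up for the example. It also shows why an option the wrapper does not declare ends with exit code 2, as in the error reported earlier.

```python
import argparse

# Hypothetical classification: options meant for the crawler.
# Every other declared option is assumed to belong to warc2zim.
CRAWLER_OPTS = {"url", "workers"}


def build_parser() -> argparse.ArgumentParser:
    """Declare every supported option explicitly (illustrative subset)."""
    parser = argparse.ArgumentParser(prog="zimit-sketch")
    parser.add_argument("--url", required=True)
    parser.add_argument("--workers", type=int)
    parser.add_argument("--name", required=True)
    parser.add_argument("--title")
    return parser


def split_args(argv):
    """Parse argv and return (crawler_args, warc2zim_args)."""
    namespace = build_parser().parse_args(argv)
    crawler_args, warc2zim_args = [], []
    for key, value in vars(namespace).items():
        if value is None:
            continue  # option not given on the command line
        target = crawler_args if key in CRAWLER_OPTS else warc2zim_args
        target.extend([f"--{key}", str(value)])
    return crawler_args, warc2zim_args


if __name__ == "__main__":
    print(split_args(["--url", "https://example.com", "--name", "demo"]))
```

Because `parse_args` only accepts declared options, anything else (for example `--blockRules`) makes argparse print "unrecognized arguments" and exit with status 2. Supporting more crawler options with this design means declaring each of them again in the wrapper, which is the duplication mentioned above.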

f0sh commented 1 year ago

Thanks @rgaudin for clarifying kiwix's strategy on the config options, and apologies for wrongly mentioning the migration (I read it somewhere here though, hence my wrong conclusion).

I understand that you want to keep full control over the configuration that is passed to browsertrix-crawler. However, in my experience, if you want to use zimit efficiently you are more or less forced to use the blockRules feature; otherwise you waste too many resources crawling social media sites and social commenting functions.

Configuring blockRules via command-line arguments (which is not possible yet) is veeeery troublesome and error-prone, as it is only possible using a JSON string. I strongly believe that, to make this feature practical, there is no way around using a configuration file.
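To illustrate the quoting problem, here is a small sketch that builds such a JSON argument programmatically. The rule fields (`url`, `type`) follow my reading of the browsertrix-crawler documentation and should be treated as an assumption, as should the idea that a `--blockRules` option would accept a JSON string; the helper names are made up for the example.

```python
import json
import shlex

# Illustrative rules only; field names are an assumption based on the crawler docs.
BLOCK_RULES = [
    {"url": r"facebook\.com", "type": "block"},
    {"url": r"disqus\.com", "type": "block"},
]


def block_rules_argv(rules):
    """Return argv items for a hypothetical --blockRules option taking JSON."""
    return ["--blockRules", json.dumps(rules)]


def as_shell_command(argv):
    """Quote each item so the JSON survives a POSIX shell unchanged."""
    return shlex.join(argv)


if __name__ == "__main__":
    print(as_shell_command(block_rules_argv(BLOCK_RULES)))
```

Even this small rule set needs nested quotes and escaped backslashes once it is typed by hand on a command line or inside a CI workflow file, which is where the mistakes tend to creep in.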

That's why I directly proposed my solution, knowing that the config file might conflict with the command-line arguments. However, as I understand browsertrix-crawler's documentation, command-line arguments always take priority, as mentioned in my PR's commit message.

Couldn't this be a suitable way to address crawling issues?
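The precedence I have in mind can be modelled as below. This is a sketch of the behaviour as I understand it from the documentation, not the crawler's actual implementation; `merge_options` and the option names are hypothetical.

```python
def merge_options(config_file: dict, cli: dict) -> dict:
    """Combine options so that values given on the CLI override the config file.

    A CLI value of None is treated as "not given" and leaves the
    config-file value (if any) untouched.
    """
    merged = dict(config_file)  # start from the config file
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


if __name__ == "__main__":
    from_file = {"workers": 2, "blockRules": [{"url": r"disqus\.com", "type": "block"}]}
    from_cli = {"workers": 4, "limit": None}
    print(merge_options(from_file, from_cli))
```

With this rule, zimit's own command-line options would still win over anything placed in the file, so the file would only fill in options that zimit does not set itself.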

rgaudin commented 1 year ago

Oh it's no strategy 😅 it's just how we've been doing things, and why, but zimit is definitely not frozen.