Closed f0sh closed 1 year ago
zimit never migrated to browsertrix-crawler. It has been using it from the start, and the --config option has existed since that day.
We chose not to use it because we're focused on our use case, where we want all options on the CLI and pass the crawler ones to the crawler and the rest to warc2zim.
If we were to support options that we know we won't be using, then there's no reason not to support them all. In that case, we should refactor a bit so as not to maintain a full duplicate of the crawler's options here.
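To illustrate the approach described above (every option on the CLI, crawler options forwarded to the crawler, the rest to warc2zim), here is a minimal sketch. It is not zimit's actual code: the option names in CRAWLER_OPTIONS and WARC2ZIM_OPTIONS and the helper split_options are hypothetical and chosen only for illustration.

```python
import argparse

# Hypothetical option names, used only to illustrate the split.
CRAWLER_OPTIONS = {"url", "workers", "limit"}
WARC2ZIM_OPTIONS = {"name", "title"}


def build_parser():
    """Declare every supported option explicitly on a single CLI."""
    parser = argparse.ArgumentParser(prog="wrapper-sketch")
    for opt in sorted(CRAWLER_OPTIONS | WARC2ZIM_OPTIONS):
        parser.add_argument(f"--{opt}")
    return parser


def split_options(argv):
    """Parse argv and route each given option to the tool that owns it."""
    args = vars(build_parser().parse_args(argv))
    # Keep only the options the user actually passed.
    given = {k: v for k, v in args.items() if v is not None}
    crawler = {k: v for k, v in given.items() if k in CRAWLER_OPTIONS}
    warc2zim = {k: v for k, v in given.items() if k in WARC2ZIM_OPTIONS}
    return crawler, warc2zim


crawler_opts, warc2zim_opts = split_options(
    ["--url", "https://example.com", "--workers", "2", "--name", "example"]
)
print(crawler_opts)
print(warc2zim_opts)
```

With this design, any option that is not declared makes argparse exit with an error, which is also why each crawler option has to be duplicated in the wrapper before it can be used.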
Thanks @rgaudin for clarifying kiwix's strategy on the config options and apologies for falsely mentioning the migration (I read it somewhere here though, hence my wrong conclusion).
I understand that you want to keep full control over the configuration that is passed to browsertrix-crawler. However, in my experience, if you want to use zimit efficiently, you are more or less forced to use the blockRules feature; otherwise you waste too many resources crawling social media sites and social commenting functions.
Configuring blockRules via command-line arguments (which is not possible yet) would be very troublesome and error-prone, as it is only possible using a JSON string. I strongly believe that, to make this feature practical, there is no way around using a configuration file.
That's why I directly proposed my solution, knowing that the config file might conflict with the command-line arguments. However, as I understand browsertrix-crawler's documentation, command-line arguments always take priority, as mentioned in my PR's commit message.
Couldn't this be a suitable way to address crawling issues?
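The precedence described above (values from a config file, overridden by anything given on the command line) can be sketched as follows. This only illustrates the idea and is not browsertrix-crawler's or zimit's implementation: merge_options is a hypothetical helper, the dict stands in for an already parsed YAML file, and the url/type fields of the block rule are assumptions about the rule format.

```python
def merge_options(file_config, cli_options):
    """Return the config file values, overridden by explicitly set CLI values."""
    merged = dict(file_config)
    # A CLI value of None means "not given", so the file value is kept.
    merged.update({k: v for k, v in cli_options.items() if v is not None})
    return merged


# Stand-in for a parsed YAML config file (field names are assumptions).
file_config = {
    "workers": 1,
    "blockRules": [
        {"url": r"example\.com/comments", "type": "block"},
    ],
}

# Stand-in for parsed command-line arguments; "limit" was not given.
cli_options = {"workers": 4, "limit": None}

merged = merge_options(file_config, cli_options)
print(merged)
```

Here the nested block rules come only from the file, so they never need to be squeezed into a JSON string on the command line, while a simple scalar such as workers can still be overridden per run.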
Oh, it's no strategy 😅 It's just how we've been doing things, and why, but zimit is definitely not frozen.
Since zimit has fully migrated to webrecorder/browsertrix-crawler, it would be nice to have more of the crawler's options available. As a first step, adding a --config option would be awesome for all advanced use cases, so that you can configure webrecorder/browsertrix-crawler to fit your needs (e.g. blockRules) via a given YAML file. Unfortunately, any option that zimit does not support is currently rejected with the error: