benoit74 opened 1 week ago
I think we should be cautious about changing Browsertrix defaults in zimit. Most of the time, a default doesn't matter in zimit itself but only in our own use of zimit, hence the change should live in the zimit configuration and/or the Zimfarm offliner.
That's the case for `--failOnFailedLimit`, IMO.
On `--failOnInvalidStatus`, I think we could consider it a good thing in general and change the default, but if there is no bundled way to invert the flag (it can only be set to true), then I think it's not worth it. Keeping strict compatibility with the crawler (i.e. only extending it) is an important feature (to me).
I very much like the idea of keeping strict compatibility / transparency with Browsertrix Crawler.
I was misled by the fact that `--failOnFailedSeed` is passed unconditionally, but this is probably a bit different. Even so, I think we should change this as well, i.e. expose the CLI argument as-is and only set it to true by default in the Zimfarm offliner.
We also need to expose these arguments, which I missed at first look:
```
--blockRules        Additional rules for blocking certain URLs from being
                    loaded, by URL regex and optionally via text match in an
                    iframe                               [array] [default: []]
--blockMessage      If specified, when a URL is blocked, a record with this
                    error message is added instead      [string] [default: ""]
--blockAds, --blockads
                    If set, block advertisements from being loaded (based on
                    Stephen Black's blocklist)       [boolean] [default: false]
--adBlockMessage    If specified, when an ad is blocked, a record with this
                    error message is added instead      [string] [default: ""]
```
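For illustration, here is a hypothetical zimit invocation once these flags are passed through unchanged. The flag names are Browsertrix Crawler's own; the `--url`/`--name` arguments, the regex, and the messages are placeholder assumptions, not a confirmed zimit interface:

```sh
# Sketch only: assumes zimit forwards these flags verbatim to the crawler.
zimit --url https://example.com \
      --name example \
      --blockAds \
      --adBlockMessage "Advertisement blocked during crawl" \
      --blockRules "https?://tracker\.example\.net/.*" \
      --blockMessage "URL blocked during crawl"
```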
Why don't you add this one: `--combineWARC`? It would be useful because most crawling tasks use `--keep`.
In zimit, combining WARCs is just a significant waste of computing resources and storage, since it means all records have to be parsed and transferred to a new "combined" file.
The combined file is useful only if you want to transport the WARC as one single file, which is not our use case.
Most crawling tasks do use `--keep` indeed, but it is then the Zimfarm's responsibility to create a tar of these files (much more efficient than a combined WARC). And this applies only to Zimfarm usage.
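To illustrate the efficiency argument: a tar archive only wraps the files with headers and never parses the WARC records, whereas `--combineWARC` has to read and rewrite every record. A minimal sketch (the path and file names are hypothetical):

```sh
# The .warc.gz files are already compressed, so plain tar (no -z) is enough;
# this streams the files as-is instead of re-parsing WARC records.
tar -cf collection.tar collections/crawl-*/archive/*.warc.gz
```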
I'm not against exposing this parameter, but I don't see what the usage would be for us or for an end-user.
I didn't know much about this, but in my case, when I open for example the first WARC of two WARCs in the archive folder, it doesn't display the whole content in the ReplayWeb.page desktop app; it needs the other one. So I searched and found this argument in Browsertrix, and thought I would tell you how I deal with this.
Ilya says:

> A quick solution is to combine all WARC files into one, which can be done via the command line, for example: `cat *.warc.gz > ./combined/all.warc.gz`
But I had no idea how to do this on the Windows command line.
Edit: I used the `type` command and got a working combined file. Could you add this parameter temporarily so I can try it out? If not, no problem. Thank you for clarifying this.
We have some "new" (some are few months old ...) CLI argument of browsertrix crawler to consider:
For seed urls, I propose to use
--seedFile
, and (if not already the case) support a URL from which to fetch this file (to be done in browsertrix crawler directly preferably).For
--failOnFailedLimit
and--failOnInvalidStatus
, I think we should expose these two arguments and changing their defaults values:100
for--failOnFailedLimit
andtrue
for--failOnInvalidStatus
. Both would be sensible defaults to warn the user something bad is happening and they should confirm they want to continue. If we agree on this, and since having atrue
default on a boolean flag prevent from unsetting it, we should expose--doNotFailOnInvalidStatus
at zimit level, instead of--failOnInvalidStatus
.For sitemap arguments, I propose to use
--sitemapFromDate
and--sitemapToDate
for clarity (plus they are the real name used, the variant is an alias in browsertrix crawler codebase).For
--selectLinks
, we need to expose this CLI argument and (contrary to what I said on Monday) modify the default value toa[href]->href,area[href]->href
(users would probably expect us to also explore these pages in most cases, and should it cause a problem one can customize it by setting the CLI argument).For
--postLoadDelay
, nothing special but add it.@rgaudin @kelson42 any thoughts?
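Put together, this is roughly what an invocation using every argument from the list above could look like. This is a sketch only: it assumes zimit exposes the flags under the crawler's names, that `--doNotFailOnInvalidStatus` is the zimit-level inverse proposed above, and that the seed file URL, dates, and values are placeholders:

```sh
# Hypothetical zimit call exercising the arguments proposed in this list.
zimit --name example \
      --seedFile https://example.com/seeds.txt \
      --failOnFailedLimit 200 \
      --doNotFailOnInvalidStatus \
      --sitemapFromDate 2024-01-01 \
      --sitemapToDate 2024-12-31 \
      --selectLinks "a[href]->href,area[href]->href" \
      --postLoadDelay 5
```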