openzim / zimit

Make a ZIM file from any Web site and surf offline!

Consider "new" crawler CLI arguments #433

Open benoit74 opened 1 week ago

benoit74 commented 1 week ago

We have some "new" (some are a few months old...) CLI arguments of browsertrix crawler to consider:

      --seedFile, --urlFile                 If set, read a list of seed urls,
                                            one per line, from the specified
                                            file                        [string]
      --failOnFailedLimit                   If set, save state and exit if
                                            number of failed pages exceeds this
                                            value          [number] [default: 0]
      --failOnInvalidStatus                 If set, will treat pages with 4xx
                                            or 5xx response as failures. When
                                            combined with --failOnFailedLimit
                                            or --failOnFailedSeed may result in
                                            crawl failing due to non-200
                                            responses [boolean] [default: false]
      --sitemapFromDate, --sitemapFrom      If set, filter URLs from sitemaps
                                            to those greater than or equal to
                                            (>=) provided ISO Date string
                                            (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS
                                            or partial date)            [string]
      --sitemapToDate, --sitemapTo          If set, filter URLs from sitemaps
                                            to those less than or equal to (<=)
                                            provided ISO Date string
                                            (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS
                                            or partial date)            [string]
      --selectLinks                         One or more selectors for
                                            extracting links, in the format
                                            [css selector]->[property to use],
                                            [css selector]->@[attribute to use]
                                            [array] [default: ["a[href]->href"]]
      --postLoadDelay                       If >0, amount of time to sleep (in
                                            seconds) after page has loaded,
                                            before taking screenshots / getting
                                            text / running behaviors
                                                           [number] [default: 0]

For seed URLs, I propose to use --seedFile and (if not already the case) to support a URL from which to fetch this file (preferably done in browsertrix crawler directly).
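A minimal sketch of what fetching a remote seed list could look like if it were done on the zimit side instead (the fetch_seed_file helper is hypothetical, not existing code):

```python
import tempfile
import urllib.request

def fetch_seed_file(url: str) -> str:
    """Download a remote seed list and return the path of a local copy,
    so it can be handed to the crawler via --seedFile."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as seed_file:
        seed_file.write(data)
        return seed_file.name
```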

For --failOnFailedLimit and --failOnInvalidStatus, I think we should expose these two arguments and change their default values: 100 for --failOnFailedLimit and true for --failOnInvalidStatus. Both would be sensible defaults to warn the user that something bad is happening and that they should confirm they want to continue. If we agree on this, then since a true default on a boolean flag prevents unsetting it, we should expose --doNotFailOnInvalidStatus at zimit level instead of --failOnInvalidStatus.
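To illustrate the inversion, a rough argparse sketch of how zimit could expose --doNotFailOnInvalidStatus while keeping the crawler's flag effectively true by default (everything besides the two flag names is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--doNotFailOnInvalidStatus",
    action="store_true",
    help="do not treat pages with 4xx/5xx responses as failures",
)
args = parser.parse_args()

crawler_args = []
# Pass the crawler's flag unless the user opted out, so the effective
# zimit default becomes --failOnInvalidStatus=true while the boolean
# stays invertible from the command line.
if not args.doNotFailOnInvalidStatus:
    crawler_args.append("--failOnInvalidStatus")
```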

For sitemap arguments, I propose to use --sitemapFromDate and --sitemapToDate for clarity (plus these are the real names; the short variants are aliases in the browsertrix crawler codebase).

For --selectLinks, we need to expose this CLI argument and (contrary to what I said on Monday) modify the default value to a[href]->href,area[href]->href (in most cases users would probably expect us to also explore the pages behind area links, and should it cause a problem, one can override it by setting the CLI argument).
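For reference, a toy parser showing how the proposed default decomposes under the documented [css selector]->[property to use] / ->@[attribute to use] format (illustrative only, not browsertrix crawler's actual implementation):

```python
def parse_select_links(value: str) -> list[tuple[str, str]]:
    """Split a selectLinks value into (css selector, extraction target)
    pairs; a leading '@' on the target means an attribute, otherwise
    the target is a property."""
    pairs = []
    for entry in value.split(","):
        selector, _, target = entry.partition("->")
        pairs.append((selector, target))
    return pairs

# The proposed new default also follows <area> links:
print(parse_select_links("a[href]->href,area[href]->href"))
# [('a[href]', 'href'), ('area[href]', 'href')]
```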

For --postLoadDelay, nothing special, we just need to add it.

@rgaudin @kelson42 any thoughts?

rgaudin commented 1 week ago

I think we should be cautious about changing browsertrix defaults in zimit. Most of the time it doesn't matter in zimit itself but only in our use of zimit, hence the change should live in the zimit configuration and/or the zimfarm offliner.

That's the case for --failOnFailedLimit IMO.

On --failOnInvalidStatus, I think we could consider it a good thing in general and change the default, but if there's no bundled way to invert the flag (it only sets to true), then I think it's not worth it. Keeping strict compatibility with the crawler (i.e. only extending it) is an important feature (to me).

benoit74 commented 1 week ago

I very much like the idea of keeping strict compatibility/transparency with browsertrix crawler.

I was misled by the fact that --failOnFailedSeed is passed unconditionally, but this is probably a bit different. That said, I think we should change this as well, i.e. expose the CLI argument as-is and only set it to true by default in the zimfarm offliner.

We also need to expose these arguments, which I missed at first look:

      --blockRules                          Additional rules for blocking
                                            certain URLs from being loaded, by
                                            URL regex and optionally via text
                                            match in an iframe
                                                           [array] [default: []]
      --blockMessage                        If specified, when a URL is
                                            blocked, a record with this error
                                            message is added instead
                                                          [string] [default: ""]
      --blockAds, --blockads                If set, block advertisements from
                                            being loaded (based on Stephen
                                            Black's blocklist)
                                                      [boolean] [default: false]
      --adBlockMessage                      If specified, when an ad is
                                            blocked, a record with this error
                                            message is added instead
                                                          [string] [default: ""]

hamoudak commented 1 week ago

Why don't you add this one: --combineWARC? It would be useful because most crawling tasks use --keep.

benoit74 commented 1 week ago

In zimit, combining WARCs is just a significant waste of computing resources and storage, since it means all records have to be parsed and transferred to a new "combined" file.

The combined file is useful only if you want to transport the WARC as one single file, which is not our use case.

Most crawling tasks do use --keep, but it is then the Zimfarm's responsibility to create a tar (much more efficient than a combined WARC) of these files. And this applies only to Zimfarm usage.
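As a sketch of why the tar route is cheap (the glob path is an assumption about where kept WARCs land, not Zimfarm's actual code): archiving only copies file bytes, whereas --combineWARC has to rewrite every record into one file.

```python
import glob
import tarfile

# No WARC record is ever parsed here; files are copied verbatim
# into the tar archive.
with tarfile.open("warcs.tar", "w") as tar:
    for path in sorted(glob.glob("collections/*/archive/*.warc.gz")):
        tar.add(path)
```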

I'm not against exposing this parameter, but I don't see what the usage would be for us or for an end user.

hamoudak commented 1 week ago

I didn't know much about this, but in my case, when I open for example the first WARC of two WARCs in the archive folder, it doesn't display the whole content in the ReplayWeb.page desktop app; it needs the other one. So I searched and found this argument in browsertrix, and I thought I'd tell you how I deal with this.

ilya says:

A quick solution is to combine all WARC files into one, which can be done via command-line, for example: cat *.warc.gz > ./combined/all.warc.gz

But how to do this on the Windows command line, I have no idea.

Edit: I have used the "type" command and got a working combined file. Could you add this parameter temporarily so I can try it out? If not, no problem. Thank you for clarifying this.
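For what it's worth, a small portable equivalent of the cat / type trick (a sketch assuming the .warc.gz files sit in the current directory): byte-wise concatenation works because a series of gzip members is itself a valid gzip stream.

```python
import glob
import shutil

# Append each WARC byte-for-byte into one combined file, mirroring
# `cat *.warc.gz > all.warc.gz` on any platform, Windows included.
with open("all.warc.gz", "wb") as combined:
    for path in sorted(glob.glob("*.warc.gz")):
        with open(path, "rb") as warc:
            shutil.copyfileobj(warc, combined)
```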