matteocargnelutti opened this issue 1 year ago
Allow list of protocols: good defaults ✅

By default, browsertrix only accepts `http://` and `https://` urls. This greatly reduces the risk of accidental capture of `file://` or `chrome://` urls, for example:

Invalid Seed "chrome://settings" - URL must start with http:// or https://

`--exclude` / `--blockRules` to limit what the crawler can see

The `--exclude` param seems to be designed more as a way to exclude certain urls / paths "down the road", in a multi-page crawling scenario. I am not sure it matches our use case, and I think it would make sense to implement our own filtering at the API level before adding a URL to the queue, for example to exclude certain IP ranges.
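To make the "filter before enqueueing" idea concrete, here is a minimal sketch of the kind of pre-flight check that could run on our side before a URL is handed to the crawler. Everything in it is illustrative: the function name, the allowed schemes, and the choice of blocked networks are assumptions, not an existing implementation.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Illustrative policy only; the real allow/deny lists are still to be decided.
ALLOWED_SCHEMES = {"http", "https"}
BLOCKED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("::1/128"),         # IPv6 loopback
    ipaddress.ip_network("10.0.0.0/8"),      # RFC 1918 private ranges
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("169.254.0.0/16"),  # link-local
]


def url_is_allowed(url: str) -> bool:
    """Hypothetical pre-flight check: scheme allow list + blocked IP ranges."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        # Resolve every address the hostname points to (A and AAAA records).
        infos = socket.getaddrinfo(parts.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        address = ipaddress.ip_address(info[4][0])
        if any(address in network for network in BLOCKED_NETWORKS):
            return False
    return True


if __name__ == "__main__":
    print(url_is_allowed("chrome://settings"))       # False: scheme not allowed
    print(url_is_allowed("http://127.0.0.1:8000/"))  # False: resolves to loopback
```

Resolving the hostname up front also catches public-looking DNS names that point at internal addresses, which a scheme or pattern check alone would miss.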
`<iframe>`: option available ✔️

`--blockRules` works as intended for this use case. I was able to use it to prevent the crawler from capturing the content of an `<iframe>` which was pointing to a domain only accessible from the network the crawler is on.
The slight downside is that we'll likely spend quite some time devising and testing these regular expressions.
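Since most of that effort will go into the patterns themselves, a small test harness could help. The sketch below does not replicate how browsertrix-crawler applies `--blockRules`; it only checks candidate regular expressions against URLs we expect to be blocked or allowed, and both the patterns and the URLs are placeholders.

```python
import re

# Hypothetical candidate patterns for --blockRules; domains are placeholders.
BLOCK_PATTERNS = [
    r"^https?://([^/]+\.)?intranet\.example\.org/",
    r"^https?://10\.\d+\.\d+\.\d+(:\d+)?/",
]

# (url, expected_to_be_blocked) pairs used as a quick regression check.
CASES = [
    ("https://intranet.example.org/dashboard", True),
    ("https://sub.intranet.example.org/", True),
    ("http://10.0.12.7:8080/status", True),
    ("https://example.com/articles/intranet.example.org", False),
    ("https://example.com/", False),
]


def is_blocked(url: str) -> bool:
    """True if any candidate pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in BLOCK_PATTERNS)


for url, expected in CASES:
    result = is_blocked(url)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{status:8} blocked={result!s:5} {url}")
```

Keeping a list of expected outcomes like this makes it cheap to re-check the patterns whenever a rule is added or tightened.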
(Belatedly capturing a discussion from last week or the week before)
Potentially to be discussed: what happens if you pass in basic auth (credentials embedded in the URL). I think the Pydantic validator is fine with that, but am not sure... Perma presently forbids that for target URLs; I don't know to what extent we are committed to that decision / would need it to be enforced at this level.
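For reference, a standard-library sketch of the kind of check that would enforce this at the validation layer. The function name is illustrative and this is not Perma's actual validator, but the same logic could be wired into the Pydantic validator if we decide to keep the restriction:

```python
from urllib.parse import urlsplit


def reject_embedded_credentials(url: str) -> str:
    """Raise if the URL carries userinfo (basic auth), e.g. https://user:pass@host/."""
    parts = urlsplit(url)
    if parts.username is not None or parts.password is not None:
        raise ValueError("URLs with embedded credentials are not accepted")
    return url


# reject_embedded_credentials("https://user:secret@example.com/")  # raises ValueError
# reject_embedded_credentials("https://example.com/")              # returned unchanged
```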
Identified capture constraints

`.warc.gz` and `.wacz` files.

Evaluating browsertrix-crawler

Setting limits:
Benchmarking and misc:
Signing
In parallel