Open MRuecklCC opened 2 years ago
@RMeissnerCC FYI
I dug a little deeper into this.
The adblock library seems to work quite nicely and mostly does what I expect, e.g. it seems to correctly handle the image & script resource types.
While digging, I also read more about how adblockers work in general. Very much simplified, they seem to do two different things:
The old implementation did only the second thing, i.e. it checked the classes and ids of every element in the DOM against every blocking rule. This is why it took a long time (lots of elements × lots of rules).
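To make the "lots of elements × lots of rules" cost concrete, here is a minimal sketch in plain Python. It is not how the old package or the adblock library is actually implemented; the function names and the token format (`.class` / `#id`) are made up for illustration. It only contrasts a nested loop with a set-based lookup and counts the work done by each.

```python
from typing import Iterable, List, Set, Tuple


def match_naive(elements: List[Set[str]], rules: List[str]) -> Tuple[Set[str], int]:
    """Check every rule against every element (elements x rules comparisons)."""
    hits: Set[str] = set()
    comparisons = 0
    for tokens in elements:
        for rule in rules:
            comparisons += 1
            if rule in tokens:
                hits.add(rule)
    return hits, comparisons


def match_indexed(elements: Iterable[Set[str]], rules: Iterable[str]) -> Tuple[Set[str], int]:
    """Index the rules once, then do one set lookup per class/id token."""
    rule_index = set(rules)
    hits: Set[str] = set()
    lookups = 0
    for tokens in elements:
        for token in tokens:
            lookups += 1
            if token in rule_index:
                hits.add(token)
    return hits, lookups


# Hypothetical example data: each element is the set of its class/id tokens.
example_elements = [{".ad-banner", "#top"}, {".content"}, {"#cookie-consent"}]
example_rules = [".ad-banner", "#cookie-consent", ".sponsored", ".popup"]
```

With the indexed variant the work grows with the number of class/id tokens on the page rather than with the product of elements and rules, which is the kind of saving a dedicated engine can be expected to exploit.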
IMHO it would be a lot more reasonable to distinguish more. E.g.:
With the current playwright browser approach, we can actually track the popups (via `page.on("popup", lambda p: print("popup!"))`).
We can also separately track the requests with their individual resource types (as already implemented in this PR).
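A rough sketch of how both kinds of tracking could be wired up with playwright's sync API. `PageActivity` and `collect` are hypothetical names, not code from this PR. The playwright import is done lazily so the collector itself can be exercised without a browser.

```python
from collections import Counter
from typing import Any, List


class PageActivity:
    """Records popups and requests (grouped by resource type) for one page."""

    def __init__(self) -> None:
        self.popup_urls: List[str] = []
        self.requests_by_type: Counter = Counter()

    def on_popup(self, popup: Any) -> None:
        # `popup` is the newly opened page; its URL may still be about:blank here.
        self.popup_urls.append(popup.url)

    def on_request(self, request: Any) -> None:
        # resource_type is e.g. "document", "script", "image", "xhr", ...
        self.requests_by_type[request.resource_type] += 1

    def attach(self, page: Any) -> None:
        page.on("popup", self.on_popup)
        page.on("request", self.on_request)


def collect(url: str) -> PageActivity:
    """Load `url` in headless Chromium and return what was observed."""
    from playwright.sync_api import sync_playwright  # lazy: only needed here

    activity = PageActivity()
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        activity.attach(page)  # register handlers before navigating
        page.goto(url)
        browser.close()
    return activity
```

Keeping popups and requests in separate fields means each can be reported on its own instead of being folded into one "ads detected" result.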
What got lost, and what I initially didn't consider that important, was the removal of elements from the DOM. This also seems to be supported by the adblock library (e.g. with https://docs.rs/adblock/latest/adblock/engine/struct.Engine.html#method.hidden_class_id_selectors).
On second thought, these rules are quite important, e.g. for detecting cookie banners.
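As far as I can tell from the linked docs, that method is fed the classes and ids seen on the page and returns the matching hiding selectors. A minimal sketch of gathering that input from an HTML string with the standard library follows. `ClassIdCollector` and `collect_classes_and_ids` are hypothetical helpers, and in the playwright setup one would more likely read these values from the live DOM.

```python
from html.parser import HTMLParser
from typing import List, Set, Tuple


class ClassIdCollector(HTMLParser):
    """Collects every class name and id that occurs in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.classes: Set[str] = set()
        self.ids: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue  # attribute without a value, e.g. <div class>
            if name == "class":
                self.classes.update(value.split())
            elif name == "id":
                self.ids.add(value)


def collect_classes_and_ids(html: str) -> Tuple[List[str], List[str]]:
    """Return (sorted classes, sorted ids) found in `html`."""
    collector = ClassIdCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.classes), sorted(collector.ids)


# Hypothetical cookie-banner markup for illustration.
example_html = (
    '<div id="cookie-banner" class="overlay consent">'
    '<p class="consent">We use cookies</p></div>'
)
```

The two lists are what would then be handed to the engine, so the per-page cost depends on the distinct classes and ids rather than on every element being tested against every rule.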
The new adblock dependency is a wrapper around a Rust library. It is expected to perform much better than the old regex-based package, and using it removes a couple of workarounds that were meant to speed up the old implementation (e.g. the google-re2 hack).
In the course of exchanging the dependency, this commit also changes the behaviour of the actual rule based extractors:
TODO: