WIP: Replace outdated adblockparser dependency with newer adblock

MRuecklCC commented 2 years ago

The new adblock dependency is a wrapper around a Rust library. It is expected to perform much better than the old regex-based package, and using it removes a couple of workarounds that were meant to speed up the old implementation (e.g. the google-re2 hack).

In the course of exchanging the dependency, this commit also changes the behaviour of the actual rule-based extractors:

- They no longer use the "raw links" extracted from the received HTML.
- Instead, they use the requests recorded while loading the page. This means that far fewer URLs need to be checked against the adblock rules, which should further speed up the analysis. It also means that a static link in the HTML that previously would have triggered the extractor will no longer do so.
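
To illustrate the changed behaviour, here is a minimal sketch of checking only the recorded requests against the rules. `RecordedRequest`, `matched_requests` and `make_adblock_matcher` are hypothetical names and not the code in this PR; the matcher is injected so that the loop itself does not depend on the engine. The `adblock` calls follow the package README as far as I know it and should be treated as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class RecordedRequest:
    """Hypothetical record of one request observed while the page loaded."""

    url: str
    resource_type: str  # e.g. "script", "image", "stylesheet"


# A matcher receives (url, source_url, resource_type) and reports a rule match.
Matcher = Callable[[str, str, str], bool]


def matched_requests(
    page_url: str, requests: Iterable[RecordedRequest], matches: Matcher
) -> List[str]:
    """Return the URLs of all recorded requests that match a blocking rule."""
    return [r.url for r in requests if matches(r.url, page_url, r.resource_type)]


def make_adblock_matcher(filter_list_text: str) -> Matcher:
    """Build a matcher backed by the third-party `adblock` package.

    Assumption: the package exposes FilterSet, Engine and
    check_network_urls as shown in its README.
    """
    import adblock  # imported lazily so the rest works without the package

    filter_set = adblock.FilterSet()
    filter_set.add_filter_list(filter_list_text)
    engine = adblock.Engine(filter_set=filter_set)

    def matches(url: str, source_url: str, resource_type: str) -> bool:
        result = engine.check_network_urls(
            url=url, source_url=source_url, request_type=resource_type
        )
        return result.matched

    return matches
```

A static `<a href>` in the HTML never shows up in `requests` unless the browser actually fetched it, which is why such links no longer trigger the extractor.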

TODO:

[ ] fix serialization of the engine so that rule evaluation can be moved to the process pool
[ ] alternatively: make sure that the cost of rule evaluation with the new implementation is negligible and doesn't benefit from dispatch to the process pool
[x] merge #164 and rebase
[ ] fix unit test: at the moment it seems that e.g. the script-block-based filter in the test does not behave as expected.
[ ] investigate why poetry.lock is so much bigger
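
One possible way around the serialization item, sketched under assumptions: instead of pickling the engine, every worker process builds its own engine once from the filter text via the pool's `initializer`. `build_matcher` is a hypothetical stand-in (naive substring rules) for the real engine construction; only the `concurrent.futures` usage is meant literally.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Optional, Sequence

# Per-process matcher; set by init_worker inside each worker process.
_matcher: Optional[Callable[[str], bool]] = None


def build_matcher(filter_text: str) -> Callable[[str], bool]:
    """Stand-in for engine construction: one substring pattern per line."""
    patterns = [line.strip() for line in filter_text.splitlines() if line.strip()]
    return lambda url: any(pattern in url for pattern in patterns)


def init_worker(filter_text: str) -> None:
    """Runs once per worker; only the plain filter text crosses the process boundary."""
    global _matcher
    _matcher = build_matcher(filter_text)


def check_url(url: str) -> bool:
    """Evaluate one URL with the matcher built in this process."""
    if _matcher is None:
        raise RuntimeError("init_worker must run before check_url")
    return _matcher(url)


def check_urls_in_pool(
    filter_text: str, urls: Sequence[str], workers: int = 2
) -> List[bool]:
    """Dispatch URL checks to a process pool with per-worker matchers."""
    with ProcessPoolExecutor(
        max_workers=workers, initializer=init_worker, initargs=(filter_text,)
    ) as pool:
        return list(pool.map(check_url, urls))
```

On platforms that spawn worker processes, `check_urls_in_pool` has to be called from under an `if __name__ == "__main__":` guard. Whether the dispatch pays off at all is exactly the second TODO item and would need measuring.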

MRuecklCC commented 2 years ago

@RMeissnerCC FYI

MRuecklCC commented 2 years ago

I dug a little deeper into this.

The adblock library seems to work quite nicely and mostly does what I expect, e.g. it seems to correctly handle the image & script resource types.

While digging, I also read more about how adblockers work in general. Very much simplified, they seem to do two different things:

- block outgoing requests and prevent ads from loading in the first place, via domain filters and the type of the resource that is loaded (script, image, stylesheet, ...)
- remove elements from the DOM (e.g. if the element was supposed to show the loaded content) by matching rules against the classes or ids of the HTML elements.

The old implementation did only the second thing, i.e. it validated all sorts of classes and ids of all the elements of the DOM against all the blocking rules. This is why it took a long time (lots of elements x lots of rules).

IMHO it would be a lot more reasonable to distinguish between the different kinds of rules, e.g.:

- https://raw.githubusercontent.com/easylist/easylist/master/easylist/easylist_adservers_popup.txt contains filters for popups, which (likely) would never be triggered by the old implementation.
- Other files contain mostly rules for element hiding (e.g. https://raw.githubusercontent.com/easylist/easylist/master/easylist/easylist_specific_hide.txt).
- And then other files seem to mostly contain rules for blocking requests.
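
To get a feeling for what a given list file mostly contains, a rough heuristic sketch could count the rule kinds. It relies on the Adblock Plus filter syntax (comments start with `!`, element hiding rules contain `##` or one of its variants); `classify_rule` and `summarize` are hypothetical helper names, and the classification is deliberately coarse.

```python
from collections import Counter

# Separators that mark element hiding (cosmetic) rules and their variants.
COSMETIC_SEPARATORS = ("##", "#@#", "#?#", "#$#")


def classify_rule(line: str) -> str:
    """Coarsely classify one filter list line as comment, cosmetic or network."""
    rule = line.strip()
    if not rule or rule.startswith("!"):
        return "comment"
    if rule.startswith("[") and rule.endswith("]"):
        return "comment"  # list header such as "[Adblock Plus 2.0]"
    if any(separator in rule for separator in COSMETIC_SEPARATORS):
        return "cosmetic"
    return "network"  # includes exception rules starting with "@@"


def summarize(filter_text: str) -> Counter:
    """Count how many lines of each kind a filter list contains."""
    return Counter(classify_rule(line) for line in filter_text.splitlines())
```

Running `summarize` over the downloaded text of each list would show which files are dominated by request blocking rules and which by element hiding rules.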

With the current playwright browser approach, we can actually track the popups (via page.on("popup", lambda p: print("popup!"))).

We can also separately track the requests with their individual resource types (as already implemented in this PR).
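
Both kinds of tracking could be combined roughly as follows. This is a sketch, not the implementation in this PR: `PageActivity` and `record_activity` are hypothetical names, and the sync playwright API is used only for brevity.

```python
from typing import List, Tuple


class PageActivity:
    """Collects the requests and popups observed while a page loads."""

    def __init__(self) -> None:
        self.requests: List[Tuple[str, str]] = []  # (url, resource_type)
        self.popup_urls: List[str] = []

    def on_request(self, request) -> None:
        # playwright Request objects expose .url and .resource_type
        self.requests.append((request.url, request.resource_type))

    def on_popup(self, popup) -> None:
        # The popup is a Page; its URL may still be "about:blank" at this point.
        self.popup_urls.append(popup.url)


def record_activity(url: str) -> PageActivity:
    """Load `url` in a headless browser and record requests and popups."""
    from playwright.sync_api import sync_playwright  # third-party, lazy import

    activity = PageActivity()
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        page.on("request", activity.on_request)
        page.on("popup", activity.on_popup)
        page.goto(url)
        browser.close()
    return activity
```

The recorded `(url, resource_type)` pairs are what the request blocking rules would be checked against, while the popup URLs could be checked against the popup lists.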

What got lost - and what I didn't consider as that important initially - was the removal of elements from the DOM. This seems to also be supported by the adblock library (e.g. with https://docs.rs/adblock/latest/adblock/engine/struct.Engine.html#method.hidden_class_id_selectors).

On second thought, these element hiding rules are quite important, e.g. for detecting cookie banners.
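
As a sketch of the input such a check would need, the classes and ids can be collected from the received HTML with the standard library alone. `ClassIdCollector` and `collect_classes_and_ids` are hypothetical names. Passing the result on to the engine assumes that the Python wrapper exposes `hidden_class_id_selectors` like the Rust API linked above, which I have not verified.

```python
from html.parser import HTMLParser
from typing import Set, Tuple


class ClassIdCollector(HTMLParser):
    """Collects every class name and id that appears in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.classes: Set[str] = set()
        self.ids: Set[str] = set()

    def handle_starttag(self, tag, attrs) -> None:
        # Also called for self-closing tags via handle_startendtag.
        for name, value in attrs:
            if not value:
                continue
            if name == "class":
                self.classes.update(value.split())
            elif name == "id":
                self.ids.add(value.strip())


def collect_classes_and_ids(html: str) -> Tuple[Set[str], Set[str]]:
    """Return (classes, ids) found in the given HTML string."""
    collector = ClassIdCollector()
    collector.feed(html)
    collector.close()
    return collector.classes, collector.ids
```

Since each class and id is collected only once per page, the number of lookups is bounded by the number of distinct names rather than by elements times rules.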

openeduhub/metalookup#166