ghostery / adblocker

Efficient embeddable adblocker library
https://www.ghostery.com
Mozilla Public License 2.0
792 stars 101 forks source link

Extremely slow request decision #179

Closed sentialx closed 5 years ago

sentialx commented 5 years ago

Hi, I noticed after update (0.10 I guess) the engine.match method became super-slow. Here's the code I'm using in my Electron app, and it literally hangs up:

 webRequest.onBeforeRequest(
    { urls: ['<all_urls>'] },
    async (details: Electron.OnBeforeRequestDetails, callback: any) => {
      if (engine && settings.isShieldToggled) {
        console.time('engine.match');
        const { match, redirect } = engine.match(
          Request.fromRawDetails({
            type: details.resourceType as any,
            url: details.url,
          }),
        );
        console.timeEnd('engine.match');

        if (match || redirect) {
          appWindow.webContents.send(`blocked-ad-${details.webContentsId}`);

          if (redirect) {
            callback({ redirectURL: redirect });
          } else {
            callback({ cancel: true });
          }

          return;
        }
      }

      callback({ cancel: false });
    },
  );

image

remusao commented 5 years ago

Hi @sentialx,

Thanks for opening the issue. These timings definitely seem extremely high and that is surprising. The micro-benchmarks that we run before releases do not show any obvious regression so I need to investigate a bit more. Could you point me to the repository and/or some instructions so that I can run your project on my side?

In the future, I'd like to add a simple Electron example here which could be used both as documentation and integration tests to make sure adblocking works as expected in the different environment we aim at supporting. (we already have webextension and puppeteer there).

Best, Rémi

sentialx commented 5 years ago

Sure, https://github.com/wexond/wexond You can either download a 2.2.0 release, or run it by yourself following the Running section.

sentialx commented 5 years ago

Adblocker related code you can find here.

sentialx commented 5 years ago

After downgrade to 0.9.1, tested on https://wp.pl (since it's bloated with ads): image

remusao commented 5 years ago

Thanks. Sorry I am not very familiar with Electron environment. I was able to start the browser locally but not sure where I can access the logs emitted from the adblock service. Is there any specific devtool I need to open? Is the adblocker service enabled by default?

sentialx commented 5 years ago

I didn't push a commit with the console.time benchmarks, but logs from main process (that's where the adblock service is) are in terminal where you run npm run dev.

remusao commented 5 years ago

I added some logs manually. In the terminal I get some exceptions:

(node:5381) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, scandir '/home/user/.wexond/extensions'
(node:5381) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:5381) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

I assume I might need to create a config in /home/user/.wexond/extensions before running the project?

sentialx commented 5 years ago

It throws the error in the first run, but it does nothing. If you apply some changes to the main process, you need to restart npm run dev.

remusao commented 5 years ago

Ok if I create this extensions directory manually I start seeing the logs.

remusao commented 5 years ago

As far as I can see it seems that requests which are not blocked take a huge amount of time to be "processed". I will investigate more to pin-point the exact issue.

remusao commented 5 years ago

@sentialx Here is what is happening. After 0.9.1, we fixed an issue in the way we dispatch filters (a.k.a. tokenization). Filters such as foo.com cannot be assumed to be hostnames because we don't know what the intent of the list creator was. It could be that this filter needs to match URLs like https://somed.domain?query=blablafoo.com/ where this sub-string can appear anywhere. To be conservative and make sure we don't miss this match, the adblocker needs to check the appearance of foo.com substring in all requests. The issue is that the list http://mirror1.malwaredomains.com/files/justdomains is not making use of optimal syntax for hostname blocking. For example the following filter 5bw.ru should instead be written as ||5bw.ru^ to specify that 5bw.ru is in fact a hostname and should match accordingly. This will allow the adblocker to be much more efficient at handling these.

What I would recommend is to either transform this list to use the usual filters syntax, or to stop using this particular list. Maybe the adserver list could be a good replacement. Alternatively, you could adopt the following base lists which I know are well-maintained and offer great coverage:

const lists = {
  easyprivacy: 'https://easylist.to/easylist/easyprivacy.txt',
  easylist: 'https://easylist.to/easylist/easylist.txt',

  'plowe-0': 'https://pgl.yoyo.org/adservers/serverlist.php?hostformat=adblockplus&showintro=0&mimetype=plaintext',

  'ublock-badware': 'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/badware.txt',
  'ublock-filters': 'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/filters.txt',
  'ublock-unbreak': 'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/unbreak.txt',
  'ublock-privacy': 'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/privacy.txt',
  'ublock-abuse': 'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/resource-abuse.txt',
};

The ublock-abuse fulfills a similar purpose as compared to the nocoin list.

To benefit from redirect rules you might also need to download the content of ublock-resources and initialize the engine like so;

engine = FiltersEngine.parse(data); // the code you already have does not need to be changed

// this needs to be added
const resources = (await Axios.get('https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/resources.txt')).data;
engine.updateResources(resources, '' + resources.length);

I hope that helpers!

remusao commented 5 years ago

Also, I would really like the adblocker to offer a more robust support for Electron out of the box. Similarly to what was done recently for puppeteer, we could create a specific wrapper + example. See the puppeteer wrapper and puppeteer example. Since I am not extremely familiar with this ecosystem, I would love to get some outside contribution here; if that is something you would consider doing, I think it would be an awesome addition for the project + would allow me to make sure that Electron support is great for future releases.

sentialx commented 5 years ago

Thank you for taking your time for investigating this issue. It would never come to my mind that it's a particular list fault.

Also, I would really like the adblocker to offer a more robust support for Electron out of the box...

Sure, I will take a look at this.

sentialx commented 5 years ago

Btw, I'm not sure if I implemented cosmetic filtering correctly. It injects the styles after did-start-navigation event fired, but as I can see in the puppeteer wrapper it's different.

 if (engine && settings.isShieldToggled) {
        if (url === '') return;

        const { styles, scripts } = engine.getCosmeticsFilters({
          url,
          ...parse(url),
        });

        this.webContents.insertCSS(styles);

        for (const script of scripts) {
          this.webContents.executeJavaScript(script);
        }
      }
remusao commented 5 years ago

To make optimal use of cosmetics (both in terms of performance and coverage), the handling needs to be slightly more involved. The way to do it optimally depends on the capabilities. If we have a way to inject "content scripts" in pages via Electron APIs, then you can implement the full capabilities like in WebExtensions. The ideal scenario with cosmetics is this one:

  1. as soon as a frame is created (either main frame of a page or sub-frames/iframes), inject generic cosmetics and scripts
  2. when DOM is ready, we can query the available node IDs, classes and hrefs present in the page, which allows the adblocker to return a very small subset of custom styles to inject in the page
  3. we can optionally monitor changes in the page to detect additions or modifications of elements which should be hidden (e.g.: new ads being loaded). This is done using MutationObserver in the web extension context and maybe there is a way to do the same for Electron.

In puppeteer we only do 1. and 2. at the moment. But probably we can do 3. for Electron.

remusao commented 5 years ago

I can see that there is a way to inject scripts using the setPreloads(...) API (https://electronjs.org/docs/api/session#sessetpreloadspreloads). If that injects scripts in all frames and if there is a way to send messages from these scripts to the main process, then I believe we should be able to implement the full cosmetics (I think no one did that on Electron before?).

sentialx commented 5 years ago

Well, the setPreloads API doesn't inject scripts in all frames, only in the main frame, however this can be done in the preload script. Messaging works without a problem.

(I think no one did that on Electron before?)

Correct. The only available adblocker for Electron is Brave/ad-block, but it doesn't support cosmetic filtering.

remusao commented 5 years ago

Correct. The only available adblocker for Electron was Brave/ad-block, but it doesn't support cosmetic filtering.

Then it would be awesome to figure this out!

I suggest to look at the way it's currently done for WebExtensions. Here is the part running in frames: content.ts. It's injected in all frames and communicate with the background.ts using a simple protocol. Some parts could probably be abstracted away so that the same logic can be used also for Electron.