ulixee / hero

The web browser built for scraping
MIT License

Ability to abort url request #164

Closed by jamesike 1 year ago

jamesike commented 3 years ago

Hello, I was wondering whether there is a way to intercept a URL request and abort it before it is sent.

blakebyrnes commented 3 years ago

Hi @jamesike, there's not currently any ability to block specific requests. We do currently support blocking resources on a more general level with the blockedResourceTypes option (https://secretagent.dev/docs/overview/configuration#blocked-resources).

Are you wanting to block a pattern of urls? Or inspect actual requests and block one-by-one?

janisblaus commented 2 years ago

+1 for this feature.

Something like Puppeteer's

page.on('request', request => {

});

would be so awesome to have here.

Perhaps there is some sort of workaround, @blakebyrnes?

blakebyrnes commented 2 years ago

@janisblaus I definitely want to support blocking urls with a list of wildcard url patterns, but I'm hesitant to add a feature to push every resource to the client to get approval before proceeding in the Man-in-the-middle. Our client/server model just makes something like this feel very heavy. Would a url blocking pattern solve your use-case? Or do you need to inspect them one by one and look at things like header/body/etc?

janisblaus commented 2 years ago

> @janisblaus I definitely want to support blocking urls with a list of wildcard url patterns, but I'm hesitant to add a feature to push every resource to the client to get approval before proceeding in the Man-in-the-middle. Our client/server model just makes something like this feel very heavy. Would a url blocking pattern solve your use-case? Or do you need to inspect them one by one and look at things like header/body/etc?

Pattern blocking sounds limiting. With Puppeteer I use this to inspect requests, retrieve headers and, in some specific cases, even rewrite them.

blakebyrnes commented 2 years ago

I've definitely done that too, but it was usually to fix headers to match what they should have been for Chrome, which should not be a use case for SecretAgent. You can currently inspect every resource that comes through (headers, etc.), and you can write your own plugin to manipulate any headers. That approach works if you have a pattern of things you want to do for every request, but it's a bit of the wrong approach if this is unique per scrape. Do you have cases where you've wanted to do something different on a per-scrape basis?

janisblaus commented 2 years ago

I guess writing a plugin would be enough for such a use case, yes, will definitely look into it.

blakebyrnes commented 1 year ago

This feature should be implemented by existing "request" events along with an "abort" function on them. We might need a mode here that "pauses" the request until a client responds with a continue or abort... This is a good place to also allow a "hook" to wait for the response body as a ReadableStream.

NOTE: we need discussion on whether the default "request" event should activate an "abort" feature or if that's an additional function.

NOTE 2: we could also consider implementing this as a reference plugin.
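
The "pause until the client answers" mode described above could look roughly like the following TypeScript sketch. It is only an illustration under assumptions: `PausedRequest`, `whenDecided` and `getState` are hypothetical names, not part of Hero's API, and a real implementation would still have to carry the decision across the client/core connection.

```typescript
// Hypothetical sketch: a request that stays paused until continue() or abort() is called.
type Decision = 'continue' | 'abort';

class PausedRequest {
  private decided: Decision | undefined;
  private waiters: Array<(decision: Decision) => void> = [];

  constructor(public readonly url: string) {}

  // Current state: 'paused' until the client has answered.
  getState(): Decision | 'paused' {
    return this.decided ?? 'paused';
  }

  continue(): void {
    this.settle('continue');
  }

  abort(): void {
    this.settle('abort');
  }

  // Resolves once a decision exists; resolves immediately if one was already made.
  whenDecided(): Promise<Decision> {
    if (this.decided !== undefined) return Promise.resolve(this.decided);
    return new Promise(resolve => this.waiters.push(resolve));
  }

  private settle(decision: Decision): void {
    if (this.decided !== undefined) return; // the first answer wins
    this.decided = decision;
    for (const waiter of this.waiters) waiter(decision);
    this.waiters = [];
  }
}
```

In this sketch the proxy side would `await request.whenDecided()` before forwarding or dropping the request, while a client-side "request" listener calls `abort()` or `continue()`.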

GlenDC commented 1 year ago

I agree that ideally this can be done as a plugin, which can be one of the available plugins in a ulixee-controlled repo/folder of plugins available for opt-in.

I also would very much like this, as it allows you to save on a lot of wasted resources, given how much unrequired crap most websites download in the background these days...

blakebyrnes commented 1 year ago

The shorter path for much of this need is to simply add blockedResourceUrls as a config option to Hero (a series of regexes or strings). It's already part of the Mitm code and interceptorHandlers pattern; we just need the configuration. Maybe we should log this separately - I think it's the feature you want.

A request interception plugin is a bit of a more advanced scenario where you need to block and/or modify post data, response data, etc. If we add blockedResourceUrls, the only time you'd still need the ability to just plain abort is when you don't know what the url will look like (or there are 2 good ones, and a 3rd bad one you want to block).
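
To make the "series of regexes or strings" idea concrete, here is a minimal TypeScript sketch of matching a URL against such a list. It is not Hero's matching code: `isBlocked` and `wildcardToRegExp` are hypothetical helpers, and treating `*` in plain strings as a wildcard that must match the whole URL is an assumption made for illustration.

```typescript
// Hypothetical sketch: decide whether a URL matches a list of string or RegExp patterns.
type UrlPattern = string | RegExp;

// Assumption: in plain strings, `*` matches any run of characters and the
// pattern must cover the whole URL.
function wildcardToRegExp(pattern: string): RegExp {
  // Escape every regex metacharacter except `*`.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

function isBlocked(url: string, patterns: UrlPattern[]): boolean {
  return patterns.some(pattern => {
    if (typeof pattern === 'string') return wildcardToRegExp(pattern).test(url);
    pattern.lastIndex = 0; // keep global/sticky regexes from carrying state between calls
    return pattern.test(url);
  });
}

// Example pattern list mixing a wildcard string and a regex.
const examplePatterns: UrlPattern[] = ['https://ads.example.com/*', /\.woff2?$/];
```

With these example patterns, `isBlocked('https://ads.example.com/banner.js', examplePatterns)` is true, while a URL such as `https://example.com/app.js` is left alone.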

GlenDC commented 1 year ago

> I think it's the feature you want.

Sounds about right. Okay, let's go for that :)

blakebyrnes commented 1 year ago

Here's a good starting place: you can see the existing configs coming in (they'll need to be added to client): https://github.com/ulixee/hero/blob/f7b3d0d07931fd8e06b5c6fa8c8477a4577450e3/core/lib/Tab.ts#L820

Regexps will automatically traverse the connection, but in places where we match urls, we usually allow plain strings too (an example of this is how we do waitForResource).

rjbks commented 1 year ago

> A request interception plugin is a bit of a more advanced scenario where you need to block and/or modify post data, response data, etc.

If that's what I'm looking for, what's a good starting point, assuming a plugin is the place to do this? Are there any specific examples I can reference?

My use case would be to intercept a script resource, modify it and return the modified script.

blakebyrnes commented 1 year ago

> My use case would be intercept a script resource, modify it and return the modified script.

I think we need to make some minor modifications to the current plugin structure to support this. Likely we should add an additional callback to the beforeHttpRequest and beforeHttpResponse calls that indicates you have handled request processing and would like to halt request processing.
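
As a rough illustration of that "I handled it, stop processing" callback, here is a self-contained TypeScript sketch. It does not use Hero's plugin API: `processRequest`, `BeforeRequestHook`, `respondWith` and the `Sketch*` types are hypothetical names, and the real beforeHttpRequest/beforeHttpResponse signatures may differ.

```typescript
// Hypothetical sketch: hooks can short-circuit request processing by calling respondWith().
interface SketchRequest {
  url: string;
}

interface SketchResponse {
  status: number;
  body: string;
}

// A hook calls `respondWith` to signal that it has fully handled the request.
type BeforeRequestHook = (
  request: SketchRequest,
  respondWith: (response: SketchResponse) => void,
) => void;

function processRequest(
  request: SketchRequest,
  hooks: BeforeRequestHook[],
  fetchUpstream: (request: SketchRequest) => SketchResponse,
): SketchResponse {
  for (const hook of hooks) {
    const outcome: { response?: SketchResponse } = {};
    hook(request, response => {
      outcome.response = response;
    });
    // The hook handled the request: skip the remaining hooks and the upstream fetch.
    if (outcome.response) return outcome.response;
  }
  return fetchUpstream(request);
}

// Example hook: serve a replacement body for one script URL.
const replaceScript: BeforeRequestHook = (request, respondWith) => {
  if (request.url.endsWith('/app.js')) {
    respondWith({ status: 200, body: 'console.log("patched");' });
  }
};
```

In this sketch a request for `/app.js` never reaches the upstream fetch, while every other request passes through the hooks untouched.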

If you simply need a temporary way around this, you can subscribe to new Agent creation directly in HeroCore (note that this is a semi-internal api and is not documented because of that).

import HeroCore from '@ulixee/hero-core';

// HeroCore has to be started before you can register the event.
// You can also do this by starting a Ulixee Server.
await HeroCore.start();

HeroCore.pool.on('agent-created', ({ agent }) => {
  agent.mitmRequestSession.interceptorHandlers.push({
    urls: ['SCRIPT_URL', new RegExp('Or regex')],
    handlerFn(url, type, request, response) {
      response.end(`<YOUR SCRIPT>`);
      return true;
    },
  });
});

GlenDC commented 1 year ago

blockedResourceUrls is now available and exposed. Still have to add automated tests.

rjbks commented 1 year ago

> HeroCore.pool.on('agent-created', ({ agent }) => {
>   agent.mitmRequestSession.interceptorHandlers.push({
>     urls: ['SCRIPT_URL', new RegExp('Or regex')],
>     handlerFn(url, type, request, response) {
>       response.end(`<YOUR SCRIPT>`);
>       return true;
>     },
>   });
> });

How would I then proceed to get the script contents by making the request from the browser tab context that originated the request? I can't seem to find anything in the agent object or the browserContext.