piercefreeman / grooveproxy

Groove, a crawling and unit test optimized MITM proxy server.
MIT License

Add request passthrough support #20

Closed piercefreeman closed 2 years ago

piercefreeman commented 2 years ago

For some file types we usually don't want to route through a third-party proxy. Create an optional passthroughRegex that skips the third-party proxy dial and uses a regular net.Dial instead. We already do something similar in the end_proxy logic, which switches dynamically depending on proxy settings.

There are two signals about a passthrough that the proxy server should know about. One is the file type, which we don't know for certain until the response body arrives, so the file URL serves as a stand-in for it. The other is the resourceType, which is a runtime value and only available in a browser-like context.

We can provide some reasonable defaults and/or let clients customize this on their end. We decide to put the logic in the client APIs so the server can remain agnostic to these specific choices.
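
The two signals can be combined into a single client-side check. A minimal sketch, assuming a hypothetical `shouldPassthrough` helper and a rule shaped like `{ regex, resourceTypes }`; neither name is part of Groove's API, and treating either signal as sufficient is an assumption:

```javascript
// Hypothetical helper: decides whether a request should skip the third-party proxy.
// `rule.regex` is matched against the URL (a stand-in for the file type) and
// `rule.resourceTypes` against the browser-reported resource type.
function shouldPassthrough(rule, url, resourceType) {
  const urlMatches = rule.regex instanceof RegExp && rule.regex.test(url);
  const typeMatches =
    Array.isArray(rule.resourceTypes) &&
    resourceType != null &&
    rule.resourceTypes.includes(resourceType);
  // Assumption: either signal alone is enough to pass through.
  return urlMatches || typeMatches;
}

const exampleRule = {
  regex: /\.(?:css|png)(?:[?#]|$)/i,
  resourceTypes: ["image", "font"],
};

// No resource type available (non-browser client), but the URL looks static.
console.log(shouldPassthrough(exampleRule, "https://example.com/app.css", undefined)); // true
// The URL gives no hint, but the browser reports an image.
console.log(shouldPassthrough(exampleRule, "https://example.com/api/items", "image")); // true
// Neither signal matches.
console.log(shouldPassthrough(exampleRule, "https://example.com/api/items", "xhr")); // false
```

Outside a browser-like context the resource type is simply absent, so the check falls back to the URL alone.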

Resource type passthrough

Manipulate requests explicitly to inject a Proxy-Resource-Type header. Filter out this header within the MITM handling logic but add it to the current proxy request context.

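await page.route("**/*", (route, request) => {
    // Tag each outgoing request with its Playwright resource type
    const headers = {...request.headers(), "Proxy-Resource-Type": request.resourceType()}
    return route.continue({headers})
})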

See: https://playwright.dev/docs/api/class-request#request-resource-type
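
The server-side half of this step, filtering the header out while keeping its value for the request context, could look roughly like the following. This is only an illustration in JavaScript (the proxy itself would do this in its MITM handler); `extractResourceType` and the returned field names are hypothetical:

```javascript
const RESOURCE_TYPE_HEADER = "proxy-resource-type";

// Hypothetical step: split the injected header from the headers that are
// forwarded upstream, so the target site never sees it.
function extractResourceType(headers) {
  let resourceType = null;
  const forwarded = {};
  for (const [name, value] of Object.entries(headers)) {
    // HTTP header names are case-insensitive, so compare in lowercase.
    if (name.toLowerCase() === RESOURCE_TYPE_HEADER) {
      resourceType = value;
    } else {
      forwarded[name] = value;
    }
  }
  return { resourceType, forwarded };
}

const extracted = extractResourceType({
  "Accept": "image/png",
  "Proxy-Resource-Type": "image",
});
console.log(extracted.resourceType); // image
console.log(Object.keys(extracted.forwarded)); // [ 'Accept' ]
```

Requests from clients that never set the header come back with a `resourceType` of `null`, which leaves URL matching as the only available signal.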

Approach 1

Static whitelisting of URLs to passthrough to the server. Something like:

POST /api/requests/passthrough
{
    "regex": "...",
    "resourceTypes": "..."
}
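
From a client, registering such a rule might look like the sketch below. The endpoint path comes from the proposal above; the helper name `buildPassthroughRequest`, the base URL, and the choice to send `resourceTypes` as a list are all assumptions made for illustration:

```javascript
// Hypothetical client helper: builds the request for the proposed endpoint
// without sending it, so the payload shape is easy to inspect.
function buildPassthroughRequest(baseUrl, { regex, resourceTypes }) {
  return {
    url: new URL("/api/requests/passthrough", baseUrl).toString(),
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Assumption: the regex travels as a string and resource types as a list.
      body: JSON.stringify({ regex, resourceTypes }),
    },
  };
}

// "http://localhost:5010" is a placeholder for wherever the proxy's API listens.
const passthroughRequest = buildPassthroughRequest("http://localhost:5010", {
  regex: "\\.(?:css|png)$",
  resourceTypes: ["image", "stylesheet"],
});
console.log(passthroughRequest.url);
console.log(passthroughRequest.options.body);
// With a running server: await fetch(passthroughRequest.url, passthroughRequest.options)
```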

Approach 2

Approach 1 has a downside: the host running our proxy service may itself be blocked at the IP level from making these requests, or the site may ensure that a page's URLs are only served to the same IP that accessed the page initially.

Introduce the idea of a "connection backoff chain": a list of proxies that are tried in order, with optional metadata that determines when each one is used. Only back off when a request fails with an actual server status; we don't want to back off and waste bandwidth when requests fail for other reasons (like the TCP handshake).

Also add the idea of "maxBackoffs" to control how far down the chain we go, assuming that users have a pool of different end proxies.

EndProxyChain(
    # open internet, i.e. passthrough
    proxyUrl: none,
    # these filters are OR-ed by default
    onlyApplies: {
        regex: "...",
        resourceTypes: "..."
    }
)

EndProxyChain(
    proxyUrl: "",
)
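
How the chain might be walked can be sketched as follows. This is an illustration under stated assumptions, not the proxy's implementation: `entryApplies`, `fetchWithBackoff`, and the caller-supplied `attempt` function are hypothetical, a `null` proxyUrl stands for the open internet, and any status of 400 or above is assumed to count as a server-level failure.

```javascript
// Hypothetical chain entry: { proxyUrl: string | null, onlyApplies?: { regex?, resourceTypes? } }
function entryApplies(entry, request) {
  const filter = entry.onlyApplies;
  if (!filter) return true; // no filter: the entry applies to every request
  const byUrl = filter.regex ? filter.regex.test(request.url) : false;
  const byType = filter.resourceTypes
    ? filter.resourceTypes.includes(request.resourceType)
    : false;
  return byUrl || byType; // filters are OR-ed by default
}

// `attempt(entry, request)` resolves to { status } and rejects on transport
// errors such as a failed TCP handshake.
async function fetchWithBackoff(chain, request, attempt, maxBackoffs) {
  const candidates = chain.filter((entry) => entryApplies(entry, request));
  if (candidates.length === 0) throw new Error("no applicable end proxy");
  // One initial try plus at most `maxBackoffs` further entries.
  const limit = Math.min(candidates.length, maxBackoffs + 1);
  let response = null;
  for (let i = 0; i < limit; i++) {
    // A rejection here propagates immediately: no backoff on transport errors.
    response = await attempt(candidates[i], request);
    // Assumption: status >= 400 means the server itself refused this route.
    if (response.status < 400) return response;
  }
  return response; // every allowed attempt failed; surface the last response
}

// Demo with a fake attempt function: the passthrough entry is refused,
// so the request backs off to the next proxy in the chain.
const demoChain = [
  { proxyUrl: null, onlyApplies: { resourceTypes: ["image"] } },
  { proxyUrl: "http://proxy-a.example:8080" },
];
const triedProxies = [];
const fakeAttempt = async (entry) => {
  triedProxies.push(entry.proxyUrl);
  return { status: entry.proxyUrl === null ? 403 : 200 };
};
const imageRequest = { url: "https://example.com/a.png", resourceType: "image" };
const demoResult = fetchWithBackoff(demoChain, imageRequest, fakeAttempt, 1);
demoResult.then((res) => console.log(res.status, triedProxies));
```

Keeping transport errors out of the backoff path means an unreachable host costs one attempt rather than one per proxy in the pool.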

Default values

Tracking some good defaults here in case we want to codify this as part of the client defaults.

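const STATIC_FILE_REGEX = new RegExp(
  // The dot is escaped to match a literal "."; the extension must end the URL or precede a query string or fragment
  ".*?\\.(?:txt|json|css|less|js|mjs|cjs|gif|ico|jpe?g|svg|png|webp|mkv|mp4|mpe?g|webm|eot|ttf|woff2?)(?:[?#]|$)",
  "i"
);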

const STATIC_RESOURCE_TYPES = ["script", "image", "stylesheet", "media", "font"]