philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://flyscrape.com
Mozilla Public License 2.0

Custom request headers #9

Closed philippta closed 9 months ago

philippta commented 10 months ago

Custom request headers should be supported as a headers config option. A new headers module should be created for this.

Proposed example:

export const config = {
  headers: {
    "Authorization": "Bearer ey...",
    "User-Agent": "Mozilla/5.0 ...",
  },
};

Ref:

rafiramadhana commented 9 months ago

@philippta can i work on this? thanks

rafiramadhana commented 9 months ago

btw the idea of structuring code in modules is quite interesting

may i know your references (e.g. other repos or articles) regarding that modules code structuring?

philippta commented 9 months ago

@rafiramadhana Yes, you can work on this. A few things to consider:

flyscrape.RequestBuilder would be the easiest way to support custom headers. However, this hook is only called in the scraper and not for file downloads.

Instead you could use the flyscrape.TransportAdapter hook, which intercepts all requests, including file downloads.

Here is an example for a TransportAdapter: https://github.com/philippta/flyscrape/blob/94da9293f63e46712b0a890e1e0eab4153fdb3f9/modules/proxy/proxy.go#L48-L57

TransportAdapters are also a bit special, though. They have to be applied in a specific order, which is specified here. The bottom most adapter is applied first, which is what you want for headers. https://github.com/philippta/flyscrape/blob/eae10426cd805ecc0a0459b61639e48e6cd913ad/module.go#L94-L100
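
A minimal sketch of what such an adapter could look like, assuming only the TransportAdapter shape discussed in this thread. The names HeadersModule, Headers, roundTripperFunc and seenHeader are hypothetical and are not taken from the flyscrape codebase. The base transport is a fake that records what it receives, so nothing is sent over the network.

```go
package main

import (
	"fmt"
	"net/http"
)

// roundTripperFunc lets a plain function satisfy http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// HeadersModule is a hypothetical module; its name and field are assumptions.
type HeadersModule struct {
	Headers map[string]string
}

// AdaptTransport wraps t so every request passing through carries the configured headers.
func (m HeadersModule) AdaptTransport(t http.RoundTripper) http.RoundTripper {
	if len(m.Headers) == 0 {
		return t // nothing to add, keep the transport untouched
	}
	return roundTripperFunc(func(r *http.Request) (*http.Response, error) {
		// A RoundTripper should not modify the caller's request, so work on a clone.
		r2 := r.Clone(r.Context())
		for k, v := range m.Headers {
			r2.Header.Set(k, v)
		}
		return t.RoundTrip(r2)
	})
}

// seenHeader sends one request through the adapted transport and reports
// the value of the named header as observed by a fake base transport.
func seenHeader(m HeadersModule, name string) (string, error) {
	var seen string
	base := roundTripperFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get(name)
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: r}, nil
	})
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		return "", err
	}
	resp, err := m.AdaptTransport(base).RoundTrip(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	m := HeadersModule{Headers: map[string]string{"User-Agent": "flyscrape-test"}}
	ua, err := seenHeader(m, "User-Agent")
	if err != nil {
		panic(err)
	}
	fmt.Println(ua)
}
```

Because the headers are set on the transport rather than in a request builder, any request that goes through the same http.Client would pick them up, which is the property wanted for file downloads.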


may i know your references (e.g. other repos or articles) regarding that modules code structuring?

It is similar to how Caddy Modules work, but way less elaborate.

rafiramadhana commented 9 months ago

The bottom most adapter is applied first, which is what you want for headers.

sry, i'm a bit confused by this sentence because there are bottom most and first in one sentence

do you mean,

"The bottom most adapter (the AdaptTransport impl of headers module) is applied first (the headers module should be put at first in moduleOrder), which is what you want for headers."

moduleOrder = []string{
    // Transport adapters must be loaded in a specific order.
    // All other modules can be loaded in any order.
    "headers", // New `headers` module
    "proxy",
    "ratelimit",
    "cache",
}

philippta commented 9 months ago

TL;DR The "headers" module should be last in the moduleOrder list like so, but let me explain.

moduleOrder = []string{
    "proxy",
    "ratelimit",
    "cache",
    "headers", // New `headers` module
}

For reference:

type TransportAdapter interface {
    AdaptTransport(http.RoundTripper) http.RoundTripper
}

AdaptTransport takes an http.RoundTripper and returns a new (wrapped/adapted) http.RoundTripper, similar to how HTTP middlewares work in almost all Go routers and web frameworks.

We can use the http.DefaultTransport as a starting point and adapt it with more functionality like so:

myClient := http.Client{
    Transport: moduleA.AdaptTransport(http.DefaultTransport),
}

If we had another module, we can adapt the already adapted transport.

myClient := http.Client{
    Transport: moduleB.AdaptTransport(moduleA.AdaptTransport(http.DefaultTransport)),
}

We could keep going indefinitely, adding more and more adapters to the call chain.

Ultimately the request would then travel like this:

http.Client (sends request) -> moduleB -> moduleA -> http.DefaultTransport -> Internet

Now, to why the order is reversed:

finaltransport := http.DefaultTransport

// Each iteration wraps the transport built so far, so later modules end up further out.
for _, mod := range allModules { // 1. proxy, 2. ratelimit, 3. cache, 4. headers
    finaltransport = mod.AdaptTransport(finaltransport)
}

client := http.Client{
    Transport: finaltransport, // headers(cache(ratelimit(proxy(http.DefaultTransport))))
}

The last module ends up outermost in the onion-like call chain, so it gets to mangle the HTTP request first.
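
The ordering can be checked with a small self-contained sketch. It assumes nothing about flyscrape beyond the AdaptTransport shape shown above: tracer, callOrder and roundTripperFunc are hypothetical stand-ins, and the base transport is a fake so no request leaves the process.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// roundTripperFunc lets a plain function satisfy http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// tracer is a stand-in module that records its name when a request passes through it.
type tracer struct {
	name string
	seen *[]string
}

func (t tracer) AdaptTransport(next http.RoundTripper) http.RoundTripper {
	return roundTripperFunc(func(r *http.Request) (*http.Response, error) {
		*t.seen = append(*t.seen, t.name)
		return next.RoundTrip(r)
	})
}

// callOrder wraps a fake base transport with the named modules, in list order,
// and returns the order in which they see a single request.
func callOrder(moduleOrder []string) (string, error) {
	var seen []string
	var transport http.RoundTripper = roundTripperFunc(func(r *http.Request) (*http.Response, error) {
		seen = append(seen, "transport")
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: r}, nil
	})
	// Same loop as above: every module wraps what has been built so far.
	for _, name := range moduleOrder {
		transport = tracer{name: name, seen: &seen}.AdaptTransport(transport)
	}
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		return "", err
	}
	resp, err := transport.RoundTrip(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return strings.Join(seen, " -> "), nil
}

func main() {
	order, err := callOrder([]string{"proxy", "ratelimit", "cache", "headers"})
	if err != nil {
		panic(err)
	}
	fmt.Println(order) // headers -> cache -> ratelimit -> proxy -> transport
}
```

Moving "headers" to the front of the list would make it the innermost wrapper instead, so it would see the request last, after proxy, ratelimit and cache.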

Hope that makes sense 🙏

rafiramadhana commented 9 months ago

The last module ends up outermost in the onion-like call chain, so it gets to mangle the HTTP request first.

Hope that makes sense 🙏

i see, thanks for the pointers