ulixee / secret-agent

The web scraper that's nearly impossible to block - now called @ulixee/hero
https://secretagent.dev
MIT License

set different proxies for different tabs? #92

Closed Nisthar closed 3 years ago

Nisthar commented 3 years ago

Is it possible to set different proxies for different tabs using SecretAgent?

blakebyrnes commented 3 years ago

Hi @Nisthar, you can set a different proxy for every "Session" in SecretAgent, i.e., every time you call new SecretAgent(). Each call to UpstreamProxy.acquireProxyUrlFn includes the session of the URL that gets passed in.

Quick questions: 1) Are you using one SecretAgent instance and opening multiple tabs on it that you would consider to each be its own "scrape"? 2) If not, can you help explain why you want a different proxy for each tab?

Nisthar commented 3 years ago

@blakebyrnes I normally use Puppeteer for scraping. It doesn't support a separate proxy for each tab; the proxy has to be set through command-line arguments when a browser instance starts. It would save a lot of CPU resources if you could set a proxy for each tab.

blakebyrnes commented 3 years ago

Ok, that helps. Secret Agent shares a single Chrome instance underneath, and each "session" (i.e., new SecretAgent()) is actually just an incognito window. "Tabs" are our name for individual tabs within each session. If you click on a link that opens a new target within the same "scrape", it opens a new tab within the same incognito window. Our intent is that you would create a new SecretAgent instance for the next scrape.

If you create a new Secret Agent, it will open a new Incognito window that can use its own proxy if you set it. So you should be getting the savings you're looking for out of the box.

We're still experimenting with the best way to convey this with the API. Have you experimented with the code, or are you just reading the docs at the moment?

Nisthar commented 3 years ago

@blakebyrnes I am just reading the docs. It looks amazing. I would love to see some example code for the proxy setting.

blakebyrnes commented 3 years ago

Thanks! We definitely need to add an example for the proxy settings. Thanks for catching that.

Essentially, all you do is provide the proxy you want to use for each request or session. If you wanted to track by session, you'd get your sessionId out of the SecretAgent instance.

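// Imports as used elsewhere in this thread.
const SecretAgent = require('secret-agent');
const { UpstreamProxy } = require('@secret-agent/mitm');

const secretAgent = new SecretAgent();
const session1 = await secretAgent.sessionId;
// Return the proxy URL to use for requests that belong to this session.
UpstreamProxy.acquireProxyUrlFn = async (request) => {
  if (request.sessionId === session1) {
    return `https://yourproxy.com:3233`;
  }
  // Requests from any other session fall through and return undefined.
};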

Presumably, you'd make the function smart enough to handle whatever you need to do to route your various sessions/scrapes.
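
One way such a routing function could look is sketched below. It assumes, as in the snippet above, that every request passed to the callback carries a sessionId. createProxyRouter and the proxy URLs are hypothetical names used for illustration; they are not part of SecretAgent.

```javascript
// Hypothetical helper: builds a callback that pins each session to one proxy.
// Assumes each request object carries a `sessionId`, as in the snippet above.
function createProxyRouter(proxyUrls) {
  const assigned = new Map(); // sessionId -> proxy URL
  let next = 0;

  return async function acquireProxyUrl(request) {
    if (!assigned.has(request.sessionId)) {
      // First request from this session: hand out the next proxy, round-robin.
      assigned.set(request.sessionId, proxyUrls[next % proxyUrls.length]);
      next += 1;
    }
    // Later requests from the same session keep the same proxy.
    return assigned.get(request.sessionId);
  };
}

// Placeholder proxy URLs; substitute your own.
const route = createProxyRouter([
  'http://proxy-a.example:8080',
  'http://proxy-b.example:8080',
]);
```

Under these assumptions, the returned function has the same shape as the callback assigned to UpstreamProxy.acquireProxyUrlFn above, so each session keeps one IP for its whole lifetime.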

Nisthar commented 3 years ago

Quoting @blakebyrnes: "If not, can you help explain why you want a different proxy for each tab?"

@blakebyrnes Doesn't setting proxies per tab save more resources than setting them on separate sessions, or does it not make much difference (assuming you want to open many tabs/sessions)?

blakebyrnes commented 3 years ago

I wouldn't think it would make much difference, but it depends on what your proxy is doing. There's probably a bit of CPU spent on the HTTPS connect if your proxy is remote and runs over HTTPS, but I can't think of anything else that would spike CPU.

The idea of tabs/sessions is to sandbox what looks to a site like a "user". So I wouldn't think you'd want the IP to rotate between tabs that occur in a single user session.

Nisthar commented 3 years ago

@blakebyrnes I tried the code you posted for the upstream proxy, but it's not changing the IP for me. I tried opening https://whatismyipaddress.com/ after the code, and it still shows my original IP address.

blakebyrnes commented 3 years ago

Can you share a code snippet?

Nisthar commented 3 years ago

@blakebyrnes This is my code:

const SecretAgent = require('secret-agent');
process.env.SHOW_BROWSER = 'true';
process.env.SA_REPLAY_DEBUG = '1';
const BASE_URL = 'https://whatismyipaddress.com/';
const { UpstreamProxy } = require('@secret-agent/mitm');

const better_sqlite3 = require('better-sqlite3');
let isStarted = false;
let agent = null;

async function start() {
  try {
    agent = await new SecretAgent({ humanEmulatorId: 'basic' });
    const session1 = await agent.sessionId;
    // Return the proxy URL for requests that belong to this session.
    UpstreamProxy.acquireProxyUrlFn = async (request) => {
      if (request.sessionId === session1) {
        return `http://134.209.222.174:8080`;
      }
    };
    await agent.goto(BASE_URL);
    isStarted = true;
  } catch (e) {
    console.log(e);
  }
}

blakebyrnes commented 3 years ago

@Nisthar Ok, cool. Thanks for the example. I'm seeing this too. I think we're gonna simplify how this works.

blakebyrnes commented 3 years ago

@Nisthar I pushed a new version with a simplified model for configuring proxies. Also added support for socks5 proxies. Let me know if you have better luck with this version. For what it's worth, I've found the state of the "free" http and socks5 proxies to be pretty lacking.
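
Because free proxy lists tend to contain malformed entries, a quick sanity check on each proxy URL before handing it to an agent can be useful. The sketch below is not part of SecretAgent: isSupportedProxyUrl is a hypothetical helper, and the accepted schemes are an assumption based on the proxy types mentioned in this thread (http, https, and socks5).

```javascript
// Assumed set of schemes, taken from the proxy types mentioned in this thread.
const SUPPORTED_PROTOCOLS = new Set(['http:', 'https:', 'socks5:']);

// Hypothetical helper: true when the string parses as a URL with a supported
// scheme and a non-empty host. It does not check that the proxy is reachable.
function isSupportedProxyUrl(proxyUrl) {
  try {
    const { protocol, hostname } = new URL(proxyUrl);
    return SUPPORTED_PROTOCOLS.has(protocol) && hostname.length > 0;
  } catch (err) {
    // new URL() throws on strings that are not valid absolute URLs.
    return false;
  }
}
```

For example, isSupportedProxyUrl('socks5://127.0.0.1:1080') passes this check, while a bare host:port string without a scheme does not.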

xTRiM commented 3 years ago

@blakebyrnes Am I getting it right that now we only have an option to set proxy for a new browser and can't change them for each request? (dropped mitm)

const agent = await new SecretAgent({
   upstreamProxyUrl: `http://127.0.0.1:8000`
});

blakebyrnes commented 3 years ago

@xTRiM We're still trying to find the right structure to communicate "what" a new SecretAgent is. Sorry if I'm repeating things you already understand, but a SecretAgent instance isn't like launching a whole instance of Chrome. It's more analogous to an Incognito Window on top of a shared Chrome instance underneath.

So this new approach DOES limit you from changing the proxy for every url that comes through in a single "session". Our expectation was that you would not want to switch IPs during a single session - the equivalent of a "user" interacting with a site.

All that said, please share your use case for wanting to swap an IP per request and we will try to accommodate!
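
The one-proxy-per-session model can be sketched as a loop that creates a fresh agent for each scrape. This is only an illustration: scrapeEachWithOwnProxy is a hypothetical helper, the agent class is passed in as a parameter rather than imported, and the use of close() to release each session is an assumption. Only the upstreamProxyUrl option and goto() come from the snippets in this thread.

```javascript
// Sketch: one agent (one incognito-style session) per scrape, each with its own proxy.
// `AgentClass` would be the SecretAgent class; it is injected so the sketch
// does not depend on a particular package version.
async function scrapeEachWithOwnProxy(AgentClass, jobs) {
  const visited = [];
  for (const { url, proxyUrl } of jobs) {
    // The proxy is fixed for the lifetime of this agent/session.
    const agent = await new AgentClass({ upstreamProxyUrl: proxyUrl });
    try {
      await agent.goto(url);
      visited.push({ url, proxyUrl });
    } finally {
      // Assumption: the agent exposes close() to release its session.
      await agent.close();
    }
  }
  return visited;
}
```

Each job gets its own session and therefore its own IP, while the shared Chrome instance underneath is reused.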

xTRiM commented 3 years ago

@blakebyrnes I did not understand that; I thought it handled browser instances similarly to Puppeteer. I'm happy that I was wrong :)

The current simplified approach makes total sense and works great. I also can't imagine why I would change proxies inside a single "session".

P.S. I've just stumbled upon SecretAgent yesterday while searching for a mitm solution to change proxies in Puppeteer without reopening the browser. Now I'm replacing Puppeteer with SecretAgent. Mind blowing work, thank you @blakebyrnes and @calebjclark!

Nisthar commented 3 years ago

Thanks. Appreciate it. I started using puppeteer with a proxy-chain server. I might switch to secret agent now.

Nisthar commented 3 years ago

@blakebyrnes It's working now 👍
I have been reading the docs, and I found that you can use the Tab.Request function to make requests. Is it possible to set a proxy on this function?

blakebyrnes commented 3 years ago

Hi @Nisthar, that Request object is identical to the DOM Request object. It will inherit the proxy that you set for a SecretAgent object, as it really lives inside the webpage you load, which becomes the "context" of your requests (i.e., referer, user agent, etc.).

So... yes, but it's still at a SecretAgent level