ulixee / secret-agent

The web scraper that's nearly impossible to block - now called @ulixee/hero
https://secretagent.dev
MIT License

How to add particular header to every request? #137

Closed dmzebrov closed 3 years ago

dmzebrov commented 3 years ago

I want to add the cache-control and pragma headers, both with the value 'no-cache', to every request. My target site (street-beat.ru) seems to detect that I am using a proxy, so I can't get through to the actual page. I suspect that the proxy caching requests may be the problem.

dmzebrov commented 3 years ago

I've added these headers to header.json in /emulate-chrome-83, but it doesn't help...

blakebyrnes commented 3 years ago

Hi @dmzebrov, we don't have that capability quite yet, but it sounds like something we should add. Would you expect to add the headers to every single HTTP request, or would you want the ability to add them for certain URLs/domains?

Can you tell me more about the proxy setup you have? Are you using the same IP for multiple sessions? How are you figuring out that they're detecting your proxy?

dmzebrov commented 3 years ago

@blakebyrnes, thank you for your quick answer. In my case it would be sufficient to add headers only to requests for certain URLs. The ability to add headers for certain URLs/domains is more flexible, but I can imagine situations where adding headers to every single HTTP request is needed.

At the moment I am using a Luminati proxy setup with Data Center proxies and LPM. I have also tried proxies from other providers.

Presently, I am trying to get past the Variti security protection that is used on street-beat.ru, brandshop.ru, and some other sites. So yes, while testing I do sometimes use the same IP for different sessions, but at a pretty low request rate. From time to time I change the IP, but it has no effect.
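For illustration, the per-URL/per-domain option discussed here could look roughly like the sketch below. `HeaderRule`, `matchesDomain`, and `applyHeaderRules` are hypothetical names invented for this example; none of them are part of SecretAgent.

```typescript
// Hypothetical rule shape: add `headers` to requests whose hostname matches `domain`.
interface HeaderRule {
  // Exact hostname, or "*.example.com" to match a domain and its subdomains.
  // Omit to match every request.
  domain?: string;
  headers: Record<string, string>;
}

function matchesDomain(hostname: string, pattern?: string): boolean {
  if (pattern === undefined) return true; // no pattern: applies to every request
  if (pattern.startsWith("*.")) {
    const base = pattern.slice(2);
    return hostname === base || hostname.endsWith("." + base);
  }
  return hostname === pattern;
}

// Returns a copy of `headers` with every matching rule's headers merged in.
// Header names are treated case-sensitively here; a real implementation
// would probably normalize them first.
function applyHeaderRules(
  url: string,
  headers: Record<string, string>,
  rules: HeaderRule[],
): Record<string, string> {
  const hostname = new URL(url).hostname;
  const result: Record<string, string> = { ...headers };
  for (const rule of rules) {
    if (matchesDomain(hostname, rule.domain)) {
      Object.assign(result, rule.headers);
    }
  }
  return result;
}

const rules: HeaderRule[] = [
  {
    domain: "*.street-beat.ru",
    headers: { "cache-control": "no-cache", pragma: "no-cache" },
  },
];
```

Leaving `domain` out of a rule covers the "every single request" case with the same mechanism.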

I am not 100% sure that they are detecting the proxy itself rather than just blocking requests from certain IP addresses, but without upstreamProxy all requests pass through without any problems. I've also tried setting the proxy system-wide and got the same result: the page doesn't load. I've already tried ~15 different proxies from different providers. Maybe 2 of these 15 (from Luminati, Static Residential) worked for a single page load and then stopped working.

The aforementioned sites first load a static HTML page containing only a JS script with hardcoded JSEncrypt public and private keys, which are decoded, and the result of the decoding is sent in cookies with the next request. If all goes well, the requested page is loaded. Since the JSEncrypt keys should be different on every request, I assume the problem could come from the proxy caching responses for these seemingly static HTML files.
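One rough way to test this caching hypothesis, assuming (as described above) that the challenge page embeds different keys on every request: load the page several times through the proxy, save the response bodies, and compare them. The sketch below only does the comparison; `sha256` and `looksCached` are hypothetical helper names, and identical bodies would be consistent with caching rather than proof of it.

```typescript
import { createHash } from "crypto";

// Hex-encoded SHA-256 of a response body.
function sha256(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// True when at least two bodies were collected and all of them are identical.
// If the keys really change per request, identical bodies hint at a cache
// somewhere between the client and the site.
function looksCached(bodies: string[]): boolean {
  if (bodies.length < 2) return false;
  const first = sha256(bodies[0]);
  return bodies.every((body) => sha256(body) === first);
}
```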

blakebyrnes commented 3 years ago

@dmzebrov thanks for the feedback. We'll add interception to an upcoming release (not sure when yet).

Your scenario seems somewhat surprising. For what it's worth, we have an internal cache that is doing roughly what you're hoping for - if you're digging into the code, it happens in mitm/handlers/CacheHandler. It seems like this should be breaking your tests with no proxy if it's a root cause of their detection.

If you're digging deep for answers, you can look at the database for a session (in: ${os.tmpdir()}/.secret-agent/sessions/${sessionId}). In the Resources table, you can see if particular requests are using the artificial cache using the usedArtificialCache column.

NOTE: the structure of the database is not considered "public api" and may change. Just FYI.
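As a small sketch of where to look, the location above can be built like this. `sessionDbLocation` is a hypothetical helper name. The query assumes the database can be opened with a SQL client; only the `Resources` table and `usedArtificialCache` column names come from the comment above, and, as noted, they may change.

```typescript
import * as os from "os";
import * as path from "path";

// Builds the location described above: ${os.tmpdir()}/.secret-agent/sessions/${sessionId}
function sessionDbLocation(sessionId: string): string {
  return path.join(os.tmpdir(), ".secret-agent", "sessions", sessionId);
}

// Assumed query: lists the resources that were served from the artificial cache.
const artificialCacheQuery =
  "SELECT * FROM Resources WHERE usedArtificialCache = 1";
```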

dmzebrov commented 3 years ago

@blakebyrnes thank you for your assistance! It seems like caching was not the root of my problem.

I checked whether these headers affect request behaviour by adding them to requests with LPM, and as it turned out, they don't. I also checked the usedArtificialCache column for these requests, as you recommended; its value is 0 in my case.

Nonetheless, enabling developers to add headers to requests seems like a good idea.

blakebyrnes commented 3 years ago

This is now possible using a Plugin and intercepting the beforeHttpRequest method (https://secretagent.dev/docs/plugins/core-plugins). For now, I think this is our recommended approach. If you wanted to be able to customize headers from the client, you would add a client plugin too.
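A minimal sketch of the idea follows. Only the `beforeHttpRequest` hook name comes from the comment above; the `RequestLike` shape, the class name, and the `id` value are assumptions made so the example is self-contained. A real plugin would follow the base class and registration steps described in the linked plugin docs.

```typescript
// Assumed, simplified shape of the object handed to the hook.
interface RequestLike {
  url: string;
  requestHeaders: Record<string, string>;
}

// Hypothetical plugin-like class: sets no-cache headers on every request.
class NoCacheHeadersPlugin {
  static readonly id = "no-cache-headers"; // hypothetical identifier

  beforeHttpRequest(request: RequestLike): void {
    request.requestHeaders["cache-control"] = "no-cache";
    request.requestHeaders["pragma"] = "no-cache";
  }
}

// Usage: mutate an outgoing request's headers in place.
const request: RequestLike = {
  url: "https://street-beat.ru/",
  requestHeaders: { accept: "text/html" },
};
new NoCacheHeadersPlugin().beforeHttpRequest(request);
```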