scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

How can I crawl a URL using a custom proxy #160

Closed fivesmallq closed 9 years ago

fivesmallq commented 9 years ago

I looked through the documentation and found that passing proxy parameters directly is not supported.

For example:

curl 'http://localhost:8050/render.html?url=http://xxx.com/a.html&proxy=localhost:8080'

I want to crawl http://xxx.com/a.html directly through the proxy at localhost:8080. Would you consider adding such a parameter, or supporting a special HTTP header for this?

In our application, the proxy service is maintained independently and changes constantly, so I cannot hardcode it in a file. I'd like to set the proxy configuration directly each time I crawl a URL.

kmike commented 9 years ago

Hi @fivesmallq,

It makes sense to allow passing proxy information directly. We implemented Proxy Profiles when Splash only supported passing arguments via GET requests - passing big whitelists and blacklists via GET is ugly and unreliable because of URL length limits and lack of common format for nested data in GET parameters. Now as parameters can be passed via application/json POST requests we can look at it again; this feature is missing, +1 to implement it.

In the meantime I wonder if you can just create a proxy profile for each possible proxy (without a whitelist/blacklist) and pass its name using proxy argument.
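A sketch of that workaround (the profile name, proxy address, and profiles directory below are made up for illustration): each `.ini` file in the directory passed to Splash via `--proxy-profiles-path` becomes a profile named after the file, selectable with the `proxy` argument.

```ini
; /etc/splash/proxy-profiles/myproxy.ini
[proxy]
host=localhost
port=8080
; optional credentials:
; username=user
; password=pass
```

Then select it per request by profile name:

```shell
curl 'http://localhost:8050/render.html?url=http://example.com/a.html&proxy=myproxy'
```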

microhello commented 9 years ago

It would be simpler to use if the proxy server's ip:port could be passed as a request parameter, +1.

kmike commented 9 years ago

Not exactly what you're proposing, but there is now splash:on_request() method.
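A minimal sketch of that approach (the proxy host and port are placeholders): inside a Lua script sent to the /execute endpoint, splash:on_request registers a callback that can set a proxy on each outgoing request before it is sent.

```lua
function main(splash, args)
    -- apply a per-request proxy to every request this script makes
    splash:on_request(function(request)
        request:set_proxy{host="localhost", port=8080}
    end)
    assert(splash:go(args.url))
    return splash:html()
end
```

Since the script is submitted with each crawl, the proxy can be chosen at request time, which covers the constantly-changing-proxy case from the original report.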