Can't use proxy tools like Scrapoxy

Monkleys commented 5 years ago

Scrapoxy accepts requests over plain HTTP, but the HTTPS URL must be in the Location header. Source: http://docs.scrapoxy.io/en/master/advanced/understand/index.html#can-scrapoxy-relay-https-requests

go-colly doesn't seem to support this: if the URL is HTTPS and the only proxy available is HTTP, go-colly seems to skip the proxy entirely. I've tested it, and it works perfectly when the website is plain HTTP.

Would there be any way to get around this?
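
For reference, the request shape described above can be sketched with the Go standard library alone. This is a minimal sketch, not Scrapoxy's or colly's API: `newNoTunnelRequest` is a hypothetical helper, and it assumes the proxy expects the target's path on the request line and the full HTTPS URL in `Location`, as the linked docs describe. The `httptest` server only stands in for the proxy so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// newNoTunnelRequest (hypothetical helper) builds a plain-HTTP GET addressed
// to the proxy itself and carries the real HTTPS target in the Location header.
func newNoTunnelRequest(proxyBase, target string) (*http.Request, error) {
	t, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	if t.Scheme != "https" {
		return nil, fmt.Errorf("target %q is not an HTTPS URL", target)
	}
	p, err := url.Parse(proxyBase)
	if err != nil {
		return nil, err
	}
	// Keep the target's path and query; scheme and host are the proxy's.
	p.Path = t.Path
	p.RawQuery = t.RawQuery

	req, err := http.NewRequest(http.MethodGet, p.String(), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Location", target)
	return req, nil
}

func main() {
	// Stand-in for the proxy: it just echoes the Location header it received.
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "proxy would fetch: %s", r.Header.Get("Location"))
	}))
	defer fake.Close()

	req, err := newNoTunnelRequest(fake.URL, "https://example.com/index.html")
	if err != nil {
		panic(err)
	}
	resp, err := fake.Client().Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

In colly, the same header could presumably be set from an `OnRequest` callback with `r.Headers.Set("Location", ...)` while visiting the proxy's HTTP address, but I haven't verified that against Scrapoxy.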

alaaelgndy commented 2 years ago

Have you tried SOCKS4/5 to solve this problem?

fabienvauchelles commented 5 months ago

If you are interested, Scrapoxy 4 is out:

Scrapoxy is an open-source proxy aggregator that lets you manage all your proxies in one place 🎯, rather than spreading them across multiple scrapers 🕸️.

Smartly designed for efficient traffic routing 🔀, Scrapoxy minimizes #bans and boosts success rates 🚀.

The tech stack is built on the latest Node.js and TypeScript, using the NestJS and Angular frameworks.

Here are the key features:

☁️ Cloud Providers with easy installation: Scrapoxy supports many cloud providers like AWS, Azure, or GCP.
🌐 Proxy Services: Scrapoxy supports many proxy services like Rayobyte, IPRoyal or Zyte.
💻 Hardware: Scrapoxy supports many types of 4G proxy farm hardware, like Proxidize or XProxy.io.
📜 Free Proxy Lists: Scrapoxy supports lists of HTTP/HTTPS proxies and SOCKS4/SOCKS5 proxies.
⏰ Timeout free: Scrapoxy only routes traffic to online proxies to avoid inactive connections.
🔄 Auto-Rotate proxies: Scrapoxy automatically changes IP addresses at regular intervals.
🏃 Auto-Scale proxies: Scrapoxy monitors incoming traffic and automatically scales the number of proxies according to your needs.
🍪 Sticky sessions on Browser: Scrapoxy keeps the same IP address for a scraping session, even for browsers.
🚨 Ban management: Scrapoxy injects the name of the proxy into the HTTP responses.
📡 Traffic interception: Scrapoxy intercepts HTTP requests/responses to modify headers, keeping consistency in your scraping stack. It can add session cookies or specific headers like user-agent.
📊 Traffic monitoring: Scrapoxy measures incoming and outgoing traffic to provide an overview of your scraping session.
🌍 Coverage monitoring: Scrapoxy displays the geographic coverage of your proxies to better understand the global distribution of your proxies.
🚀 Easy-to-use and production-ready: Scrapoxy is suitable for both beginners and experts (Kubernetes / Helm).
🔓 Open Source: And of course, Scrapoxy is open source, under the MIT license.

Check out https://scrapoxy.io/!

gocolly / colly

Can't use proxy tools like Scrapoxy #338