gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Can't use proxy tools like Scrapoxy #338

Open Monkleys opened 5 years ago

Monkleys commented 5 years ago

Scrapoxy accepts requests over HTTP, but the HTTPS URL must be sent in the Location header. Source: http://docs.scrapoxy.io/en/master/advanced/understand/index.html#can-scrapoxy-relay-https-requests

go-colly doesn't seem to support this. If the URL is HTTPS and the only proxy available is HTTP, go-colly appears to skip the proxy and connect directly. I've tested it, and proxying works perfectly when the website is plain HTTP.

Would there be any way to get around this?

alaaelgndy commented 2 years ago

Have you tried SOCKS4/5 to solve this problem?

fabienvauchelles commented 5 months ago

If you are interested, Scrapoxy 4 is out:

Scrapoxy is an open-source proxy aggregator that lets you manage all your proxies in one place 🎯, rather than spreading them across multiple scrapers 🕸️.

Smartly designed for efficient traffic routing 🔀, Scrapoxy minimizes bans and boosts success rates 🚀.

The tech stack is built on the latest Node.js and TypeScript, using the NestJS and Angular frameworks.

Here are the key features:

Check out https://scrapoxy.io/!