scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Support for socks5 proxy #747

Open cydu opened 9 years ago

cydu commented 9 years ago

Support for socks5 proxy

http://www.ietf.org/rfc/rfc1928.txt

maybe we can use https://github.com/habnabit/txsocksx 's SOCKS5Agent
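
For reference, a minimal sketch of what using txsocksx's SOCKS5Agent could look like, based on its documentation (the proxy address 127.0.0.1:1080 is just a placeholder):

# sketch only: assumes a SOCKS5 proxy listening on 127.0.0.1:1080
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from txsocksx.http import SOCKS5Agent

proxy_endpoint = TCP4ClientEndpoint(reactor, '127.0.0.1', 1080)
agent = SOCKS5Agent(reactor, proxyEndpoint=proxy_endpoint)
d = agent.request('GET', 'http://example.com/')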

pablohoffman commented 9 years ago

Here's an article about using tsocks with scrapy: http://blog.scrapinghub.com/2010/11/12/scrapy-tsocks/

Not sure SOCKS5 is something we'd want to support directly on Scrapy, since HTTP proxies are often enough. Could you elaborate on your need @cydu?

cydu commented 9 years ago

@pablohoffman Thank you for your reply. But in my case tsocks doesn't work: I have to crawl several sites using different proxies, for performance and security reasons.

Something like this:

DOWNLOAD_HANDLERS = {
    'aaa.com': 'myspider.http_proxy.HttpProxyDownloadHandler',
    'bbb.com': 'myspider.socks5_proxy.Socks5DownloadHandler',
    'ccc.com': 'myspider.no_proxy.HTTP11DownloadHandler',
} 

I have implemented Socks5DownloadHandler, but it depends on

# txsocksx is installed with: pip install txsocksx
from txsocksx.http import SOCKS5Agent

so I don't know whether it is OK to open a pull request.

Here is my Socks5DownloadHandler code: https://gist.github.com/cydu/8a4b9855c5e21423c9c5

darioguarascio commented 9 years ago

Hi @pablohoffman thanks for your awesome scrapy!

I was looking for something similar; I think it's a big gap that such a complete piece of software is missing SOCKS support.

My case is a bit different. I currently use a middleware to remotely request a proxy from a dispatcher server, then I assign the obtained proxy to the current request. Everything works fine when they are HTTP proxies, but I'm struggling to get the same behavior with SOCKS.

Imagine I have a list of SOCKS proxies and I want to use a random proxy per crawl. I didn't find a way to make this work with HTTP-to-SOCKS converters (polipo, privoxy, etc.), and I thought I could write a customized server to handle that, maybe with a special header defining which SOCKS proxy to connect to... but I think that is way too much complexity!

Here is what I have for HTTP proxies; I would like to do something very similar for SOCKS:

import base64, urllib2, json

class ProxyMiddleware(object):
    # overwrite process request
    proxyDispatcher = "http://my.proxy.dispatcher/"

    def process_request(self, request, spider):
        if spider.name == 'some-particular-spider':
            if not spider.proxy:
                response = urllib2.urlopen(self.proxyDispatcher)
                try: 
                    spider.proxy = json.loads(response.read())
                except: 
                    spider.proxy = False
            if spider.proxy:
                request.meta['proxy'] = "http://%s" % spider.proxy['host']
                request.headers['Proxy-Authorization'] = 'Basic ' + base64.encodestring(spider.proxy['auth'])

            return None

traverseda commented 9 years ago

:+1:

Can we at least get an "enhancement" tag on this?

boltgolt commented 9 years ago

Would be really useful, indeed

bufrr commented 8 years ago

Would be really really really really really really really really useful, indeed

robsonpeixoto commented 8 years ago

:+1:

robsonpeixoto commented 8 years ago

@pablohoffman There are a lot of cheap and good proxies that only support SOCKS4/5. This feature would be ridiculously useful. I want to use scrapy but I can't because of this. :cry: And I really would like to use scrapy, because it's f*** amazing.

redapple commented 8 years ago

@robsonpeixoto , I'll raise the priority of this. Thanks for the feedback!

AdolphYu commented 8 years ago

Chinese users need SOCKS proxies because of the GFW.

kmike commented 8 years ago

A PR with socks proxy support is welcome! AFAIK nobody is working on it now.

pawelmhm commented 8 years ago

This is somewhat complicated because Scrapy uses Twisted Agents in the downloader, and Twisted doesn't have a SOCKS client; there is only a SOCKS4 server. I did some research on this and there is an unfinished ticket for a SOCKS client: https://twistedmatrix.com/trac/ticket/3508

To implement socks support for Scrapy we would have to either:

  1. Use an existing SOCKS-for-Twisted library that is not an official part of Twisted. This one looks like the best one around: https://github.com/habnabit/txsocksx, but it appears to be Python 2 only :/
  2. Contribute to Twisted and add a SOCKS client to Twisted.
  3. Write a SOCKS client for Scrapy ourselves.

Which option is best?

daidoji commented 7 years ago

+1

jbagot commented 7 years ago

I have a big scrapy project with a large number of crawlers, and I use SOCKS through a SOCKS-to-HTTP converter called polipo. I have a huge number of SOCKS ports, and I only need to start one polipo instance per SOCKS port, since each polipo can be connected to one SOCKS port. Then I have a queue of polipo (HTTP) ports to pick from.

polipo socksParentProxy=localhost:$socks_port proxyPort=$polipo_port > /dev/null &

This command runs inside a loop, and that's all (roughly like the sketch below).
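
A rough sketch of that loop in Python (assuming polipo is on the PATH; the SOCKS port list and local port numbers are placeholders):

# sketch only: spawn one polipo HTTP-to-SOCKS bridge per SOCKS port
import subprocess

socks_ports = [9050, 9051, 9052]   # placeholder SOCKS ports to bridge
polipo_ports = []                  # queue of local HTTP proxy ports to hand to Scrapy

for i, socks_port in enumerate(socks_ports):
    polipo_port = 8100 + i
    subprocess.Popen(
        ['polipo',
         'socksParentProxy=localhost:%d' % socks_port,
         'proxyPort=%d' % polipo_port],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    polipo_ports.append(polipo_port)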

But I would prefer direct SOCKS support in Twisted, because polipo instances waste server memory and performance is lower than it could be.

Margular commented 7 years ago

proxychains scrapy crawl spider_name

dchrostowski commented 7 years ago

Also requesting this. +9001 internets to whoever can implement.

traverseda commented 7 years ago

Can internets be converted into currency?

https://www.bountysource.com/issues/2654811-support-for-socks5-proxy

dchrostowski commented 7 years ago

I would, but some extra research on Google led me to a few options I'm going to try first. I'm working on an automatic public proxy farm (not using Tor) which I hope will be useful to all scrapy enthusiasts, so I will likely release it as open source in lieu of bounty money, regardless of whether or not socks5 gets officially implemented/supported. I already have a nice prototype that's been running for over a year and has collected well over 350K public proxies. A significant portion of these are SOCKS proxies, and I've been using them with my crawling infrastructure, which was initially built on top of Perl. I'm in the middle of converting everything to Python though, because I just can't deal with Perl anymore, and I have picked scrapy as my crawling framework. I intend to release it primarily as a scrapy middleware, but I also designed it with modularity in mind, so it should be able to be hooked into just about anything with few headaches.

dchrostowski commented 7 years ago

Basically, you just get free proxy servers and you don't even have to think about it. It maintains itself.

bakwc commented 6 years ago

Surprised that the best Python scraping library does not have SOCKS5 proxy support in 2017. It's very sad :(

redapple commented 6 years ago

That's why we need help! Any volunteers?

traverseda commented 6 years ago

Throw some money at the bounty maybe @bakwc

https://www.bountysource.com/issues/2654811-support-for-socks5-proxy

dchrostowski commented 6 years ago

I thought I'd just share how I'm getting socks support with scrapy. Basically there are two pretty good options, DeleGate and Privoxy. I'm going to give an example of a middleware that I implemented using DeleGate which has worked great for me thus far.

DeleGate is amazingly simple and straightforward; it's basically serving as an http-to-socks bridge. In other words, you make a request to it with scrapy as if it were an http proxy and it will take care of bridging that over to the socks server. Privoxy can do this too, but it seems like DeleGate has much better documentation and possibly more functionality than Privoxy (maybe...) You can either build from source or download a pre-built binary (supports Windows, MacOS X, Linux, BSD, and Solaris). Set it up however you like so that it's on your PATH. In my Ubuntu setup I simply created a symbolic link to the binary in my /usr/bin directory. Copying it over there works too. So after it's installed, try running this in your shell:

delegated ADMIN=<whatever-you-want> RESOLV="" -Plocalhost:<localport> SERVER=http SOCKS=<socks-server-address>:<socks-server-port>

This should set up a proxy server on the local machine. A brief explanation of some of the options:

ADMIN - this can be whatever. Ideally it should be an email address to display should the DeleGate server run into a problem.

RESOLV - I forget exactly what this was doing, something to do with DNS resolution. Basically, if I didn't include this argument and set it to an empty string, I noticed I was inadvertently exposing my IP while testing against my dev server. (You may or may not need this, I suspect I needed it because I have a public DNS A record pointing to the particular machine I was testing DeleGate on)

-P[localhost]:<localport> - the address and port of the local DeleGate proxy server that will run. You can just pick an arbitrary port.

SERVER - the protocol of the local DeleGate proxy server. In this case, we want HTTP because that's what scrapy is compatible with

SOCKS - the address and port of the socks proxy server that DeleGate will "bridge" the request to.

To shut down gracefully, you can run this command in a separate window:

delegated -P[localhost]:<localport> -Fkill

Keep in mind that this is setting up a live proxy server running on localhost. While testing I was able to access the DeleGate web interface through my browser. Make sure that either your firewall is set up accordingly or you read the docs on setting up auth/security, unless you want people like me finding it and using it.

So to make this integrate nicely with scrapy, I wrote a middleware. Here's a watered down version of it:

from my_scrapy_project.util.proxy_manager import Proxy, ProxyManager
import subprocess

class CustomProxyMiddleware(object):

    @staticmethod
    def start_delegate(proxy, localport):
        cmd = 'delegated ADMIN=nobody RESOLV="" -P:%s SERVER=http TIMEOUT=con:15 SOCKS=%s:%s' % (localport, proxy.address, proxy.port)
        subprocess.Popen(cmd, shell=True)
        proxy.address = 'localhost'
        proxy.scheme = 'http'
        proxy.port = localport

        return proxy

    @staticmethod
    def stop_delegate(localport):
        cmd = 'delegated -P:%s -Fkill' % localport
        subprocess.Popen(cmd, shell=True)
        ProxyManager.release_delegate_port(localport)

    def process_request(self, request, spider):
        # For simplicity I'm not including code for Proxy or ProxyManager.  Should be self explanatory.
        proxy = Proxy(ProxyManager.get_socks_proxy_params())
        localport = ProxyManager.reserve_delegate_port()
        socks_bridge_proxy = CustomProxyMiddleware.start_delegate(proxy,localport)
        request.meta['proxy'] = socks_bridge_proxy.to_string()
        request.meta['delegate_port'] = localport

    def process_response(self, request, response, spider):
        # handle response logic here

        # check if there is a delegate instance running for this request
        if 'delegate_port' in request.meta:
            CustomProxyMiddleware.stop_delegate(request.meta['delegate_port'])
        return response

    def process_exception(self, request, exception, spider):
        # handle exceptions here; make sure any delegate instance started for
        # this request gets shut down as well
        if 'delegate_port' in request.meta:
            CustomProxyMiddleware.stop_delegate(request.meta['delegate_port'])

dchrostowski commented 6 years ago

By the way, I should mention that this was written to accommodate thousands of SOCKS proxies that my bots have found. If you have a smaller number, then it might make more sense to keep the delegate instances open and running all the time rather than allocating a port number, starting, and then stopping the instance for each request. In my real application, I'm cycling through a very large pool of proxies cached in memory, consisting of both SOCKS and HTTP proxies. I estimate the ratio to be around 1:10 socks:http, so this makes sense for my project and I'm not rapid-fire opening and closing delegate ports.
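
For that smaller-pool case, a rough sketch of the persistent variant (the Proxy-style objects mirror the placeholder helpers from the example above; ports and the rotation strategy are assumptions):

# sketch only: start one long-lived DeleGate bridge per SOCKS proxy at middleware init
import itertools
import subprocess

class PersistentSocksBridgeMiddleware(object):

    def __init__(self, socks_proxies, first_local_port=8100):
        self.bridge_urls = []
        for i, proxy in enumerate(socks_proxies):
            localport = first_local_port + i
            cmd = 'delegated ADMIN=nobody RESOLV="" -P:%s SERVER=http SOCKS=%s:%s' % (
                localport, proxy.address, proxy.port)
            subprocess.Popen(cmd, shell=True)
            self.bridge_urls.append('http://localhost:%s' % localport)
        self._rotation = itertools.cycle(self.bridge_urls)

    def process_request(self, request, spider):
        # rotate through the already-running local bridges instead of
        # starting and stopping a delegate instance per request
        request.meta['proxy'] = next(self._rotation)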

pablohoffman commented 6 years ago

thanks for sharing @dchrostowski , that's worth a blog post! :)

tigercbc commented 6 years ago

works for me, thanks! @cydu

percy507 commented 6 years ago

use privoxy tool

see: https://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/

barravi commented 5 years ago

Hello. I'd like to post my 2 cents here.

All of the proposed workarounds fall short because software like Privoxy does not support authenticated SOCKS proxies. This might seem like an edge case, but VPN providers such as IPVanish offer a proxy service with their VPN subscription that is authenticated and SOCKS-only. It is very convenient, as it allows a crawler to spoof its IP from different countries around the world, but it's not supported out of the box by scrapy.

So far I have tried Privoxy, Polipo, Ncat and whatever else I could stumble upon to try to set up an HTTP-to-SOCKS proxy that would authenticate the proxy connection, without any luck.

The one solution I found so far that works is to use Scrapinghub's Splash with a proxy profile set up. Splash actually supports authenticated SOCKS. However, it would be nice to have out-of-the-box support for SOCKS proxies the same way one has for HTTP proxies.

I found some code lying around for a download handler that supports SOCKS; I'll try to integrate it into my Scrapy project and I'll post a pull request once it works.

freddong commented 4 years ago

try https://github.com/moreati/pproxy

pproxy  -l http://:8181  -r "socks5://example.com:8030#user:pass" -vv
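
With the bridge running, Scrapy can then be pointed at the local HTTP endpoint in the usual way (port 8181 matches the command above; this is just a sketch of the idea):

# in a downloader middleware's process_request
request.meta['proxy'] = 'http://127.0.0.1:8181'
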
davidblus commented 4 years ago

(Quoting @cydu's earlier comment in full: the per-site DOWNLOAD_HANDLERS idea and the txsocksx-based Socks5DownloadHandler, gist: https://gist.github.com/cydu/8a4b9855c5e21423c9c5.)

I have tried this code, but it doesn't work for me. Maybe my SOCKS proxies are all SOCKS4. There are many ways to work around this besides the code above; I will put my workaround here, using requests[socks].

My DOWNLOADER_MIDDLEWARES:

# assumes: import requests; from scrapy.http import TextResponse (and pip install requests[socks])
def process_request(self, request, spider):
    """
    Called on every network request.
    :param request: the request object
    :param spider: the spider object
    :return:
    """
    request.meta['proxy'] = self.proxy_url

    # For SOCKS proxies, make the request with the requests library instead
    if self.proxy_url.startswith('socks'):
        url = request.url
        method = request.method
        headers = {key: request.headers[key] for key in request.headers}
        body = request.body
        cookies = request.cookies
        timeout = request.meta.get('download_timeout', 10)
        proxies = {'http': self.proxy_url,
                   'https': self.proxy_url}

        resp = requests.request(method, url,
                                data=body,
                                headers=headers,
                                cookies=cookies,
                                verify=False, timeout=timeout, proxies=proxies)
        resp.headers['content-encoding'] = None
        response = TextResponse(url=url, headers=resp.headers, body=resp.content,
                                request=request, encoding=resp.encoding)
        return response

    return None

It takes too long to crawl some web pages!!!

Then I figured out why Socks5DownloadHandler doesn't work for me. Look at this comment: https://gist.github.com/cydu/8a4b9855c5e21423c9c5#gistcomment-3312848.

asifmallik commented 3 years ago

@darthbeep and I would like to work on this issue

Gallaecio commented 3 years ago

@asifmallik No idea how @darthbeep is involved here, but feel free.

asifmallik commented 3 years ago

@Gallaecio oh @darthbeep is a friend and we wanted to work on this together

Gallaecio commented 3 years ago

I missed the “and” in the original message :facepalm:

So yes, feel free to give it a try :slightly_smiling_face:

asifmallik commented 3 years ago

We have been looking into the codebase and past attempts to integrate SOCKS5 proxies into Scrapy. Every solution we looked at that does not use some hacky method (for example by using delegate and privoxy) seems to use txsocksx. However, the problem with using txsocksx seems to be that it is Python 2 only. It seems that any method would require writing a new download handler. So, we thought of three different ways of implementing this:

  1. Copy the code from txsocksx and modernize it (to Python 3) to get a Twisted Agent for SOCKS5, and use it to write a download handler
  2. Write a Twisted Agent for SOCKS5 proxies from scratch
  3. Sidestep Twisted and synchronously resolve the request, returning a response just like in file.py

We don't think 3 is a good way of doing it, given that every other handler except the file handler uses Twisted in some way. We are not exactly sure why, but we think it has something to do with the concurrency and resource limiting that Twisted does best, and sidestepping Twisted would cause problems elsewhere.

@Gallaecio Since this seems like a pretty big project we thought we should ask for feedback before proceeding any further. What do you think? Is option 1 reasonable? Why exactly is 3 a bad idea if at all?

Gallaecio commented 3 years ago

3 is indeed a bad idea because of concurrency. In Scrapy, while some responses are being downloaded, your code can be handling one of the responses that has already been received. If you resolve a request synchronously, nothing else can happen from the moment the request starts to the moment the response is received in full, negating one of the main benefits of Scrapy. For a better understanding of concurrency, I recommend this first chapter of an introduction to Twisted.

1 seems OK, their license seems compatible with ours. 2 is also OK, of course. Depending on the amount of code, it may make sense to first create a separate library, and then use that library from Scrapy.

It seems that any method would require writing a new download handler.

I’m not an expert in this part of Scrapy, and I’m completely unfamiliar with SOCKS, but I have my doubts on whether a separate download handler for SOCKS is the way to go. So unless you are sure it’s the best way, I would try to get some feedback about that from someone more familiar with Scrapy download handlers (cc: @dangra, @kmike, @elacuesta).

jugrajsingh commented 3 years ago

Would really appreciate this feature being added to the tool. HTTP proxies are usually over-exhausted.

frenzyk commented 3 years ago

@asifmallik, as a consideration, I was thinking about the idea of using requests + txrequests instead of Scrapy requests/responses, although I don't know how hard it would be to implement with backward compatibility.

honzajavorek commented 3 years ago

@Cqxstevexw That's not how Open Source works. Comments like this don't help anyone with anything.

Cqxstevexw commented 3 years ago

@Cqxstevexw That's not how Open Source works. Comments like this don't help anyone with anything.

You are right, we should think of ways to solve it instead of complaining

theol-git commented 2 years ago

Hello everyone, after looking into how to implement this in my project, I found a temporary solution. While it is not perfect, it does get the job done. I connected to a SOCKS5 proxy using this library, by monkey patching the socket library in my run file (this will make Twisted use your custom socket, which connects through the proxy).

Sadly, I don't think it is possible to control which proxy you are using from inside spiders with this strategy (I am using an external proxy manager that handles this for me).

Hope this helps someone out
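
A minimal sketch of that monkey patch with PySocks (the proxy address 127.0.0.1:1080 is a placeholder):

# sketch only: route all new sockets through a SOCKS5 proxy via PySocks
import socket
import socks

socks.set_default_proxy(socks.SOCKS5, '127.0.0.1', 1080)
socket.socket = socks.socksocket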

milahu commented 2 years ago

monkey patching the socket library in my run file

Where? I always get DNS leaks = "DNS lookup failed" for onion domains.

rdns=True by default ...

# pip install PySocks
import socks
# signature, from help(socks.set_default_proxy):
#   set_default_proxy(proxy_type=None, addr=None, port=None, rdns=True, username=None, password=None)

pproxy  -l http://:8181  -r "socks5://example.com:8030#user:pass" -vv

As a scrapy middleware:

tor_socks5_port = 9050
http_proxy_port = 8085 # TODO find random free port
import subprocess
import shutil
import time
class TorDownloaderMiddleware(object):
    def __init__(self):
        super().__init__()
        pproxy = shutil.which("pproxy")
        print(f"starting pproxy on port {http_proxy_port}")
        args = [pproxy, "-l", f"http://127.0.0.1:{http_proxy_port}/", "-r", f"socks5://127.0.0.1:{tor_socks5_port}/"]
        subprocess.Popen(args)
        time.sleep(1) # allow to start
    def process_request(self, request, spider):
        request.meta['proxy'] = f'http://127.0.0.1:{http_proxy_port}'

It's just a little ugly, because it needs to bind a TCP port on localhost.

drunkpig commented 1 year ago

Here is my solution and it works very well.

My ENV:

Scrapy 2.6.1, Python 3.8.10, Twisted 22.4.0, txsocksx 1.15.0.2.post5

txsocksx MUST be installed from https://github.com/unk2k/txsocksx, because that fork supports Python 3.x.

root@crawler-ubuntu:~# conda activate shuziren
(shuziren) root@crawler-ubuntu:~# python --version
Python 3.8.10
(shuziren) root@crawler-ubuntu:~# pip list
Package            Version
------------------ --------------
attrs              21.4.0
Automat            20.2.0
certifi            2022.6.15
cffi               1.15.1
charset-normalizer 2.1.0
constantly         15.1.0
cryptography       37.0.4
cssselect          1.1.0
filelock           3.7.1
hyperlink          21.0.0
idna               3.3
incremental        21.3.0
itemadapter        0.6.0
itemloaders        1.0.4
jmespath           1.0.1
loguru             0.6.0
lxml               4.9.1
parsel             1.6.0
Parsley            1.3
pdfminer.six       20220524
pdfplumber         0.7.1
Pillow             9.2.0
pip                22.1.2
pproxy             2.7.8
Protego            0.2.1
pyasn1             0.4.8
pyasn1-modules     0.2.8
pycparser          2.21
PyDispatcher       2.0.5
pyOpenSSL          22.0.0
PySocks            1.7.1
queuelib           1.6.2
requests           2.28.1
requests-file      1.5.1
Scrapy             2.6.1
service-identity   21.1.0
setuptools         61.2.0
six                1.16.0
tldextract         3.3.1
Twisted            22.4.0
txsocksx           1.15.0.2.post5
typing_extensions  4.3.0
urllib3            1.26.10
vcversioner        2.16.0.0
w3lib              1.22.0
Wand               0.6.7
wheel              0.37.1
zope.interface     5.4.0

My Project:

(shuziren) root@crawler-ubuntu:~/workspace/shuziren/asos# ls -lt asos
total 62
drwxr-xr-x 3 root root  4096 Jul 17 00:04 spiders
-rwxr-xr-x 1 root root  3110 Jul 16 23:53 pipelines.py
-rwxr-xr-x 1 root root  3392 Jul 16 16:45 settings.py
-rwxr-xr-x 1 root root  2597 Jul 15 20:48 s5downloader.py
-rwxr-xr-x 1 root root  3644 Jul 15 20:48 middlewares.py
-rwxr-xr-x 1 root root  1245 Jul 15 20:48 items.py
-rwxr-xr-x 1 root root     0 Jul 15 20:48 __init__.py
-rwxr-xr-x 1 root root  3528 Jul 15 20:48 US_100_S5_20220531.txt
(shuziren) root@crawler-ubuntu:~/workspace/shuziren/asos# tail -f asos/US_100_S5_20220531.txt 
user:pass@1.2.3.4:64003
user:pass@5.6.7.8:64003
user:pass@9.10.11.12:64003

The file US_100_S5_20220531.txt contains my socks5 proxies, one per line as you see.


STEP 1

In settings.py, add the following line to tell the program where to find the SOCKS5 proxies:

PROXY_FILE = os.path.dirname(__file__) +"/US_100_S5_20220531.txt"

STEP 2

The most important part is coming now: s5downloader.py

from typing import List

from txsocksx.http import SOCKS5Agent
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
import random
from urllib.parse import urlsplit
from loguru import logger

# Ref https://txsocksx.readthedocs.io/en/latest/#txsocksx.http.SOCKS5Agent

import certifi, os

os.environ["SSL_CERT_FILE"] = certifi.where()  # if not set, you'll get an error like: certificate verify failed / twisted.python.failure.Failure OpenSSL.SSL.Error: [('STORE routines', '', 'unregistered scheme')]

class Socks5DownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        settings = spider.settings
        agent = ScrapySocks5Agent(settings, contextFactory=self._contextFactory, pool=self._pool, crawler=self._crawler)
        return agent.download_request(request)

class ScrapySocks5Agent(ScrapyAgent):
    def __init__(self, settings, **kwargs):
        """
        init proxy pool
        """
        super(ScrapySocks5Agent, self).__init__(**kwargs)
        self.__proxy_file = settings['PROXY_FILE']
        self._s5proxy_pool: List = self.__get_s5proxy_pool()

    def _get_agent(self, request, timeout):
        _, proxy_host, proxy_port, proxy_user, proxy_pass = self.__random_choose_proxy()
        proxy_user = bytes(map(ord, proxy_user))  # it's very strange, maybe it's a bug
        proxy_pass = bytes(map(ord, proxy_pass))  # it's very strange, maybe it's a bug
        proxyEndpoint = TCP4ClientEndpoint(reactor, proxy_host, proxy_port)
        agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint,
                            endpointArgs=dict(methods={'login': [proxy_user, proxy_pass]}))
        return agent

    def __get_s5proxy_pool(self) -> List:
        """
        return proxy pool
        :return:
        """
        proxy_list = []
        with open(self.__proxy_file, 'r') as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                else:
                    proxy_info = urlsplit(f"socks5://{line}")
                    schema, user, passwd, host, port = proxy_info.scheme, proxy_info.username, proxy_info.password, proxy_info.hostname, proxy_info.port
                    proxy_list.append((schema, host, port, user, passwd))

        return proxy_list

    def __random_choose_proxy(self):
        """
        schema, host, port, user, pass
        :return:
        """
        p = random.choice(self._s5proxy_pool)
        logger.info("use proxy {}", p)
        return p

STEP 3

At last, tell your spider how to use the Socks5DownloadHandler:


class MySpider(scrapy.Spider):
    name = "myname"
    allowed_domains = ["oh.com"]
    custom_settings = {
        # other configurations

        "DOWNLOAD_HANDLERS": {
            'http': 'asos.s5downloader.Socks5DownloadHandler',
            'https': 'asos.s5downloader.Socks5DownloadHandler',
        },
       # other configurations

    }

STEP 4

$ scrapy crawl your-spider

That's all. It works fine for me.

milahu commented 1 year ago

txsocksx MUST be installed from https://github.com/unk2k/txsocksx if you use Python 3.x

for alternatives see https://github.com/habnabit/txsocksx/issues/19

dchrostowski commented 1 year ago

Haha I can't believe 5 years have gone by since I posted my workaround. I'm starting up a new scrapy project so I'll give all of these new solutions a try for SOCKS proxies.

AnaKuzina commented 1 year ago

@drunkpig hi and thank you for this awesome implementation! However, my spider runs into the following error even if I just copy-paste your code into my project: Message: 'Error downloading %(request)s'

Do you have any idea why this error can occur? Thank you in advance!

drunkpig commented 1 year ago

@AnaKuzina

hi, please provide more info about your application.

AnaKuzina commented 1 year ago

@drunkpig thank you for your response!

I create a POST request to an API, extract URLs from the response, and create GET requests to these URLs.

I need to use SOCKS5 only for these URLs, because the API and the URLs are on different domains. So I've copied your code and modified DOWNLOAD_HANDLERS (I make the API requests through plain HTTP): DOWNLOAD_HANDLERS = { 'https': 'my_project.s5downloader.Socks5DownloadHandler', }

Then I ran the spider and it was able to get a response from the POST request and extract the URLs. But then on each URL I get the following error: Message: 'Error downloading %(request)s' Arguments: {'request': <GET <https://needed_url.com>

No more info.