niklasb / webkit-server

[not actively maintained] The C++ webkit-server from capybara-webkit with useful extensions and Python bindings
MIT License
48 stars 38 forks

Binding IP #1

Closed knarfytrebil closed 12 years ago

knarfytrebil commented 12 years ago

Greetings niklasb,

I have looked into the source code and wondered whether it is possible to add an IP binding feature to webkit-server, meaning: binding to a specific source IP when making outgoing HTTP requests.

Can it be done just by modifying the "NetworkAccessManager.h"?

Thanks.

niklasb commented 12 years ago

@knarfytrebil: No, you'd need to change the file src/Server.cpp (line 15). This is only for the control socket, though, and it listens on all interfaces by default, so I'm not sure why you might need this.

EDIT: Sorry, I misunderstood the question. No, neither Linux nor Windows provides a way to specify an interface for outgoing connections. You need to adapt your routes or use something like iptables to achieve this. You can also configure webkit-server to use an outgoing HTTP proxy. Why do you need this / what do you want to achieve?

knarfytrebil commented 12 years ago

Well, I happen to have access to a list of IP addresses, which I use to keep certain service providers from blocking me while I crawl their sites. I used to crawl with mechanize in Python, but since mechanize + BeautifulSoup cannot handle JavaScript (no automatic JavaScript parsing / execution), I found your package dryscrape quite useful. What it lacks is the ability to bind the HTTP requests it makes to specific IP addresses on a single Linux machine.

niklasb commented 12 years ago

@knarfytrebil: What do you mean, "access to a list of IP addresses"? How did you do it in mechanize? Also, look at my update: Maybe you can configure webkit-server to use different HTTP proxies. You could also override sendCustomRequest to use a custom socket with SO_BINDTODEVICE and try to make Qt use that somehow, but it's not gonna be pretty.

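knarfytrebil commented 12 years ago

# Python 2 (httplib / urllib2)
import copy
import httplib
import socket
import urllib2

import mechanize

# Local source address to bind to. Placeholder value: replace it with an
# address that is actually configured on this machine.
ip_addr = "192.0.2.10"

class BindableHTTPConnection(httplib.HTTPConnection):
    source_ip = None  # set by BindableHTTPConnectionFactory

    def connect(self):
        """Connect to the host and port specified in __init__."""
        self.sock = socket.socket()
        # Bind the local end first; port 0 lets the OS pick a free port.
        self.sock.bind((self.source_ip, 0))
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))

def BindableHTTPConnectionFactory(source_ip):
    # Returns a callable with the HTTPConnection constructor signature.
    def _get(host, port=None, strict=None, timeout=0):
        bhc = BindableHTTPConnection(host, port=port, strict=strict, timeout=timeout)
        bhc.source_ip = source_ip
        return bhc
    return _get

class BindableHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(BindableHTTPConnectionFactory(ip_addr), req)

class BindableBrowser(mechanize.Browser):
    """A mechanize.Browser whose plain-HTTP requests go out from ip_addr."""
    handler_classes = copy.copy(mechanize.Browser.handler_classes)
    handler_classes["http"] = BindableHTTPHandler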

This is how I do it in mechanize.
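
For comparison, the same bind-before-connect idea needs no subclass on Python 3, because `http.client.HTTPConnection` accepts a `source_address` tuple. The following is only a minimal sketch: the helper name `fetch_from` is made up for illustration, and it assumes the address passed in is configured on a local interface.

```python
import http.client


def fetch_from(source_ip, host, port=80, path="/", timeout=10.0):
    """GET http://host:port/path with the local end bound to source_ip."""
    # Port 0 lets the OS pick a free ephemeral port on that address.
    conn = http.client.HTTPConnection(
        host, port, timeout=timeout, source_address=(source_ip, 0)
    )
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

This covers only requests made from Python itself; it does not change how webkit-server opens its own connections.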

niklasb commented 12 years ago

@knarfytrebil: Extending QNetworkAccessManager to use a custom socket factory doesn't seem to be trivial. Apparently you'd have to replicate much of the HTTP logic as well. Qt has an open bug for this. It would probably be easier to write an HTTP proxy in Python and use that from webkit-server.
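
A rough sketch of what such a proxy could look like, using only the Python 3 standard library. Everything here is an assumption made for illustration rather than part of webkit-server or dryscrape: the `make_proxy` name, the header filtering, and the restriction to plain-HTTP `GET` (no `POST`, no HTTPS `CONNECT`). Upstream connections are bound to `source_ip` through the `source_address` argument of `http.client.HTTPConnection`.

```python
import http.client
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlsplit

# Headers that describe a single hop and should not be forwarded as-is.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "proxy-connection", "te", "trailers", "transfer-encoding", "upgrade",
}


def make_proxy(listen_addr, source_ip):
    """Return an HTTP proxy whose upstream connections leave from source_ip."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # A proxy request line carries the absolute URL of the target.
            url = urlsplit(self.path)
            if url.scheme != "http" or not url.hostname:
                self.send_error(400, "absolute http:// URL required")
                return
            target = url.path or "/"
            if url.query:
                target += "?" + url.query
            headers = {k: v for k, v in self.headers.items()
                       if k.lower() not in HOP_BY_HOP}
            upstream = http.client.HTTPConnection(
                url.hostname, url.port or 80, timeout=30,
                source_address=(source_ip, 0),  # bind the outgoing socket
            )
            try:
                upstream.request("GET", target, headers=headers)
                resp = upstream.getresponse()
                body = resp.read()
            except (OSError, http.client.HTTPException) as exc:
                self.send_error(502, str(exc))
                return
            finally:
                upstream.close()
            # Relay status, end-to-end headers and the fully buffered body.
            self.send_response_only(resp.status, resp.reason)
            for key, value in resp.getheaders():
                if key.lower() not in HOP_BY_HOP | {"content-length"}:
                    self.send_header(key, value)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return ThreadingHTTPServer(listen_addr, Handler)
```

Calling `make_proxy(("127.0.0.1", 8080), source_ip).serve_forever()` would start one proxy for one source address; running one instance per address and pointing webkit-server's proxy setting at the matching port is one way to rotate between them. A real crawler would also need the other HTTP methods and streaming of large bodies.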

knarfytrebil commented 12 years ago

@niklasb Thanks for the suggestion. I found this on the Qt Developer Network, which also points out that what I should probably do is load the HTML with Python and then call setContent. I will give that a try, thanks! I just wonder whether anyone has ever tried to put mechanize / BeautifulSoup / PyExecJS together.

niklasb commented 12 years ago

@knarfytrebil: webkit-server already has the SetHtml command, so you don't even need to implement this yourself. However, this will only load the initial HTML through Python; additional resources such as JavaScript or CSS will still be fetched automatically by the QNetworkAccessManager instance associated with the QWebPage.

knarfytrebil commented 12 years ago

@niklasb I see the problem, I will see to it and work out a solution.