EndOfStreamError and SocketError when visiting multiple urls during one session

antchien commented 9 years ago

I'm trying to scrape the content from a large list of urls. However, it seems to throw out EndOfStreamErrors ('Unexpected end of file') or Socket Errors ('Connection closed by peer') after going through a number of them (usually around 20-30 but it varies). I tried catching these exceptions, and telling it to just continue on w/ the rest of the list, but all remaining url visits result in more Socket Errors ('Broken Pipe').

Here's some sample dumbed down code:

import dryscrape dryscrape.start_xvfb() sess = dryscrape.Session()

list_of_urls = ['http://www.abc.com', etc...]

for url in list_of_urls: try: sess.visit(url) print url + ': ' + sess.body()[:50] except: continue

Is there something that I'm doing incorrectly causes these errors? It seems to happen pretty consistently after around the same number of urls processed, with some urls causing more problems than others.

Thanks in advance!

niklasb commented 9 years ago

Interesting, that suggests that webkit_server crashed. Is there a chance you can provide a list of URLs that I can use to reproduce the issue? Or maybe even just a single one?

antchien commented 9 years ago

Here's a few: 'http://time.com/photography/', 'http://time.com/collection/top-10-everything-of-2014/', 'http://time.com/world-trade-center/', 'http://time.com/tag/wonders-of-the-world/', 'http://time.com/space-nasa-scott-kelly-mission/', 'https://subscription.time.com/storefront/subscribe-to-time/link/1023466.html', 'http://content.time.com/time/rss', 'http://content.time.com/time/reprints']

Just testing some of these individually, it actually looks like the 'http://time.com/space-nasa-scott-kelly-mission/' crashes all the time, while some others causes problems once in a while after hitting it repeatedly.

niklasb commented 9 years ago

Thanks @antchien, I will have a look at this tomorrow

antchien commented 9 years ago

Thanks! Here's one that seems to crash consistently after hitting it a few times: http://time.com/collection/question-everything/

niklasb commented 9 years ago

Hi @antchien, unfortunately I haven't been able to reproduce this on my machine (Mac OS X). I ran your URLs in a cycle for 2 hours now, without a crash. What operating system do you use? Do you have the newest version of dryscrape installed? In case I can't reproduce this even on a Linux system (which I can try next week), it might be helpful if you can try and run webkit_server manually and connect to it, so that you can see the actual crash. To do that, open a terminal and run /usr/[local/]lib/pythonX.X/site-packages/webkit_server. In another terminal, use this script:

import socket
import webkit_server
class TcpServer(object):
    def __init__(self, port):
        self.port = port
    def connect(self):
        return socket.create_connection(('127.0.0.1', self.port))
    def kill():
        pass

# set port to the output of webkit_server
port = 53950
s = webkit_server.Client(webkit_server.ServerConnection(TcpServer(port)))
# visit the crashing URLs here
s.visit('http://time.com/photography/')

And then hopefully you will see the crash. If there is a segfault, having a core dump and your webkit_server binary would of course be extermely helpful.

Best, Niklas

antchien commented 9 years ago

Thanks Niklas! I'm using ubuntu (a Rackspace instance). Started the webkit_server with /usr/local/lib/python2.7/dist-packages/webkit_server -platform offscreen

After trying to visit a url, the webkit_server crashes, throwing the following error: QFontDatabase: Cannot find font directory /usr/lib/x86_64-linux-gnu/fonts - is Qt installed correctly? Aborted

I'm guessing the webkit didn't install correctly? Although this is crashing it for any url, not just the troubled ones.

niklasb commented 9 years ago

Hi @antchien,

why do you run it with -platform offscreen? That's not supported, I think I tried that once in order to avoid using xvfb, but it doesn't work. Even if it doesn't crash, the rendering is messed up for me. So I suggest you use xvfb instead.

Best, Niklas

vdraceil commented 8 years ago

@niklasb , I tried running the quick script you provided and yes, the webkit_server seems to crash. I'm getting a Segmentation Fault for the URL 'http://time.com/photography/'.

>>>/usr/local/lib/python2.7/dist-packages/webkit_server                        
Capybara-webkit server started, listening on port: 40031
Segmentation fault (core dumped)

I'm using Ubuntu 14.04.

Actually, I'm encountering this kind of crash for multiple URLs -

niklasb / dryscrape

EndOfStreamError and SocketError when visiting multiple urls during one session #35