Problem in serializing the Node.

bibiboot commented 12 years ago

I am trying to serialize a node and store it in Redis (I am crawling a website and using Redis as a queue), but the Node class, webkit-server.webserver.Node, does not define the functions __getstate__ and __setstate__, so serialization fails. (I learned this from Stack Overflow and also tested it on my own custom class.)

I am using the following code.

    import pickle
    pickle.dumps(node)

This is more of a question than an issue. I couldn't find your email address, so I am contacting you here. Please advise what the __getstate__ and __setstate__ functions should look like.

I am also pasting sample code that worked once __getstate__ and __setstate__ were added: http://pastebin.com/xK20DRyD

niklasb commented 12 years ago

Hello, and thanks for taking the time to write this up. The issue tracker is absolutely the right place to ask: issues can be bug reports, feature requests or general questions.

Now, the Node class is just a thin wrapper around a node inside the current DOM of the webkit_server, which is an external process. A Node object basically just contains a numerical ID to identify the node and a reference to the open socket connection to the server. Thus, the state of a node depends on the internal browser state of the webkit_server, which is not reproducible, let alone serializable.

So unless you can share a socket with your worker threads/processes, there is no sensible way to serialize a node. If sharing the socket is an option, you'll have to handle synchronization by yourself, as webkit_server obviously is not thread-safe (it wouldn't make sense to be thread-safe either, seeing that webkit allows no parallel processing).

If you really want to go down that road (I don't see why you would, you could just as well design your program as a sequential algorithm), I strongly discourage implementing __getstate__ and __setstate__, as they cannot be implemented sensibly for the Node class. Instead, you should write a custom function that transforms a node into whatever serializable representation you need.
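A minimal sketch of such a custom function, assuming the node exposes text() and get_attr(name) accessors (an assumption here; check the Node class in your webkit_server version). The names node_to_record and dumps_node are hypothetical. The point is that only plain data is copied out, so the result no longer depends on the socket or the browser state:

```python
import pickle


def node_to_record(node, attrs=("href",)):
    """Copy the data we care about out of a live node into a plain dict.

    Assumes `node` offers text() and get_attr(name); adjust to the
    accessors your Node class actually provides.
    """
    return {
        "text": node.text(),
        "attrs": {name: node.get_attr(name) for name in attrs},
    }


def dumps_node(node, attrs=("href",)):
    """Serialize the plain-data record, not the Node object itself."""
    return pickle.dumps(node_to_record(node, attrs))
```

The bytes returned by dumps_node(node) can be pushed to a queue such as Redis. What comes back out is a dict, not a Node, so it cannot be clicked or queried again without revisiting the page.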

Maybe if you provide more information about what you want to achieve, I can be of more help :)

bibiboot commented 12 years ago

Thanks for explaining it so clearly. My basic purpose was to store the nodes at different points in the crawl and resume crawling from there (these points would act as restore points), thereby avoiding repeated code and execution. Anyway, I will stick to a sequential implementation.

Do you recommend any way to keep restore points during crawling?

niklasb commented 12 years ago

Now that's definitely not possible. Webkit uses mutable state, which is the logical thing to do in a real browser context (the whole idea is for dryscrape to act like a real browser). What you describe would require a browser engine based on immutable state or copy-on-write-like techniques. I know of no such engine, and even if one existed, it would probably be a lot slower and less compatible than webkit.

So what you probably want to do is to write an explicit restore procedure yourself. Maybe it's as simple as loading a URL, maybe you also need to click some buttons afterwards. What you can also try is to save the page HTML and current URL as strings and call set_html in your restore procedure. Both of those approaches would have the advantage that you can reproduce the same state in multiple sessions. Remember that you can have several distinct sessions, so you can even parallelize the scraping process this way.
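Both approaches could be sketched roughly as follows. This assumes the session object offers url(), body(), visit(url) and set_html(html, url); verify these against your dryscrape version. RestorePoint, capture and restore are hypothetical names used only for illustration:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RestorePoint:
    """Plain strings only, so it can be stored in a queue or on disk."""
    url: str
    html: str

    def dumps(self):
        return json.dumps(asdict(self))

    @classmethod
    def loads(cls, raw):
        return cls(**json.loads(raw))


def capture(session):
    # Assumed accessors: session.url() and session.body().
    return RestorePoint(url=session.url(), html=session.body())


def restore(session, point, reload=True):
    if reload:
        # Approach 1: load the URL again (add clicks afterwards if needed).
        session.visit(point.url)
    else:
        # Approach 2: inject the saved HTML without hitting the network.
        session.set_html(point.html, point.url)
```

Because a RestorePoint is just two strings, the same point can be restored into any number of independent sessions.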

bibiboot commented 12 years ago

You wrote, "Remember that you can have several distinct sessions, so you can even parallelize the scraping process this way", but you also said that a webkit-server session is not thread-safe. Following your guidance, I am now storing the URL and the cookie as the restore point, but please clarify how parallelism works with webkit-server.

It would be a great help, as the scraping is very slow.

bibiboot commented 12 years ago

I have acted on your suggestion and use the URL + cookie as the restore point, but if I run two instances of the client code in parallel, the object is shared between them. Please suggest what I am missing here.

niklasb commented 12 years ago

Sorry, I think I don't understand the question. What I meant by Webkit not being thread-safe is that you can't access a single session from multiple threads. Of course you can have multiple threads with each one using its own session. That's like having several browser windows open.
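A minimal sketch of the one-session-per-thread pattern, assuming make_session is a factory that returns a fresh session (for example dryscrape.Session) and that a session offers visit(url) and body(). The helper names fetch and crawl are hypothetical:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each worker thread keeps its own session here; nothing is shared.
_local = threading.local()


def fetch(url, make_session):
    session = getattr(_local, "session", None)
    if session is None:
        # First use in this thread: create a session owned by this thread.
        session = _local.session = make_session()
    session.visit(url)
    return url, session.body()


def crawl(urls, make_session, workers=2):
    """Fetch all URLs, using at most `workers` threads and sessions."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda u: fetch(u, make_session), urls))
```

If two workers appear to share an object, the usual cause is a session created once at module level and reused by every thread; creating it inside the worker, as above, avoids that.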

niklasb / dryscrape

Problem in serializing the Node. #9