postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Multithreading #26

Open ethicalhack3r opened 13 years ago

ethicalhack3r commented 13 years ago

Hi there,

I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x is able to do so?

I had a look through the source but couldn't find where the spidr gem makes its HTTP requests.

Maybe something like Typhoeus can be used?! (http://rubygems.org/gems/typhoeus)

Thanks, Ryan

postmodern commented 13 years ago

This is possible, but difficult. The main problem is a race condition between the url/page callbacks and the requesting of pages: a callback could modify the filtering rules while another thread is requesting a page that suddenly becomes unwanted. The second problem is that Spidr currently uses persistent HTTP connections, so I'm unsure how much multi-threading would improve performance. We've been looking at alternative HTTP libraries, but they all have various pros/cons.
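
For illustration, a contrived sketch of that race (not Spidr's internals): a worker thread checks a shared list of ignore rules and then fetches, while a callback in the main thread adds a rule in between, so the decision can already be stale by the time the request goes out.

```ruby
ignore_rules = []           # shared filtering rules (contrived example)
rules_lock   = Mutex.new

worker = Thread.new do
  url    = 'http://example.com/private/data'
  wanted = rules_lock.synchronize { ignore_rules.none? { |rule| url.match?(rule) } }
  sleep 0.1                 # time passes between the check and the request
  puts "requested #{url}" if wanted   # may fire even though the rule below now forbids it
end

# Meanwhile, a url/page callback in the main thread adds a new rule.
rules_lock.synchronize { ignore_rules << %r{/private/} }
worker.join
```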

ethicalhack3r commented 13 years ago

Thanks for the quick response. I don't know too much about multi-threading, but maybe some fixed number of persistent HTTP connections could be opened?!

Either way seems like a difficult task to achieve.

nirvdrum commented 12 years ago

If you decide to go with it, I'd give Celluloid a look. Alas, it is Ruby 1.9 only due to its use of fibers. But it's a pretty nice library.

postmodern commented 12 years ago

I'm considering switching to net-http-persistent, using a thread pool for requests, with mutexes around adding filters.
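
Roughly along these lines (illustrative names only, assuming net-http-persistent 3.x's keyword constructor; the gem keeps its connections per thread, so a single instance can be shared across the pool):

```ruby
require 'net/http/persistent'

http         = Net::HTTP::Persistent.new(name: 'spidr')  # one connection per thread
filters      = []
filters_lock = Mutex.new

# Callbacks add new filters only while holding the lock...
add_filter = ->(rule) { filters_lock.synchronize { filters << rule } }

# ...and request threads read them under the same lock before fetching.
workers = 4.times.map do
  Thread.new do
    uri  = URI('http://example.com/')
    skip = filters_lock.synchronize { filters.any? { |rule| uri.to_s.match?(rule) } }
    puts http.request(uri).code unless skip
  end
end

add_filter.call(%r{/logout})
workers.each(&:join)
http.shutdown
```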

grrowl commented 10 years ago

+1, this seems to be the best spider/crawling library out there, and this would be a great feature.

dadamschi commented 8 years ago

What happened with this request?

postmodern commented 8 years ago

I don't have the time currently to work on such a large feature.

ZeroChaos- commented 7 years ago

It's been a year; any chance you have time to work on such a feature now? :-)

fuzzygroup commented 7 years ago

I've written more than a crawler or two in my career, and if you didn't make it multi-threaded from the start, it is damn hard to retrofit. That said, I think the overall goal here is throughput rather than threads. If the discovered URLs can be surfaced to an external queue (Redis or SQS), that changes the equation: rather than adding threads, you simply run more instances (or containers) of Spidr and let the queue distribute the work across N copies.

Thoughts?
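
For what it's worth, a rough sketch of that shape, assuming the redis gem (rpush/blpop) and Spidr's Spidr.site, every_page, and Page#urls (assumed here to return the absolute URLs found on a page); the 'spidr:sites' queue key is made up:

```ruby
require 'uri'
require 'redis'
require 'spidr'

redis = Redis.new
QUEUE = 'spidr:sites' # hypothetical queue key

# Seed the queue, then run any number of these workers as separate
# processes or containers; Redis hands each site to exactly one worker.
redis.rpush(QUEUE, 'http://example.com/')

loop do
  _key, site = redis.blpop(QUEUE, timeout: 30)
  break unless site

  host = URI(site).host

  Spidr.site(site) do |spider|
    spider.every_page do |page|
      # Surface links that leave the current host back onto the shared
      # queue so another Spidr instance can pick them up.
      page.urls.each do |found|
        redis.rpush(QUEUE, found.to_s) if found.host && found.host != host
      end
    end
  end
end
```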

postmodern commented 7 years ago

A distributed Spidr is a little out of scope, or at least further down the road.

Multi-threading here is mainly to address blocking I/O while waiting on responses to come back from the HTTP sessions. Luckily, net-http-persistent is already thread aware. We'd just need to replace the spidering loop with a producer/consumer thread pool. Each thread would have its own session cache via net-http-persistent, would dequeue URLs, and would enqueue the responses/Pages. All additional logic with headers and parsing HTML would still be done in the main thread, to avoid additional Mutex complexity. There are probably other pieces of work and locking issues hidden in the details.
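
Something like this minimal producer/consumer shape, assuming net-http-persistent 3.x; the worker count, queue handling, and :done sentinel are illustrative rather than Spidr's actual design:

```ruby
require 'net/http/persistent'

url_queue      = Queue.new   # main thread enqueues URLs to fetch
response_queue = Queue.new   # workers enqueue [url, response] pairs

workers = 4.times.map do
  Thread.new do
    # Each worker keeps its own persistent-connection cache.
    http = Net::HTTP::Persistent.new(name: 'spidr-worker')
    while (url = url_queue.pop) != :done
      response_queue << [url, http.request(url)]
    end
    http.shutdown
  end
end

# Producer side: the main thread seeds URLs to crawl...
url_queue << URI('http://example.com/')

# ...and consumes responses here, so header handling and HTML parsing
# stay in the main thread and the filter rules need no extra locking.
url, response = response_queue.pop
puts "#{response.code} #{url}"

workers.size.times { url_queue << :done }
workers.each(&:join)
```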

vwochnik commented 6 years ago

+1, a producer/consumer for the requests would be awesome! I really like the interface of your library by the way.

dadamschi commented 6 years ago

I don't understand "a producer/consumer for the requests"...

vwochnik commented 6 years ago

I mean a producer/consumer pattern where a pool of worker threads that do the requesting is connected to the main thread with queues, like an assembly line.

The main thread puts every request it wants resolved into a queue; any worker thread can pick a task from that queue, perform the request, and put the result into a finished-responses queue that is read by the main thread. That way the main thread does no requesting, i.e. no blocking activity, itself, which leads to a speedup.
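
A bare-bones version of that assembly line using only the standard library (placeholder URLs and worker count):

```ruby
require 'net/http'

requests  = Queue.new   # main thread -> workers
responses = Queue.new   # workers -> main thread

workers = 3.times.map do
  Thread.new do
    while (uri = requests.pop) != :stop
      responses << [uri, Net::HTTP.get_response(uri)]
    end
  end
end

urls = %w[http://example.com/a http://example.com/b http://example.com/c]
urls.each { |u| requests << URI(u) }

# The main thread only consumes finished responses; the blocking network
# I/O happens in the workers.
urls.size.times do
  uri, res = responses.pop
  puts "#{res.code} #{uri}"
end

workers.size.times { requests << :stop }
workers.each(&:join)
```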