Open tmaier opened 10 years ago
Answering in random order :stuck_out_tongue:
why does
#enqueue
require a queue as attribute? Would it not be possible to use@internal_queue
?
Should be absolutely safe to use @internal_queue
here since Redis client is thread safe.
Also, being #enqueue a private method, the change should not break any code out of there.
We could also get rid of the use of @http_pool
and @queues_pool
since each thread will use its local instance [1]
I found urls added by
#add_url
(manually or at the beginning of#takeover
) are not tracked byurl_tracker
I added add_url
method as shorthand to enqueue an url regardless its state into the tracker.
The tracker is mainly used as a fast way to determine whether an url is already enqueued or it is about to be enqueued by another thread since there is no an easy fast and memory efficient way to enumerate elements stored into a Redis LinkediList (it is a O(N) operation).
I ran into some cases where having the queue pre-seeded by some external process, not necessarily part of a crawling session, with a large number of urls was useful.
I found useful, for instance, re-seed the queue with some buckets of urls just to keep the stored data selectively up to date. Hence I would keep add_url
as is, as an easy way to enqueue an url skipping the track part.
Though you are right, seeded urls should be tracked by the tracker!
The p_seeded
hack will guarantee pages used as initial seed will be re-evaluated but yes, those pages should be added to the tracker.
We may add yield(page) if block_given?
into the enqueue
method and change add_url to enqueue
[1]https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L108-L110
I found urls added by
#add_url
(manually or at the beginning of#takeover
) are not tracked byurl_tracker
This commit adds
url_tracker.visit
to#add_url
My question is, what is after this change the difference between
#add_url
and the private method#enqueue
?I have a second question regarding
#enqueue
.why does
#enqueue
require a queue as attribute? Would it not be possible to use@internal_queue
?