taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

Track url added by #add_url #33

Open tmaier opened 10 years ago

tmaier commented 10 years ago

I found that URLs added by #add_url (manually, or at the beginning of #takeover) are not tracked by the url_tracker.

This commit adds url_tracker.visit to #add_url
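For reference, a rough sketch of what the patched method could look like; the method body here is illustrative rather than the actual code in lib/polipus.rb, the key point being the added url_tracker.visit call:

```ruby
# Illustrative only: enqueue the URL and also record it in the tracker.
def add_url(url)
  url_tracker.visit(url.to_s)     # the proposed addition
  page = Page.new(url.to_s)
  @internal_queue << page.to_json
end
```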

My question is: after this change, what is the difference between #add_url and the private method #enqueue?

I have a second question regarding #enqueue.

`def enqueue(url_to_visit, current_page, queue)`

Why does #enqueue require a queue as an argument? Would it not be possible to use @internal_queue?

coveralls commented 10 years ago

Coverage Status

Coverage decreased (-7.16%) when pulling 5180dac2e02f181f1b84bed4edfcef4db0f3f07d on tmaier:patch-1 into 7de6421fb07025d76fa03127dc2c44c1443e90ea on taganaka:master.

coveralls commented 10 years ago

Coverage Status

Coverage decreased (-0.21%) when pulling ad02f753c4e24ff1de5687c35e2f3b24839d38c8 on tmaier:patch-1 into 7de6421fb07025d76fa03127dc2c44c1443e90ea on taganaka:master.

taganaka commented 10 years ago

Answering in random order :stuck_out_tongue:

> Why does #enqueue require a queue as an argument? Would it not be possible to use @internal_queue?

It should be absolutely safe to use @internal_queue here, since the Redis client is thread safe. Also, since #enqueue is a private method, the change should not break any code outside of it.
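Concretely, the signature could drop the extra argument and push to @internal_queue directly. A minimal sketch, with illustrative Page attributes:

```ruby
# Sketch only: #enqueue without the queue argument.
def enqueue(url_to_visit, current_page)
  page_to_visit = Page.new(url_to_visit.to_s,
                           referer: current_page.url.to_s,
                           depth:   current_page.depth + 1)
  @internal_queue << page_to_visit.to_json
  url_tracker.visit(url_to_visit.to_s)
end
```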

We could also get rid of @http_pool and @queues_pool, since each thread will use its own local instance [1].
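As a hypothetical illustration of that point (not the actual polipus code), each worker thread could simply memoize its own clients on Thread.current instead of drawing them from shared pools:

```ruby
# Placeholder names throughout; whatever builds the HTTP client and the
# Redis-backed queue would be memoized per thread.
def http_client
  Thread.current[:polipus_http] ||= Polipus::HTTP.new(@options)
end

def thread_queue
  Thread.current[:polipus_queue] ||= queue_factory
end
```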

> I found that URLs added by #add_url (manually, or at the beginning of #takeover) are not tracked by the url_tracker.

I added the add_url method as a shorthand to enqueue a URL regardless of its state in the tracker.

The tracker is mainly used as a fast way to determine whether a URL is already enqueued, or is about to be enqueued by another thread, since there is no easy, fast, and memory-efficient way to enumerate the elements stored in a Redis linked list (it is an O(N) operation).
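To illustrate the difference (not polipus code; the queue key and the url_tracker instance are assumed): checking membership in the Redis list means walking the whole list, while the tracker answers in roughly constant time and memory.

```ruby
require 'redis'

url   = 'http://example.com/some/page'
redis = Redis.new

# O(N): enumerate every element of the list backing the queue
# ('polipus_queue:demo' is a made-up key; the match is a rough string check)
in_queue = redis.lrange('polipus_queue:demo', 0, -1)
                .any? { |serialized_page| serialized_page.include?(url) }

# Fast membership test via the tracker (e.g. a Bloom filter),
# assuming url_tracker exposes a #visited? predicate
already_seen = url_tracker.visited?(url)
```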

I ran into some cases where it was useful to have the queue pre-seeded with a large number of URLs by some external process, not necessarily part of a crawling session. I found it useful, for instance, to re-seed the queue with some buckets of URLs just to keep the stored data selectively up to date. Hence I would keep add_url as is: an easy way to enqueue a URL while skipping the tracking part.

Though you are right: seeded URLs should be tracked by the tracker! The p_seeded hack guarantees that pages used as the initial seed will be re-evaluated, but yes, those pages should be added to the tracker. We could add `yield(page) if block_given?` to the #enqueue method and change #add_url to call #enqueue.
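A sketch of that refactor (method bodies are illustrative, and referer/depth handling is omitted): #add_url delegates to #enqueue, so seeded URLs are tracked too, while a block can still customize the page, e.g. to flag it as a seed.

```ruby
# Illustrative refactor: #enqueue yields the page when a block is given
# and always records the URL in the tracker; #add_url just delegates.
def enqueue(url_to_visit, current_page = nil)
  page = Page.new(url_to_visit.to_s)
  yield(page) if block_given?        # e.g. mark the page as a seed
  url_tracker.visit(page.url.to_s)   # seeded URLs are now tracked as well
  @internal_queue << page.to_json
end

def add_url(url, &block)
  enqueue(url, &block)
end
```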

[1]https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L108-L110