Closed hamishtaplin closed 8 years ago
Correct! Thanks.
Sorry, actually it is the correct behavior.
The “bucket” is defined by the first part of the URL. This way we have a rough queue classification, but it brings correct results. However, your patch implies having one bucket per URL, which is not correct. A bucket is a queue of a subset of URLs to compute. Each bucket runs in parallel.
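A minimal sketch of the bucketing described above, i.e. deriving the bucket name from the first path segment of a URL. The function name and the `'__root__'` fallback are illustrative assumptions, not the project's actual code:

```javascript
// Hypothetical sketch: the bucket name is the first path segment of the URL.
// `bucketNameFor` and the '__root__' fallback are illustrative names only.
function bucketNameFor(url) {
  const uriPath = new URL(url).pathname;              // e.g. "/folder/subfolder"
  const segments = uriPath.split('/').filter(Boolean); // ["folder", "subfolder"]
  return segments.length > 0 ? segments[0] : '__root__';
}

console.log(bucketNameFor('https://example.org/folder/subfolder')); // "folder"
console.log(bucketNameFor('https://example.org/folder'));           // "folder"
```

Under this scheme, every URL below `/folder` maps to the same bucket, which is what the discussion below is about.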
So I'm not familiar with your implementation of buckets, but the upshot is you can't crawl any site where the URLs go more than one folder deep. Is that something that can be addressed?
@hamishtaplin Actually you can. The `urlQueueName` is just the bucket's name.
Are you sure? It does not work at all for me, as it builds the queue from that piece of string processing. Any subfolders get parsed out of the `urlQueueName` and therefore aren't queued. For example, the following URLs crawled:

/folder
/folder/subfolder
/folder/subfolder2
/folder/subfolder3

will just result in a `urlQueueName` of "folder" generated four times; the latter three will be ignored as they have already been queued.
Have you tested this?
URLs are queued on a specific queue. The queue is determined by the `urlQueueName`. So if you have `/foo/bar` and `/foo/baz`, they will be queued in the queue named `foo`. They will be computed if the maximum URL limit allows it.
That makes sense, but in practice only the top level gets queued in my experience. I'll see if I can put together a reduced test case and open an issue if I can reproduce it for you.
Ok I did so and it's working fine. However, in my actual application this is definitely not the case.
I've confirmed that the pages are being crawled but the pages do not get tested, I assumed this was because they weren't being queued, will investigate further.
@hamishtaplin Maybe you are reaching the maximum number of URLs to compute. How many URLs do you have to crawl? There are different strategies, and I am slowly coming to think we should provide several algorithms instead of one to rule them all.
Ok, I think I've isolated the issue. For some reason it doesn't work if you try to crawl from a subdirectory. I'm convinced there is some issue in the queuing system somewhere, but I can't spend any more time trying to debug it as I have other things to do.
`urlQueueName` was being limited to a single folder; just using the `uriPath` as a key fixes it.
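A sketch of the fix described above: keying the queue on the full `uriPath` rather than only its first segment, so `/folder` and `/folder/subfolder` get distinct queue names. The function name and example URLs are illustrative assumptions, not the project's actual code:

```javascript
// Hypothetical sketch of the fix: use the whole uriPath as the queue key,
// so subfolder URLs are no longer collapsed into one bucket.
function urlQueueNameFor(url) {
  return new URL(url).pathname; // e.g. "/folder/subfolder"
}

// With the first-segment scheme, all four URLs collapsed to one key ("folder");
// keyed on the full path they produce four distinct queue names.
const queueNames = new Set();
for (const url of [
  'https://example.org/folder',
  'https://example.org/folder/subfolder',
  'https://example.org/folder/subfolder2',
  'https://example.org/folder/subfolder3',
]) {
  queueNames.add(urlQueueNameFor(url));
}

console.log(queueNames.size); // 4
```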