Closed hamishtaplin closed 8 years ago
Correct! Thanks.
Sorry, actually it is the correct behavior.
The “bucket” is defined by the first part of the URL. This way we have a rough queue classification, but it brings correct results. However, your patch implies having one bucket per URL, which is not correct. A bucket is a queue of a subset of URLs to compute. Each bucket runs in parallel.
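A minimal sketch of the bucketing described above, i.e. deriving the bucket name from the first path segment of a URL. The function name and the `'__root__'` fallback are illustrative assumptions, not the project's actual code:

```javascript
// Hypothetical sketch: the bucket name is the first path segment of the URL.
// `bucketNameFor` and the '__root__' fallback are illustrative names only.
function bucketNameFor(url) {
  const uriPath = new URL(url).pathname;              // e.g. "/folder/subfolder"
  const segments = uriPath.split('/').filter(Boolean); // ["folder", "subfolder"]
  return segments.length > 0 ? segments[0] : '__root__';
}

console.log(bucketNameFor('https://example.org/folder/subfolder')); // "folder"
console.log(bucketNameFor('https://example.org/folder'));           // "folder"
```

Under this scheme, every URL below `/folder` maps to the same bucket, which is what the discussion below is about.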
So I'm not familiar with your implementation of buckets, but the upshot is you can't crawl any site where the URLs go more than one folder deep. Is that something that can be addressed?
@hamishtaplin Actually you can. The `urlQueueName` is just the bucket's name.
Are you sure? It does not work at all for me, as it builds the queue from that piece of string processing. Any subfolders get parsed out of the `urlQueueName` and therefore aren't queued. For example, the following URLs crawled:

/folder
/folder/subfolder
/folder/subfolder2
/folder/subfolder3

will just result in a `urlQueueName` of "folder" generated four times; the latter three will be ignored as they have already been queued.
Have you tested this?
URLs are queued on a specific queue. The queue is determined by the `urlQueueName`. So if you have `/foo/bar` and `/foo/baz`, they will be queued in the queue named `foo`. They will be computed if the maximum URL limit allows it.
That makes sense, but in practice only the top level gets queued in my experience. I'll see if I can put together a reduced test case and open an issue if I can reproduce it for you.
Ok I did so and it's working fine. However, in my actual application this is definitely not the case.
I've confirmed that the pages are being crawled but the pages do not get tested, I assumed this was because they weren't being queued, will investigate further.
@hamishtaplin Maybe you are reaching the maximum number of URLs to compute. How many URLs do you have to crawl? There are different strategies, and I am slowly coming to think we should provide several algorithms instead of one to rule them all.
Ok, I think I've isolated the issue. For some reason it doesn't work if you try to crawl from a subdirectory. I'm convinced there is some issue in the queuing system somewhere, but I can't spend any more time trying to debug it as I have other things to do.
`urlQueueName` was being limited to a single folder; just using the `uriPath` as a key fixes it.
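A sketch of the fix described above: keying the queue on the full `uriPath` rather than only its first segment, so `/folder` and `/folder/subfolder` get distinct queue names. The function name and example URLs are illustrative assumptions, not the project's actual code:

```javascript
// Hypothetical sketch of the fix: use the whole uriPath as the queue key,
// so subfolder URLs are no longer collapsed into one bucket.
function urlQueueNameFor(url) {
  return new URL(url).pathname; // e.g. "/folder/subfolder"
}

// With the first-segment scheme, all four URLs collapsed to one key ("folder");
// keyed on the full path they produce four distinct queue names.
const queueNames = new Set();
for (const url of [
  'https://example.org/folder',
  'https://example.org/folder/subfolder',
  'https://example.org/folder/subfolder2',
  'https://example.org/folder/subfolder3',
]) {
  queueNames.add(urlQueueNameFor(url));
}

console.log(queueNames.size); // 4
```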