Closed jeanmonod closed 8 years ago
Hmm, this is going to be very difficult. We must define the algorithm first. One clever approach might be to sort by "difference"/delta:
The more the page is different from others, the more it bubbles up in the queue.
To compute the delta between 2 pages, we can use the URL or the content length. Thoughts?
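A rough sketch of the delta idea, assuming we compare URL path segments (content length would work the same way once we have it). Function names here are illustrative, not from the project:

```python
from urllib.parse import urlparse


def url_delta(url_a: str, url_b: str) -> float:
    """Fraction of path segments that differ between two URLs (0.0 = identical)."""
    seg_a = [s for s in urlparse(url_a).path.split("/") if s]
    seg_b = [s for s in urlparse(url_b).path.split("/") if s]
    length = max(len(seg_a), len(seg_b), 1)
    common = sum(1 for a, b in zip(seg_a, seg_b) if a == b)
    return 1.0 - common / length


def priority(candidate: str, crawled: list[str]) -> float:
    """A page bubbles up when it differs most from everything crawled so far."""
    if not crawled:
        return 1.0
    return min(url_delta(candidate, seen) for seen in crawled)
```

With this scoring, after crawling /product/a and /product/b, a page like /contact (score 1.0) would be picked before /product/c (score 0.5).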
The goal of this issue is to quickly explore a web application. It makes sense when you have a web application with thousands of pages.
We already have the URL. Computing on page size could be great, but that would require an HTTP HEAD call, which could slow down the process...
I am pretty sure we already have this information since we are already filtering by content-type.
As discussed today, one possible implementation could be in three parts:
Introduce a notion of URL buckets in the application. Each bucket contains a list of URLs to parse. The buckets are fed by the crawler. When a URL is finished, take a URL from the next bucket. Once we reach the last bucket, start from the first one again.
By default, the buckets are filled based on the first segment of each URL path. To clarify, with your example, we would have these buckets:
contact : /contact
faq : /faq
product : /product/a, /product/b, /product/c
home: /home
person: /person/1, /person/2, /person/3
This is pretty basic but should suffice in most cases.
As an option, the user could provide a list of regexes to fill the buckets: each regex maps to a bucket, and when a URL matches a regex it goes into the corresponding bucket. This should give enough flexibility for all cases.
Later, if needed, we could implement recursive buckets, where each bucket can contain other buckets, but that seems like overkill right now.
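The bucket scheme above could be sketched roughly like this. Class and method names are illustrative only, not taken from the project; the sketch covers the default first-path-segment filling, the optional (regex, bucket) overrides, and the round-robin pop:

```python
import re
from collections import OrderedDict, deque
from urllib.parse import urlparse


class BucketQueue:
    """Round-robin queue over URL buckets keyed by the first path segment,
    with an optional user-supplied list of (regex, bucket) overrides."""

    def __init__(self, patterns=None):
        self.patterns = [(re.compile(p), name) for p, name in (patterns or [])]
        self.buckets = OrderedDict()  # bucket name -> deque of URLs
        self._order = deque()         # round-robin order of bucket names

    def bucket_for(self, url: str) -> str:
        # User-provided regexes win; otherwise fall back to the first path segment.
        for regex, name in self.patterns:
            if regex.search(url):
                return name
        segments = [s for s in urlparse(url).path.split("/") if s]
        return segments[0] if segments else "home"

    def push(self, url: str) -> None:
        name = self.bucket_for(url)
        if name not in self.buckets:
            self.buckets[name] = deque()
            self._order.append(name)
        self.buckets[name].append(url)

    def pop(self):
        # Take one URL from the next non-empty bucket, then rotate on;
        # after the last bucket we naturally start from the first again.
        for _ in range(len(self._order)):
            name = self._order[0]
            self._order.rotate(-1)
            if self.buckets[name]:
                return self.buckets[name].popleft()
        return None
```

With the example URLs pushed in order, pops would interleave the buckets (/contact, /faq, /product/a, /person/1, then back around to /product/b), so each template family is visited early.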
Any other thoughts ?
No, that's fine. Let's go!
I'm not sure how the current queue works; maybe it's just a FIFO. A website is usually composed of several pages generated from a few dynamic templates, and most of the time a category of pages shares the same template. Since the goal is to find all the a11y issues coming from those templates as fast as possible, we should be smarter about selecting the next page to process. Fortunately, most of the time the categorisation of the pages is also reflected in the URLs. So when the crawler generates a list of pages such as
We should then process them in this order
I don't know yet what formula to use, perhaps something based on similarity to already-parsed URLs, but this could be a nice addition...