Seems like something is wrong with maxDepth:
// One
$ bin/sitespeed.js https://www.sitespeed.io/ -n 1 --crawler.enable=true --crawler.maxDepth 1
[2016-04-25 21:36:14] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.20.3
[2016-04-25 21:36:15] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:36:15] Start to crawl from https://www.sitespeed.io/ with maxDepth 1
[2016-04-25 21:36:21] Finished analysing https://www.sitespeed.io/

// Two (only one page)
$ bin/sitespeed.js https://www.sitespeed.io/ -n 1 --crawler.enable=true --crawler.maxDepth 2
[2016-04-25 21:35:19] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.20.3
[2016-04-25 21:35:20] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:35:20] Start to crawl from https://www.sitespeed.io/ with maxDepth 2
[2016-04-25 21:35:26] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:35:32] Finished analysing https://www.sitespeed.io/

// THREE, now we start picking up URLs
$ bin/sitespeed.js https://www.sitespeed.io/ -n 1 --crawler.enable=true --crawler.maxDepth 3
[2016-04-25 21:35:53] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.20.3
[2016-04-25 21:35:54] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:35:54] Start to crawl from https://www.sitespeed.io/ with maxDepth 3
[2016-04-25 21:35:59] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:36:05] Starting firefox for analysing https://www.sitespeed.io/documentation/
[2016-04-25 21:36:11] Starting firefox for analysing https://www.sitespeed.io/example/
....
Looks like this could possibly be a bug in the crawler. https://www.sitespeed.io appears to redirect to https://www.sitespeed.io/, which might explain why the crawler only starts picking up URLs once maxDepth is set to 3.
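If the crawler counts the redirect hop as its own level, everything found on the landing page sits one level deeper than expected, which would produce exactly this off-by-one. A minimal sketch of that failure mode (illustrative only, not simplecrawler's actual code; fetchPage and its return shape are assumptions):

const maxDepth = 2;
const queue = [{ url: 'https://www.sitespeed.io', depth: 1 }];
const seen = new Set();

function enqueue(url, depth) {
  // Drop anything beyond maxDepth, and never queue a URL twice.
  if (depth > maxDepth || seen.has(url)) return;
  seen.add(url);
  queue.push({ url, depth });
}

async function crawl(fetchPage) {
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    const page = await fetchPage(url); // assumed: { redirectedTo?, links: [] }
    if (page.redirectedTo) {
      // The suspected bug: treating the redirect target as a child burns a
      // level, so links on the real start page land at depth 3 and get
      // dropped when maxDepth is 2.
      enqueue(page.redirectedTo, depth + 1);
      continue;
    }
    for (const link of page.links) enqueue(link, depth + 1);
  }
}

Running with --debug shows the redirect in action: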
bin/sitespeed.js https://www.sitespeed.io -n 1 --crawler.enable=true --crawler.maxDepth 2 --debug
[2016-04-27 00:01:20] Versions OS: linux 3.19.0-58-generic sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.20.3
[2016-04-27 00:01:20] {
"uuid": "f80dcb4f-6ad9-4de6-8bae-5060d2fd93b3",
"type": "url",
"timestamp": "2016-04-27T00:01:20-04:00",
"source": "url-reader",
"data": "{...}",
"url": "https://www.sitespeed.io"
}
[2016-04-27 00:01:20] Starting firefox for analysing https://www.sitespeed.io
[2016-04-27 00:01:20] Start to crawl from https://www.sitespeed.io with maxDepth 2
I just received https://www.sitespeed.io/ (26352 bytes)
It was a resource of type text/html
[2016-04-27 00:01:21] {
"uuid": "8b6a005d-dff3-43f4-9e89-1c94442421b4",
"type": "url",
"timestamp": "2016-04-27T00:01:21-04:00",
"source": "crawler",
"data": "{...}",
"url": "https://www.sitespeed.io/"
}
[2016-04-27 00:01:21] {
"uuid": "36acdf1b-8395-4104-8c54-74fc10334cf0",
"type": "url",
"timestamp": "2016-04-27T00:01:21-04:00",
"source": "crawler",
"data": "{...}",
"url": "https://www.sitespeed.io/"
}
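The two crawler messages above carry the same URL, so the same page gets queued twice. Normalizing URLs before emitting them would cover both this and the trailing-slash redirect; a rough sketch of the idea (illustrative, not the plugin's actual code; emit is an assumed callback):

const emitted = new Set();

function normalize(url) {
  const u = new URL(url);
  u.hash = ''; // a fragment never changes the fetched document
  // Note: WHATWG URL already turns "https://www.sitespeed.io" into
  // "https://www.sitespeed.io/", which is exactly the start-URL case here.
  return u.toString();
}

function emitOnce(url, emit) {
  const key = normalize(url);
  if (emitted.has(key)) return;
  emitted.add(key);
  emit({ type: 'url', source: 'crawler', url: key });
}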
Now that simplecrawler 0.7 is released, I'll push a few fixes, hopefully tomorrow.
Much better now since e52b4a8503d802e0da00c921beb5ab154d5e5815, please take it for a spin.
Looks much better. One thing: the start URL is tested twice; I'll check that later today:
$ bin/sitespeed.js https://www.sitespeed.io/ -n 1 --crawler.enable=true --crawler.maxDepth 2
[2016-05-10 12:25:59] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.22.0
[2016-05-10 12:26:00] Starting firefox for analysing https://www.sitespeed.io/ 1 time(s)
[2016-05-10 12:26:06] Starting firefox for analysing https://www.sitespeed.io/ 1 time(s)
[2016-05-10 12:26:12] Starting firefox for analysing https://www.sitespeed.io/documentation/ 1 time(s)
[2016-05-10 12:26:18] Starting firefox for analysing https://www.sitespeed.io/example/ 1 time(s)
[2016-05-10 12:26:23] Starting firefox for analysing https://www.sitespeed.io/faq/ 1 time(s)
Crawler also picks up JS & CSS files:
$ bin/sitespeed.js http://www.expressen.se -n 1 --crawler.enable=true --crawler.maxDepth 2
[2016-05-10 12:36:53] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.22.0
[2016-05-10 12:36:53] Starting firefox for analysing http://www.expressen.se 1 time(s)
[2016-05-10 12:37:13] Starting firefox for analysing http://www.expressen.se/ 1 time(s)
[2016-05-10 12:37:32] Starting firefox for analysing http://www.expressen.se/js/desktop/lte-ie9-polyfill.min__c9d3d7d92fab074e006355a6c555a3bb1.js 1 time(s)
[2016-05-10 12:37:37] Starting firefox for analysing http://www.expressen.se/stylesheets/style.desktop.min__cb1b7d623b645121fc394a7f0d9b341ea.css 1 time(s)
[2016-05-10 12:37:43] Starting firefox for analysing http://www.expressen.se/stylesheets/print.desktop.min__c00981082b8706cc326173900e8aedd7a.css 1 time(s)
[2016-05-10 12:37:49] Starting firefox for analysing http://www.expressen.se/js/desktop/advertisement__c1531551efcc41b6308ab3647b8f92c06.js 1 time(s)
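Only HTML pages are worth driving through the browser, so the crawler needs to filter on extension (before fetching) or on content type (after). A hedged sketch of such a filter; the function and parameter names here are made up for illustration:

// Illustrative: keep only pages that look like HTML.
const skipExtensions = /\.(js|css|png|jpe?g|gif|svg|ico|woff2?|pdf)(\?.*)?$/i;

function shouldAnalyse(url, contentType) {
  if (skipExtensions.test(url)) return false;
  // Once the crawler has fetched the resource, trust the content type
  // over the extension.
  if (contentType && !contentType.startsWith('text/html')) return false;
  return true;
}

// shouldAnalyse('http://www.expressen.se/stylesheets/style.desktop.min.css')
//   -> false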
Seems to work perfectly now; I need to test it some more. For expressen I get a lot of:
[2016-05-11 13:12:37] Missing time from har entry for url: http://fusion.expressen.se/bnredirscrpt.js?ads=exp
from browsertime. I'll change the log level on that; let me know if you disagree :)
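For reference, that message fires when a HAR entry has no time field; handling it defensively at verbose level looks roughly like this (a sketch, not browsertime's actual code; log is an assumed logger with a verbose level):

function sumEntryTimes(har, log) {
  let total = 0;
  for (const entry of har.log.entries) {
    if (typeof entry.time !== 'number') {
      // Verbose instead of warning: missing timings are common for
      // aborted or blocked requests and not actionable for the user.
      log.verbose('Missing time from har entry for url: ' + entry.request.url);
      continue;
    }
    total += entry.time;
  }
  return total;
}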
I think we should try to implement crawler.maxPagesToTest; that will make it all easier (and it's one of the most used features for me).
Added max pages in 315ae102e1e31573c0b12ee29a306915deae1380, however I renamed it to crawler.maxPages (to keep it slightly shorter). I'm open to changing it back.
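Conceptually maxPages is just a counter that refuses further pages once the limit is reached; something along these lines (a sketch of the idea, not the actual commit):

function makePageLimiter(maxPages) {
  let count = 0;
  return function allowPage() {
    // maxPages <= 0 means "no limit".
    if (maxPages > 0 && count >= maxPages) return false;
    count += 1;
    return true;
  };
}

const allowPage = makePageLimiter(2);
// In the crawler plugin: only emit a URL for analysis while allowPage()
// returns true, and stop the crawl once it turns false.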
Cool, did you get it to work? When I try, it takes the first X URLs and then just exits:
$ bin/sitespeed.js https://www.sitespeed.io -n 1 --crawler.enable=true --crawler.maxDepth 2 --crawler.maxPages 2
[2016-05-13 21:28:58] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.15 coach: 0.22.1
[2016-05-13 21:28:58] Starting firefox for analysing https://www.sitespeed.io 1 time(s)
[2016-05-13 21:29:04] Starting firefox for analysing https://www.sitespeed.io/ 1 time(s)
peter at hoppla in ~/git/sitespeed.io on 4.0*
It worked, sort of… Found and fixed an issue with 1779d75. Btw, crawler.maxDepth is actually crawler.depth. Also, --crawler.enable=true is not needed; it's just a hack to enable the crawler (or any non-default plugin) without explicitly specifying any options. Specifying --crawler.maxPages is enough to make the plugin load.
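To spell out that loading rule: a non-default plugin gets pulled in as soon as any option under its namespace appears on the command line. Roughly (an illustration of the idea, not sitespeed.io's actual loader; the default plugin list is made up):

const defaultPlugins = ['browsertime', 'coach', 'html'];

function shouldLoad(pluginName, options) {
  if (defaultPlugins.includes(pluginName)) return true;
  // yargs turns "--crawler.maxPages 2" into options.crawler = { maxPages: 2 },
  // so the mere presence of the key enables the plugin.
  return options[pluginName] !== undefined;
}

// shouldLoad('crawler', { crawler: { maxPages: 2 } }) -> true
// shouldLoad('crawler', {}) -> false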
Cool, great!
Think this is ok, let's close this for alpha1 and open a new one if we find something.
We haven't put much love into testing crawling; let's test that before alpha1.