Seems like something is wrong with maxDepth:
// One
$ bin/sitespeed.js https://www.sitespeed.io/ -n 1 --crawler.enable=true --crawler.maxDepth 1
[2016-04-25 21:36:14] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.20.3
[2016-04-25 21:36:15] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:36:15] Start to crawl from https://www.sitespeed.io/ with maxDepth 1
[2016-04-25 21:36:21] Finished analysing https://www.sitespeed.io/

// Two (only one page)
$ bin/sitespeed.js https://www.sitespeed.io/ -n 1 --crawler.enable=true --crawler.maxDepth 2
[2016-04-25 21:35:19] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.20.3
[2016-04-25 21:35:20] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:35:20] Start to crawl from https://www.sitespeed.io/ with maxDepth 2
[2016-04-25 21:35:26] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:35:32] Finished analysing https://www.sitespeed.io/

// THREE, now we start picking up URLs
$ bin/sitespeed.js https://www.sitespeed.io/ -n 1 --crawler.enable=true --crawler.maxDepth 3
[2016-04-25 21:35:53] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.20.3
[2016-04-25 21:35:54] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:35:54] Start to crawl from https://www.sitespeed.io/ with maxDepth 3
[2016-04-25 21:35:59] Starting firefox for analysing https://www.sitespeed.io/
[2016-04-25 21:36:05] Starting firefox for analysing https://www.sitespeed.io/documentation/
[2016-04-25 21:36:11] Starting firefox for analysing https://www.sitespeed.io/example/
....
Looks like this could possibly be a bug in the crawler. https://www.sitespeed.io appears to redirect to https://www.sitespeed.io/, which might explain why the crawler only starts picking up URLs once maxDepth is set to 3.
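If the crawler counts the redirect hop as its own level, everything found on the landing page sits one level deeper than expected, which would produce exactly this off-by-one. A minimal sketch of that failure mode (illustrative only, not simplecrawler's actual code; fetchPage and its return shape are assumptions):

const maxDepth = 2;
const queue = [{ url: 'https://www.sitespeed.io', depth: 1 }];
const seen = new Set();

function enqueue(url, depth) {
  // Drop anything beyond maxDepth, and never queue a URL twice.
  if (depth > maxDepth || seen.has(url)) return;
  seen.add(url);
  queue.push({ url, depth });
}

async function crawl(fetchPage) {
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    const page = await fetchPage(url); // assumed: { redirectedTo?, links: [] }
    if (page.redirectedTo) {
      // The suspected bug: treating the redirect target as a child burns a
      // level, so links on the real start page land at depth 3 and get
      // dropped when maxDepth is 2.
      enqueue(page.redirectedTo, depth + 1);
      continue;
    }
    for (const link of page.links) enqueue(link, depth + 1);
  }
}

Running with --debug shows the redirect in action: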
bin/sitespeed.js https://www.sitespeed.io -n 1 --crawler.enable=true --crawler.maxDepth 2 --debug
[2016-04-27 00:01:20] Versions OS: linux 3.19.0-58-generic sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.20.3
[2016-04-27 00:01:20] {
"uuid": "f80dcb4f-6ad9-4de6-8bae-5060d2fd93b3",
"type": "url",
"timestamp": "2016-04-27T00:01:20-04:00",
"source": "url-reader",
"data": "{...}",
"url": "https://www.sitespeed.io"
}
[2016-04-27 00:01:20] Starting firefox for analysing https://www.sitespeed.io
[2016-04-27 00:01:20] Start to crawl from https://www.sitespeed.io with maxDepth 2
I just received https://www.sitespeed.io/ (26352 bytes)
It was a resource of type text/html
[2016-04-27 00:01:21] {
"uuid": "8b6a005d-dff3-43f4-9e89-1c94442421b4",
"type": "url",
"timestamp": "2016-04-27T00:01:21-04:00",
"source": "crawler",
"data": "{...}",
"url": "https://www.sitespeed.io/"
}
[2016-04-27 00:01:21] {
"uuid": "36acdf1b-8395-4104-8c54-74fc10334cf0",
"type": "url",
"timestamp": "2016-04-27T00:01:21-04:00",
"source": "crawler",
"data": "{...}",
"url": "https://www.sitespeed.io/"
}
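The two crawler messages above carry the same URL, so the same page gets queued twice. Normalizing URLs before emitting them would cover both this and the trailing-slash redirect; a rough sketch of the idea (illustrative, not the plugin's actual code; emit is an assumed callback):

const emitted = new Set();

function normalize(url) {
  const u = new URL(url);
  u.hash = ''; // a fragment never changes the fetched document
  // Note: WHATWG URL already turns "https://www.sitespeed.io" into
  // "https://www.sitespeed.io/", which is exactly the start-URL case here.
  return u.toString();
}

function emitOnce(url, emit) {
  const key = normalize(url);
  if (emitted.has(key)) return;
  emitted.add(key);
  emit({ type: 'url', source: 'crawler', url: key });
}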
Now that simplecrawler 0.7 is released, I'll push a few fixes, hopefully tomorrow.
Much better now since e52b4a8503d802e0da00c921beb5ab154d5e5815, please take it for a spin.
Looks much better. One thing: the start URL is tested twice; I'll check that later today:
$ bin/sitespeed.js https://www.sitespeed.io/ -n 1 --crawler.enable=true --crawler.maxDepth 2
[2016-05-10 12:25:59] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.22.0
[2016-05-10 12:26:00] Starting firefox for analysing https://www.sitespeed.io/ 1 time(s)
[2016-05-10 12:26:06] Starting firefox for analysing https://www.sitespeed.io/ 1 time(s)
[2016-05-10 12:26:12] Starting firefox for analysing https://www.sitespeed.io/documentation/ 1 time(s)
[2016-05-10 12:26:18] Starting firefox for analysing https://www.sitespeed.io/example/ 1 time(s)
[2016-05-10 12:26:23] Starting firefox for analysing https://www.sitespeed.io/faq/ 1 time(s)
Crawler also picks up JS & CSS files:
$ bin/sitespeed.js http://www.expressen.se -n 1 --crawler.enable=true --crawler.maxDepth 2
[2016-05-10 12:36:53] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.9 coach: 0.22.0
[2016-05-10 12:36:53] Starting firefox for analysing http://www.expressen.se 1 time(s)
[2016-05-10 12:37:13] Starting firefox for analysing http://www.expressen.se/ 1 time(s)
[2016-05-10 12:37:32] Starting firefox for analysing http://www.expressen.se/js/desktop/lte-ie9-polyfill.min__c9d3d7d92fab074e006355a6c555a3bb1.js 1 time(s)
[2016-05-10 12:37:37] Starting firefox for analysing http://www.expressen.se/stylesheets/style.desktop.min__cb1b7d623b645121fc394a7f0d9b341ea.css 1 time(s)
[2016-05-10 12:37:43] Starting firefox for analysing http://www.expressen.se/stylesheets/print.desktop.min__c00981082b8706cc326173900e8aedd7a.css 1 time(s)
[2016-05-10 12:37:49] Starting firefox for analysing http://www.expressen.se/js/desktop/advertisement__c1531551efcc41b6308ab3647b8f92c06.js 1 time(s)
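Only HTML pages are worth driving through the browser, so the crawler needs to filter on extension (before fetching) or on content type (after). A hedged sketch of such a filter; the function and parameter names here are made up for illustration:

// Illustrative: keep only pages that look like HTML.
const skipExtensions = /\.(js|css|png|jpe?g|gif|svg|ico|woff2?|pdf)(\?.*)?$/i;

function shouldAnalyse(url, contentType) {
  if (skipExtensions.test(url)) return false;
  // Once the crawler has fetched the resource, trust the content type
  // over the extension.
  if (contentType && !contentType.startsWith('text/html')) return false;
  return true;
}

// shouldAnalyse('http://www.expressen.se/stylesheets/style.desktop.min.css')
//   -> false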
Seems to work perfectly now; I need to test it some more. For expressen I get a lot of:
[2016-05-11 13:12:37] Missing time from har entry for url: http://fusion.expressen.se/bnredirscrpt.js?ads=exp
from browsertime. I'll change the log level on that; let me know if you disagree :)
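For reference, that message fires when a HAR entry has no time field; handling it defensively at verbose level looks roughly like this (a sketch, not browsertime's actual code; log is an assumed logger with a verbose level):

function sumEntryTimes(har, log) {
  let total = 0;
  for (const entry of har.log.entries) {
    if (typeof entry.time !== 'number') {
      // Verbose instead of warning: missing timings are common for
      // aborted or blocked requests and not actionable for the user.
      log.verbose('Missing time from har entry for url: ' + entry.request.url);
      continue;
    }
    total += entry.time;
  }
  return total;
}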
I think we should try to implement crawler.maxPagesToTest; that will make it all easier (and it's one of the most used features for me).
Added max pages in 315ae102e1e31573c0b12ee29a306915deae1380, however I renamed it to crawler.maxPages (to keep it slightly shorter). I'm open to changing it back.
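Conceptually maxPages is just a counter that refuses further pages once the limit is reached; something along these lines (a sketch of the idea, not the actual commit):

function makePageLimiter(maxPages) {
  let count = 0;
  return function allowPage() {
    // maxPages <= 0 means "no limit".
    if (maxPages > 0 && count >= maxPages) return false;
    count += 1;
    return true;
  };
}

const allowPage = makePageLimiter(2);
// In the crawler plugin: only emit a URL for analysis while allowPage()
// returns true, and stop the crawl once it turns false.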
Cool, did you get it to work? When I try, it takes the first X URLs and then just exits:
$ bin/sitespeed.js https://www.sitespeed.io -n 1 --crawler.enable=true --crawler.maxDepth 2 --crawler.maxPages 2
[2016-05-13 21:28:58] Versions OS: darwin 15.4.0 sitespeed.io: 4.0.0 browsertime: 1.0.0-alpha.15 coach: 0.22.1
[2016-05-13 21:28:58] Starting firefox for analysing https://www.sitespeed.io 1 time(s)
[2016-05-13 21:29:04] Starting firefox for analysing https://www.sitespeed.io/ 1 time(s)
peter at hoppla in ~/git/sitespeed.io on 4.0*
It worked, sort of… Found and fixed an issue with 1779d75. Btw, crawler.maxDepth is actually crawler.depth. Also, --crawler.enable=true is not needed; it's just a hack to enable the crawler (or any non-default plugin) without explicitly specifying any options. Specifying --crawler.maxPages is enough to make the plugin load.
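To spell out that loading rule: a non-default plugin gets pulled in as soon as any option under its namespace appears on the command line. Roughly (an illustration of the idea, not sitespeed.io's actual loader; the default plugin list is made up):

const defaultPlugins = ['browsertime', 'coach', 'html'];

function shouldLoad(pluginName, options) {
  if (defaultPlugins.includes(pluginName)) return true;
  // yargs turns "--crawler.maxPages 2" into options.crawler = { maxPages: 2 },
  // so the mere presence of the key enables the plugin.
  return options[pluginName] !== undefined;
}

// shouldLoad('crawler', { crawler: { maxPages: 2 } }) -> true
// shouldLoad('crawler', {}) -> false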
Cool, great!
Think this is ok, let's close this for alpha1 and open a new one if we find something.
We haven't put much love into testing crawling; let's test that before alpha1.