Closed jonascript closed 11 years ago
What version of the html-snapshots library is your grunt-html-snapshots using? I recommend 0.2.0, because before that I implemented a file watcher which was unreliable at best. Since 0.2.0, it uses a polling scheme which performs much better for this task. To get grunt-html-snapshots with html-snapshots 0.2.0, you should just be able to update to the latest version of grunt-html-snapshots, and it will fix itself.
As for the timeout, there are three types of timeout values you can supply: number, object, or function. If you supply a number (or take the default), the same value is applied to each page (the default is 5 seconds for each). If you supply an object, it is treated as a set of key/value pairs where the key must match the url read from your sitemap, and the value must be a timeout number. If you supply a function, it is called back with the url from your sitemap, and you return a proper timeout number from your function.
Also, remember that you will get a timeout if the selector you specified never actually shows up in the page. One way to test this is to set the selector to 'body' (that is the default). The page will be considered ready for a snapshot as soon as the phantomjs script finds the body visible. If the timeout goes away after that, you know it is just not finding the selector you specified for that page.
Hope this helps, Alex
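To make the three timeout forms described above concrete, here is a minimal sketch. The URLs and numbers are placeholders, and `resolveTimeout` is a hypothetical helper written only for illustration; it is not the library's code, and the fallback for URLs missing from the object form is an assumption.

```javascript
// Form 1: a number, applied to every page.
var numberTimeout = 20000;

// Form 2: an object keyed by the url as read from the sitemap.
var objectTimeout = {
  "http://www.mysite.com/": 20000,
  "http://www.mysite.com/music": 8000
};

// Form 3: a function that receives the url and returns a timeout number.
var functionTimeout = function (url) {
  return /\/hottopic\//.test(url) ? 25000 : 10000;
};

// Hypothetical helper (NOT part of html-snapshots): shows how each form
// could map a url to a timeout. The fallback for unmatched urls is assumed.
function resolveTimeout(timeout, url, fallback) {
  if (typeof timeout === "function") {
    return timeout(url);
  }
  if (typeof timeout === "object" && timeout !== null) {
    return Object.prototype.hasOwnProperty.call(timeout, url)
      ? timeout[url]
      : fallback;
  }
  return typeof timeout === "number" ? timeout : fallback;
}
```

Any one of the three values would be passed as the timeout option in the grunt config. The function form is probably the most convenient when only a few sections of a site are slow.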
Are you still having problems with this?
Apologies for the delayed response. I was able to get this working by increasing the time limit to 20000, which since I currently have a small number of pages (< 20) is ok. I don't think it's a fix for the issue but a workaround. My config is below...
options: {
  input: "sitemap",
  source: "http://www.mysite.com/sitemap.xml",
  hostname: "http://www.mysite.com/",
  selector: { "__default": "body[data-status=ready]", "/": "body[data-status=ready]" },
  outputDirClean: true,
  phantomjs: "/usr/local/bin/phantomjs",
  timeout: 20000
},
I can't seem to reproduce this. Do you have a minimum breaking example? It should not use your actual site or an external version of phantomjs that cannot be verified.
FWIW, here are the options and html I used to try to reproduce this. My install used PhantomJS 1.9.2-1 that came with html-snapshots. Options:
html_snapshots: {
  all: {
    options: {
      selector: "body[data-status=ready]",
      outputDirClean: true,
      outputDir: "./snapshots",
      input: "array",
      source: ["http://content.local/minimal/"]
    }
  }
}
Html (from http://content.local/minimal/):
<!doctype html>
<html>
  <head>
    <title>minimal example</title>
  </head>
  <body data-status="ready">
    <h1>Hello Minimal Example</h1>
    <p>
      this is a minimal example
    </p>
    <script src="vendor/bower/jquery/jquery.js"></script>
  </body>
</html>
Please see my grunt config and the output below. When I set the timeout to 5000, it gives an error while retrieving the pages from the sitemap.xml, although not while actually processing each page. When I set it to a high value (20000), it goes through the whole sitemap, which seems to indicate that the timeout applies to the retrieval of the sitemap as well as to the individual pages. Hope this helps.
html_snapshots: {
  // options for all targets
  options: {
    input: "sitemap",
    source: "http://ntrsctn.com/sitemap.xml",
    hostname: "ntrstn.com",
    selector: { "__default": "body[data-status=ready]", "/": "body[data-status=ready]" },
    outputDirClean: true,
    phantomjs: "/usr/local/bin/phantomjs",
    timeout: 5000
  },
  // the debug target
  debug: {
    options: {
      outputDir: "./grunt-snapshots-test"
    }
  },
  // the release target
  release: {
    options: {
      outputDir: "./snapshots"
    }
  }
},
Creating snapshots for: ntrsctn.com
Running "html_snapshots:release" (html_snapshots) task
http://ntrsctn.com/sitemap.xml response: 200
Creating snapshot for http://ntrsctn.com/music...
>> html_snapshots failed
Warning: Task "html_snapshots:release" failed. Use --force to continue.
Aborted due to warnings.
Creating snapshot for http://ntrsctn.com/...
Creating snapshot for http://ntrsctn.com/entertainment...
Creating snapshot for http://ntrsctn.com/styleart...
Creating snapshot for http://ntrsctn.com/hottopic/terry-richardson...
Creating snapshot for http://ntrsctn.com/hottopic/pusha-t...
Creating snapshot for http://ntrsctn.com/hottopic/coffee...
Creating snapshot for http://ntrsctn.com/sports...
Creating snapshot for http://ntrsctn.com/gaming...
Creating snapshot for http://ntrsctn.com/hottopic/broncos...
Creating snapshot for http://ntrsctn.com/hottopic/levis...
Creating snapshot for http://ntrsctn.com/sneakers...
snapshot for http://ntrsctn.com/music finished in 516 ms
written to snapshots/music/index.html
snapshot for http://ntrsctn.com/entertainment finished in 500 ms
written to snapshots/entertainment/index.html
snapshot for http://ntrsctn.com/ finished in 500 ms
written to snapshots/index.html
snapshot for http://ntrsctn.com/hottopic/coffee finished in 505 ms
written to snapshots/hottopic/coffee/index.html
snapshot for http://ntrsctn.com/hottopic/terry-richardson finished in 501 ms
written to snapshots/hottopic/terry-richardson/index.html
snapshot for http://ntrsctn.com/gaming finished in 515 ms
written to snapshots/gaming/index.html
snapshot for http://ntrsctn.com/hottopic/broncos finished in 514 ms
written to snapshots/hottopic/broncos/index.html
snapshot for http://ntrsctn.com/hottopic/levis finished in 502 ms
written to snapshots/hottopic/levis/index.html
snapshot for http://ntrsctn.com/sneakers finished in 507 ms
written to snapshots/sneakers/index.html
snapshot for http://ntrsctn.com/styleart finished in 512 ms
written to snapshots/styleart/index.html
snapshot for http://ntrsctn.com/sports finished in 500 ms
written to snapshots/sports/index.html
snapshot for http://ntrsctn.com/hottopic/pusha-t finished in 508 ms
written to snapshots/hottopic/pusha-t/index.html
Thanks for this opportunity. This example allowed me to see two bugs in html-snapshots that I fixed and will get published asap. 1) It was possible for the file polling to start before the end of input. 2) The times were being grossly misreported by the phantomjs script. For #2, the time used to only start after the page.open callback. Now that I've moved the start of timing to just before the page.open call, the times are actually representative of the total load time. For the site you posted above, sometimes some of the pages are taking 18+ seconds. It varied quite a bit.
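To illustrate bug #2 above, here is a small simulation of why the reported times looked like ~500 ms while the real waits were far longer. It is only a sketch: the fake clock, `makeFakePage`, and `timeSnapshot` are hypothetical names made up for this example, and only the `page.open(url, callback)` shape is taken from PhantomJS. It is not the actual html-snapshots phantomjs script.

```javascript
// Hypothetical stand-in for a PhantomJS page: it advances a fake clock
// by loadMs to simulate the page load, then invokes the callback.
function makeFakePage(clock, loadMs) {
  return {
    open: function (url, callback) {
      clock.now += loadMs;
      callback("success");
    }
  };
}

// Measures one snapshot both ways so the two numbers can be compared.
function timeSnapshot(page, clock, url, selectorWaitMs) {
  var beforeOpen = clock.now; // fixed behavior: timing starts before page.open
  var result = {};
  page.open(url, function (status) {
    var afterOpen = clock.now; // old behavior: timing started in the callback
    clock.now += selectorWaitMs; // simulate waiting for the selector
    result.status = status;
    result.oldElapsed = clock.now - afterOpen;
    result.newElapsed = clock.now - beforeOpen;
  });
  return result;
}

// An 18 second page load followed by a 500 ms wait for the selector.
var clock = { now: 0 };
var timing = timeSnapshot(makeFakePage(clock, 18000), clock, "http://example.com/", 500);
console.log("old: " + timing.oldElapsed + " ms, new: " + timing.newElapsed + " ms");
```

With the old measurement only the selector wait is counted, so a page can report roughly 500 ms and still exceed a 5000 ms timeout that is counted from the start of the request.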
With the fix for localnerve/html-snapshots#10, I was successfully able to snapshot the site. I used the following options:
html_snapshots: {
  options: {
    input: "sitemap",
    source: "http://ntrsctn.com/sitemap.xml",
    selector: "body[data-status=ready]",
    outputDirClean: true,
    timeout: 25000
  },
  debug: {
    options: {
      outputDir: "./snapshots/debug"
    }
  },
  release: {
    options: {
      outputDir: "./snapshots/release"
    }
  }
}
You will need the latest version of grunt-html-snapshots to see the time information that includes the page load time experienced by the phantomjs script. The time disparity came from the times counting down from the start of the request in the parent process, but the child phantomjs scripts were only counting from after page load. Now they're not exact, but they're much closer. Keep in mind these are just times that are experienced by the parallel phantomjs child processes, and the default phantomjs script that they run. Most of the times I experienced were less than 10 seconds, but every so often, I would get some big outlier (like 18 seconds). So I just kept the timeout high to cover this case.
This was last week, so I'm going to close. Please don't hesitate to re-open or open another if you find something was missed.
Sorry for the delayed response. Very busy couple of weeks! Thank you for making the fix. I tested it and it works great.
FYI, on my local, the load times are around 10 seconds, but on the server they're more around 500ms. Not sure why there's that discrepancy, but since this module runs from the live server it's not currently an issue. Thanks again!
If you can, I'd really appreciate a brief overview of how you are using this from a "live" server. I'm unfamiliar with this use case, and should know more about it. With the little knowledge I have, I'm interested in:
- Why you can't/don't-want-to pre-build the snapshots.
- What is your stack (I presume NodeJS on some app service with ExpressJS).
- Would it be better for this use case if the processing was not async?
- Do you only build the snapshots once (on the first search engine request)? On a scheduled worker?
Really, any info you can share to get me started understanding about this use case would be awesome.
Sure, by live server I mean we're running this on the production server as opposed to a dev environment or local environment.
Please just lmk if you have any more questions. Thanks again for your help.
Jon
@jonascript Your info has been very helpful: I've started using this on Heroku using Redis and Node, building regularly via a cron job. I've added some new features to help with this: the processLimit option allows you to control how many phantomjs processes ever run at once (1 forces sequential snapshots). It is helpful for some environments, and with predictability for apps with dynamic routes. A full write up on my technique is available here
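As a rough sketch of how the processLimit option mentioned above might sit in a grunt config, assuming the same option layout used earlier in this thread (the sitemap URL, output directory, and variable name are placeholders):

```javascript
// Placeholder config: processLimit of 1 forces sequential snapshots,
// as described above; larger values allow that many phantomjs
// processes to run at once.
var snapshotConfig = {
  html_snapshots: {
    options: {
      input: "sitemap",
      source: "http://www.mysite.com/sitemap.xml",
      selector: "body[data-status=ready]",
      outputDirClean: true,
      timeout: 25000,
      processLimit: 1
    },
    release: {
      options: {
        outputDir: "./snapshots/release"
      }
    }
  }
};
```

In a Gruntfile this object would be passed to grunt.initConfig. Running sequentially trades total run time for lower memory use, so the timeout may need less headroom than it does with many parallel processes.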
First of all, thanks for making this task. It's great for helping make ajax content crawlable.
For the timeout param, if you specify a number (e.g. 20000), is it supposed to be the combined timeout for all pages retrieved from the input param? In my case, I'm pulling pages from a sitemap. I noticed that when I set the timeout to 5000 (or left the default) in the grunt task, it throws an error after 5 seconds even though the pages are retrieved in about 500 ms each (10 pages).
I checked the docs for html-snapshots, but it appears that the timeout param in html_snapshots.run is for individual pages.
Any help or info is greatly appreciated.