localnerve / grunt-html-snapshots

Grunt task for html-snapshots
MIT License

timeout question #1

Closed jonascript closed 11 years ago

jonascript commented 11 years ago

First of all thanks for making this task. It's great for helping make ajax content crawlable.

For the timeout param, if you specify a number (e.g. 20000), is it supposed to be the combined timeout for all pages retrieved from the input param? In my case, I'm pulling pages from a sitemap. I noticed that when I set the timeout to 5000 (or left the default) in the grunt task, it throws an error after 5 seconds even though the pages are retrieved in about 500 ms each (10 pages).

I checked the docs for html-snapshots, but it appears that the timeout param in html_snapshots.run is for individual pages.

Any help or info is greatly appreciated.

localnerve commented 11 years ago

What version of the html-snapshots library is your grunt-html-snapshots using? I recommend 0.2.0, because before that I implemented a file watcher which was unreliable at best. Since 0.2.0, it uses a polling scheme which performs much better for this task. To get grunt-html-snapshots with html-snapshots 0.2.0, you should only need to update to the latest version of grunt-html-snapshots.

As for the timeout, there are three types of timeout values you can supply: number, object, or function. If you supply a number (or take the default), the same value is applied to each page (the default is 5000 ms, i.e. 5 seconds, for each). If you supply an object, it is treated as a set of key/value pairs where each key must match a url read from your sitemap, and each value must be a timeout number. If you supply a function, it is called with the url from your sitemap, and you return the timeout number for that page.

Also, remember that you will get a timeout if the selector you specified never actually shows up in the page. One way to test this is to set the selector to 'body' (that is the default). The page is then considered ready for a snapshot as soon as the phantomjs script finds the body visible. If the timeout goes away after that, you know it is just not finding the selector you specified for that page.

Hope this helps, Alex
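The three timeout forms described above can be sketched as follows. This is only an illustration of the semantics as stated in this thread, not the library's actual code: `resolveTimeout`, `DEFAULT_TIMEOUT_MS`, the example urls, and the fall-back to the default when an object has no matching key are all assumptions made for the example.

```javascript
// Hypothetical helper (not part of html-snapshots): resolves the timeout for
// one url from a number, object, or function option, as described above.
var DEFAULT_TIMEOUT_MS = 5000; // assumption: the default discussed in this thread

function resolveTimeout(timeoutOption, url) {
  if (typeof timeoutOption === "function") {
    // Function form: called with the url, returns that page's timeout.
    return timeoutOption(url);
  }
  if (timeoutOption !== null && typeof timeoutOption === "object") {
    // Object form: keys are urls, values are timeouts.
    // Falling back to the default for unknown urls is an assumption.
    return Object.prototype.hasOwnProperty.call(timeoutOption, url)
      ? timeoutOption[url]
      : DEFAULT_TIMEOUT_MS;
  }
  // Number form (or nothing supplied): the same value applies to every page.
  return typeof timeoutOption === "number" ? timeoutOption : DEFAULT_TIMEOUT_MS;
}

// The three forms, using made-up urls:
var asNumber = 20000;
var asObject = { "http://www.mysite.com/slow": 25000 };
var asFunction = function (url) {
  return /\/slow$/.test(url) ? 25000 : 8000;
};

console.log(resolveTimeout(asNumber, "http://www.mysite.com/"));     // 20000
console.log(resolveTimeout(asObject, "http://www.mysite.com/slow")); // 25000
console.log(resolveTimeout(asFunction, "http://www.mysite.com/fast")); // 8000
```

In every form the value applies to a single page, so a number such as 20000 is a per-page limit rather than a budget shared across the whole sitemap.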

localnerve commented 11 years ago

Are you still having problems with this?

jonascript commented 11 years ago

Apologies for the delayed response. I was able to get this working by increasing the time limit to 20000, which since I currently have a small number of pages (< 20) is ok. I don't think it's a fix for the issue but a workaround. My config is below...

    options: {
      input: "sitemap",
      source: "http://www.mysite.com/sitemap.xml",
      hostname: "http://www.mysite.com/",
      selector: { "__default": "body[data-status=ready]", "/": "body[data-status=ready]" },
      outputDirClean: true,
      phantomjs: "/usr/local/bin/phantomjs",
      timeout: 20000
    },

localnerve commented 11 years ago

I can't seem to reproduce this. Do you have a minimum breaking example? It should not use your actual site or an external version of phantomjs that cannot be verified.

localnerve commented 11 years ago

FWIW, here are the options and html I used to try to reproduce this. My install used PhantomJS 1.9.2-1 that came with html-snapshots. Options:

    html_snapshots: {
      all: {
        options: {
          selector: "body[data-status=ready]",
          outputDirClean: true,
          outputDir: "./snapshots",
          input: "array",
          source: ["http://content.local/minimal/"]
        }
      }
    }

Html (from http://content.local/minimal/):

<!doctype html>
<html>
<head>
  <title>minimal example</title>
</head>
<body data-status="ready">
  <h1>Hello Minimal Example</h1>
  <p>
    this is a minimal example
  </p>
  <script src="vendor/bower/jquery/jquery.js"></script>
</body>
</html>

jonascript commented 11 years ago

Please see my grunt config and the output below. When I set the timeout to 5000, it gives an error when retrieving the pages from the sitemap.xml, although not when actually processing each page. When I set it to a high value (20000), it goes through the whole sitemap, which seems to indicate that the timeout applies to the retrieval of the sitemap as well as to the individual pages. Hope this helps.

    html_snapshots: {
      // options for all targets
      options: {
        input: "sitemap",
        source: "http://ntrsctn.com/sitemap.xml",
        hostname: "ntrstn.com",
        selector: { "__default": "body[data-status=ready]", "/": "body[data-status=ready]" },
        outputDirClean: true,
        phantomjs: "/usr/local/bin/phantomjs",
        timeout: 5000
      },
      // the debug target
      debug: {
        options: {
          outputDir: "./grunt-snapshots-test"
        }
      },
      // the release target
      release: {
        options: {
          outputDir: "./snapshots"
        }
      }
    },

Creating snapshots for: ntrsctn.com
Running "html_snapshots:release" (html_snapshots) task
http://ntrsctn.com/sitemap.xml response: 200
Creating snapshot for http://ntrsctn.com/music...
>> html_snapshots failed
Warning: Task "html_snapshots:release" failed. Use --force to continue.

Aborted due to warnings.
Creating snapshot for http://ntrsctn.com/...
Creating snapshot for http://ntrsctn.com/entertainment...
Creating snapshot for http://ntrsctn.com/styleart...
Creating snapshot for http://ntrsctn.com/hottopic/terry-richardson...
Creating snapshot for http://ntrsctn.com/hottopic/pusha-t...
Creating snapshot for http://ntrsctn.com/hottopic/coffee...
Creating snapshot for http://ntrsctn.com/sports...
Creating snapshot for http://ntrsctn.com/gaming...
Creating snapshot for http://ntrsctn.com/hottopic/broncos...
Creating snapshot for http://ntrsctn.com/hottopic/levis...
Creating snapshot for http://ntrsctn.com/sneakers...
snapshot for http://ntrsctn.com/music finished in 516 ms
  written to snapshots/music/index.html
snapshot for http://ntrsctn.com/entertainment finished in 500 ms
  written to snapshots/entertainment/index.html
snapshot for http://ntrsctn.com/ finished in 500 ms
  written to snapshots/index.html
snapshot for http://ntrsctn.com/hottopic/coffee finished in 505 ms
  written to snapshots/hottopic/coffee/index.html
snapshot for http://ntrsctn.com/hottopic/terry-richardson finished in 501 ms
  written to snapshots/hottopic/terry-richardson/index.html
snapshot for http://ntrsctn.com/gaming finished in 515 ms
  written to snapshots/gaming/index.html
snapshot for http://ntrsctn.com/hottopic/broncos finished in 514 ms
  written to snapshots/hottopic/broncos/index.html
snapshot for http://ntrsctn.com/hottopic/levis finished in 502 ms
  written to snapshots/hottopic/levis/index.html
snapshot for http://ntrsctn.com/sneakers finished in 507 ms
  written to snapshots/sneakers/index.html
snapshot for http://ntrsctn.com/styleart finished in 512 ms
  written to snapshots/styleart/index.html
snapshot for http://ntrsctn.com/sports finished in 500 ms
  written to snapshots/sports/index.html
snapshot for http://ntrsctn.com/hottopic/pusha-t finished in 508 ms
  written to snapshots/hottopic/pusha-t/index.html

localnerve commented 11 years ago

Thanks for this opportunity. This example allowed me to see two bugs in html-snapshots that I fixed and will get published asap:

  1. It was possible for the file polling to start before the end of input.
  2. The times were being grossly misreported by the phantomjs script.

For the second bug, the timing used to start only after the page.open callback. Now that I've moved the start of timing to just before the page.open call, the times are actually representative of the total load time. For the site you posted above, some of the pages sometimes take 18+ seconds. It varied quite a bit.
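The effect of moving the timing start can be illustrated with a small simulation. This is a sketch only, not the actual phantomjs script: `timeSnapshot`, `fakeOpen`, `fakeWait`, and the fake clock are hypothetical stand-ins so the example runs in plain Node without PhantomJS, and the 18000 ms and 500 ms figures are simply taken from the numbers mentioned in this thread.

```javascript
// Hypothetical helper: times one snapshot two ways to show the reporting gap.
// `open(url, callback)` stands in for page.open; `now()` is an injectable clock.
function timeSnapshot(open, url, now, waitForSelector) {
  var beforeOpen = now(); // fixed behavior: timing starts before the open call
  var result = {};
  open(url, function (status) {
    var afterOpen = now(); // old behavior: timing started inside the callback
    waitForSelector();     // stand-in for waiting until the selector is visible
    var done = now();
    result.status = status;
    result.reportedOld = done - afterOpen;  // excludes the page load itself
    result.reportedNew = done - beforeOpen; // includes the page load
  });
  return result;
}

// Simulated clock and page: the load takes 18000 ms, the selector wait 500 ms.
var clock = 0;
var now = function () { return clock; };
var fakeOpen = function (url, callback) { clock += 18000; callback("success"); };
var fakeWait = function () { clock += 500; };

var timing = timeSnapshot(fakeOpen, "http://example.com/", now, fakeWait);
console.log(timing.reportedOld, timing.reportedNew); // 500 18500
```

Under these assumptions the old measurement reports about 500 ms for a page that actually needed 18.5 seconds, which is consistent with a 5000 ms timeout firing while the log shows every page finishing in roughly half a second.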

localnerve commented 11 years ago

With the fix for localnerve/html-snapshots#10, I was successfully able to snapshot the site. I used the following options:

  html_snapshots: {
    options: {
      input: "sitemap",
      source: "http://ntrsctn.com/sitemap.xml",
      selector: "body[data-status=ready]",
      outputDirClean: true,
      timeout: 25000
    },
    debug: {
      options: {
        outputDir: "./snapshots/debug"
      }
    },
    release: {
      options: {
        outputDir: "./snapshots/release"
      }
    }
  }

You will need the latest version of grunt-html-snapshots to see the time information that includes the page load time experienced by the phantomjs script. The time disparity came from the times counting down from the start of the request in the parent process, but the child phantomjs scripts were only counting from after page load. Now they're not exact, but they're much closer. Keep in mind these are just times that are experienced by the parallel phantomjs child processes, and the default phantomjs script that they run. Most of the times I experienced were less than 10 seconds, but every so often, I would get some big outlier (like 18 seconds). So I just kept the timeout high to cover this case.

localnerve commented 11 years ago

This was last week, so I'm going to close. Please don't hesitate to re-open or open another if you find something was missed.

jonascript commented 11 years ago

Sorry for the delayed response. Very busy couple of weeks! Thank you for making the fix. I tested it and it works great.

FYI, on my local, the load times are around 10 seconds, but on the server they're more around 500ms. Not sure why there's that discrepancy, but since this module runs from the live server it's not currently an issue. Thanks again!

localnerve commented 11 years ago

If you can, I'd really appreciate a brief overview of how you are using this from a "live" server. I'm unfamiliar with this use case, and should know more about it. With the little knowledge I have, I'm interested in:

  1. Why you can't/don't-want-to pre-build the snapshots.
  2. What is your stack (I presume NodeJS on some app service with ExpressJS).
  3. Would it be better for this use case if the processing was not async?
  4. Do you only build the snapshots once (on the first search engine request)? On a scheduled worker?

Really, any info you can share to get me started understanding about this use case would be awesome.

jonascript commented 11 years ago

Sure, by live server I mean we're running this on the production server as opposed to a dev environment or local environment.

  1. We are prebuilding the snapshots using html-snapshots and then serving them to search engines using the Google escaped fragment solution.
  2. We're using PHP & Redis, but to generate the snapshots, we're using node & grunt and storing them on the filesystem using your module.
  3. It's possible it would help avoid some issues if there were a synchronous option.
  4. We build them regularly via a cron job.

Please just lmk if you have any more questions. Thanks again for your help.

localnerve commented 10 years ago

@jonascript Your info has been very helpful: I've started using this on Heroku using Redis and Node building regularly via a cron job. I've added some new features to help with this:

  1. A processLimit option allows you to control how many phantomjs processes ever run at once (1 forces sequential snapshots). Helpful for some environments, and with predictability for apps with dynamic routes.
  2. A callback argument was added that allows you to do something else with the snapshots (other than leave the snapshots on the filesystem). I actually store my snapshots in Redis, using the incoming route as the key to the content. This is required for a scalable environment like Heroku.
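As a rough sketch of the second item, the body of such a callback could copy each finished snapshot into a key/value store under its route. The thread does not show the callback's real signature, so everything here is an assumption for illustration: `routeKey` and `storeSnapshots` are hypothetical helpers, the list of snapshot files is passed in directly, and a `Map` with an injected `readFile` stands in for Redis and the filesystem.

```javascript
var path = require("path");

// Hypothetical helper: derive the route key from a snapshot file path,
// e.g. "snapshots/music/index.html" -> "/music".
function routeKey(outputDir, snapshotFile) {
  var relativeDir = path.posix.dirname(path.posix.relative(outputDir, snapshotFile));
  return relativeDir === "." ? "/" : "/" + relativeDir;
}

// Hypothetical callback body: store each snapshot's content under its route.
// `readFile` and `store` are injected so a Redis client could replace the Map.
function storeSnapshots(outputDir, snapshotFiles, readFile, store) {
  snapshotFiles.forEach(function (file) {
    store.set(routeKey(outputDir, file), readFile(file));
  });
  return store;
}

// Usage with in-memory stand-ins for the filesystem and Redis:
var fakeFiles = {
  "snapshots/index.html": "<html>home</html>",
  "snapshots/music/index.html": "<html>music</html>"
};
var stored = storeSnapshots(
  "./snapshots",
  Object.keys(fakeFiles),
  function (file) { return fakeFiles[file]; },
  new Map()
);
console.log(stored.get("/music")); // <html>music</html>
```

Keying by the incoming route means a request handler can look up a snapshot with a single read, without touching the local filesystem.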

A full write up on my technique is available here