toolness / lotl-site-prototype

Node front-end for Life of The Law.
http://lifeofthelaw.org

Search result pages are not searchable by web crawlers #1

Open toolness opened 9 years ago

toolness commented 9 years ago

Because everything is powered by JS, web crawlers and tools like Readability, Pocket, etc. won't be able to make sense of the site.

stenington commented 9 years ago

Crawlers

Here's an article discussing three strategies for the crawler problem:

  1. Use <noscript> to include content on the page
  2. Use hash-bang fragments or <meta name="fragment" content="!"> to make the crawler re-request the URL with _escaped_fragment_= in the query string, and respond to that request differently, somehow (Google-only, sort of)
  3. Detect bots and re-request the same page headlessly, sending back the results

All of these add some amount of duplication or overhead, although I guess that's sort of inherent in the task.

There's a lot more info on the _escaped_fragment_= technique here.
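As a rough sketch of option 2: the server checks incoming URLs for the _escaped_fragment_= parameter and maps it back to the hash-bang fragment before deciding what to render. Function names here are illustrative, not anything in this codebase:

```javascript
// Sketch of _escaped_fragment_ handling (names are hypothetical).
// Google rewrites "example.com/#!key=value" to
// "example.com/?_escaped_fragment_=key=value" before crawling.

function isEscapedFragmentRequest(requestUrl) {
  // True when the crawler is asking for the pre-rendered version.
  const parsed = new URL(requestUrl, 'http://localhost');
  return parsed.searchParams.has('_escaped_fragment_');
}

function fragmentFromUrl(requestUrl) {
  // Recover the original hash-bang fragment the crawler was asking about.
  // searchParams.get() already percent-decodes the value.
  const parsed = new URL(requestUrl, 'http://localhost');
  const value = parsed.searchParams.get('_escaped_fragment_');
  return value === null ? null : '#!' + value;
}
```

A request handler would branch on isEscapedFragmentRequest() and serve static pre-rendered HTML for that fragment instead of the JS app.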

@toolness also mentioned that we could pipe through the old pages from Wordpress for bots or old browsers, although for bots that would mean thumbnail previews would show the old theme. Thumbnails are potentially tricky for <noscript> solutions too, since simply dumping in the content without proper styles means the thumbnails will also be wrong.
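If we went the bot-detection route (option 3, or the Wordpress-passthrough idea), a user-agent sniff like this could gate it. The pattern list below is a guess, not exhaustive, and would need maintaining:

```javascript
// Illustrative bot sniffing: match common crawler user-agent substrings.
// The list is an assumption; real deployments would keep it updated.
const BOT_PATTERNS = ['googlebot', 'bingbot', 'baiduspider', 'yandex', 'duckduckbot'];

function looksLikeBot(userAgent) {
  if (!userAgent) return false; // no UA header at all
  const ua = userAgent.toLowerCase();
  return BOT_PATTERNS.some(pattern => ua.includes(pattern));
}
```

User-agent sniffing is fragile (crawlers can spoof or change UAs), which is part of the "overhead" trade-off mentioned above.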

Readers

I haven't found info on how Pocket works. Instapaper can read the Open Graph Protocol. I assume anything done to assist crawlers will also assist readers, but I'm not sure.
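For readers that understand Open Graph, the server could emit OG meta tags in the page head. A minimal sketch, assuming a hypothetical post object with title/url/image fields:

```javascript
// Sketch: generate Open Graph meta tags server-side so read-later tools
// can extract title/URL/image without running the app's JS.
// The post object shape here is an assumption, not from this codebase.
function openGraphTags(post) {
  // Minimal HTML-escaping for attribute values.
  const escapeAttr = s =>
    s.replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');
  return [
    `<meta property="og:title" content="${escapeAttr(post.title)}">`,
    `<meta property="og:type" content="article">`,
    `<meta property="og:url" content="${escapeAttr(post.url)}">`,
    `<meta property="og:image" content="${escapeAttr(post.image)}">`,
  ].join('\n');
}
```

These would go in the pre-rendered page head, so the same work that helps crawlers would help Instapaper too.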

stenington commented 9 years ago

Maybe thumbnails aren't a big deal. I'm not sure if it's a setting I have enabled, but I don't actually see thumbnails of pages in my search results on Google...

toolness commented 9 years ago

Since post detail pages are now pre-rendered on the server-side, I'm de-prioritizing this ticket from launch, as the search result pages don't seem like they'd be as important to get spidered/cached/etc.