mozilla / page-metadata-parser

DEPRECATED - A Javascript library for parsing metadata on a web page.
https://www.npmjs.com/package/page-metadata-parser
Mozilla Public License 2.0

Handle duplicate meta tags? #53

Closed pdehaan closed 2 years ago

pdehaan commented 8 years ago

Another fascinating look into the life of @pdehaan...

Scraping http://www.blenderbabes.com/lifestyle-diet/dairy-free/lower-calories-lunch-falafel-recipe/ returns some unexpected results. Not sure if we can make this better, or if it falls under the "don't try and fix content" banner.

Like most WordPress sites, the user is probably using 9 different plugins which are trying to improve SEO and garbage like that. This leaves us with some interesting and conflicting data. My guess is that we yank the first matching DOM rule and ignore the rest, but in this edgy edge case that seems a bit suboptimal.

<meta> tags include, but are not limited to:

<!-- Open Graph tags generated by Open Graph Metabox for WordPress -->
<meta property="fb:app_id" content="689951950" />
<meta property="og:description" content="&nbsp;" />
<meta property="og:image" content="http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg" />
<meta property="og:title" content="Easy Falafel Recipe Made in a Vitamix or Blendtec Blender" />
<meta property="og:type" content="blog" />
<meta property="og:url" content="http://www.blenderbabes.com/lifestyle-diet/dairy-free/lower-calories-lunch-falafel-recipe/" />
<!-- /Open Graph tags generated by Open Graph Metabox for WordPress -->

<!-- WordPress Facebook Open Graph protocol plugin (WPFBOGP v2.0.13) http://rynoweb.com/wordpress-plugins/ -->
<meta property="fb:admins" content="689951950"/>
<meta property="og:description" content="Check out our latest Blender Giveaway!  EASY FALAFEL RECIPE If you&#039;re unfamiliar with this Middle Eastern street food, you&#039;re in luck! This relatively easy v"/>
<meta property="og:image" content="http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg"/>
<meta property="og:image" content="http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg"/>
<meta property="og:image" content="http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg"/>
<meta property="og:locale" content="en_us"/>
<meta property="og:site_name" content="Blender Babes - Healthy Smoothie Recipes | Blendtec vs Vitamix Reviews"/>
<meta property="og:title" content="Easy Falafel Recipe"/>
<meta property="og:type" content="article"/>
<meta property="og:url" content="http://www.blenderbabes.com/lifestyle-diet/dairy-free/lower-calories-lunch-falafel-recipe/"/>
<!-- // end wpfbogp -->

I sorted the tags for easy reference, but the highlights are:

  1. First cluster of meta tags returns &nbsp; for the og:description
  2. Second cluster of meta tags returns a valid-ish (albeit spammy) og:description
  3. First cluster returns blog for the og:type
  4. Second cluster returns article for the og:type

This may not be worth the effort. The only solution I can think of is to return an array of all the matches for a selector, and then try to calculate what the "best" result is, either by content length or something else. Also in the hideous mess of <meta> tags above is an og:image which is duplicated four times with the same value, and then there is the confusing og:type issue above, where we'd really just need to randomly return a type, or return an array of values, or whatever.
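To make the "array of matches, then pick the best" idea a bit more concrete, here is a rough sketch. It is not how page-metadata-parser works today, and `decodeEntities`, `collectValues`, and `pickBest` are hypothetical names invented for illustration. It also assumes the tags have already been pulled out of the DOM into plain `{property, content}` objects, and it uses "longest non-blank value wins" purely as a placeholder heuristic.

```javascript
// Hypothetical helpers for illustration; not part of page-metadata-parser's API.

// Decode only the handful of HTML entities that show up in the tags above.
function decodeEntities(text) {
  return text
    .replace(/&nbsp;/g, '\u00a0')
    .replace(/&#0*39;/g, "'")
    .replace(/&amp;/g, '&');
}

// Group every value by property, dropping blank values and exact duplicates.
// Insertion order is preserved, so the first tag on the page stays first.
function collectValues(tags) {
  const byProperty = new Map();
  for (const { property, content } of tags) {
    // trim() also strips non-breaking spaces, so "&nbsp;" becomes "".
    const value = decodeEntities(content).trim();
    if (!byProperty.has(property)) byProperty.set(property, []);
    const values = byProperty.get(property);
    if (value !== '' && !values.includes(value)) values.push(value);
  }
  return byProperty;
}

// Placeholder heuristic (an assumption): the longest value wins,
// and ties go to whichever value appeared first on the page.
function pickBest(values) {
  if (values.length === 0) return null;
  return values.reduce((best, v) => (v.length > best.length ? v : best));
}

// A subset of the tags from the page above, in document order.
const image = 'http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg';
const sampleTags = [
  { property: 'og:description', content: '&nbsp;' },
  { property: 'og:image', content: image },
  { property: 'og:type', content: 'blog' },
  { property: 'og:description', content: 'Check out our latest Blender Giveaway!  EASY FALAFEL RECIPE If you&#039;re unfamiliar with this Middle Eastern street food, you&#039;re in luck! This relatively easy v' },
  { property: 'og:image', content: image },
  { property: 'og:image', content: image },
  { property: 'og:image', content: image },
  { property: 'og:type', content: 'article' },
];

const grouped = collectValues(sampleTags);
console.log(grouped.get('og:image').length); // 1, the four duplicates collapse
console.log(grouped.get('og:type'));         // [ 'blog', 'article' ]
console.log(pickBest(grouped.get('og:description')));
```

Dropping blanks and duplicates handles the `&nbsp;` description and the repeated og:image without any guessing. Content length is a poor signal for og:type, though (it would pick `article` only because the word is longer), so for fields like that, returning the whole array and letting the caller decide is probably the more honest option.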