jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...
https://github.com/metainspector/metainspector
MIT License
1.03k stars 165 forks

og:title tag not found, title returning nothing #244

Closed macmartine closed 5 years ago

macmartine commented 5 years ago

This page has a title tag and an og:title tag, and the gem still returns no title:

http://www.ama-pdx.org/event/virtual-reality-marketing-strategy/

[4] pry(UrlAdder)> page.url
=> "http://www.ama-pdx.org/event/virtual-reality-marketing-strategy/"
[5] pry(UrlAdder)> page.title
=> ""
[6] pry(UrlAdder)> page.best_title
=> nil
[7] pry(UrlAdder)> page.description
=> nil

jschwindt commented 5 years ago

That page returns only a minimal HTML document that embeds an IFRAME, which is why you don't get any data. Apparently the site expects a session cookie before it returns the full HTML that you see in the browser:

> curl http://www.ama-pdx.org/event/virtual-reality-marketing-strategy/

Returns:

<html style="height:100%">
<head>
  <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  <meta name="format-detection" content="telephone=no">
  <meta name="viewport" content="initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>
<body style="margin:0px;height:100%">
  <iframe src="/_Incapsula_Resource?SWUDNSAI=9&xinfo=9-67653559-0%200NNN%20RT%281552353546321%200%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U19&incident_id=1241000040126355380-254073862832586825&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 1241000040126355380-254073862832586825</iframe>
</body>
</html>
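
To make that failure mode easier to spot from code, here is a minimal sketch of a check for this kind of placeholder response. It is only a heuristic based on the body shown above (no `<title>` element, plus an iframe whose `src` starts with `/_Incapsula_Resource`); the method name `incapsula_challenge?` is hypothetical and is not part of MetaInspector.

```ruby
# Heuristic check (an assumption based on the response shown above):
# the placeholder page has no <title> element and embeds an iframe
# whose src points at /_Incapsula_Resource.
def incapsula_challenge?(html)
  has_title = html.match?(/<title[\s>]/i)
  has_challenge_iframe = html.match?(%r{<iframe[^>]+src=["']/_Incapsula_Resource}i)
  !has_title && has_challenge_iframe
end

# Trimmed-down copy of the body returned by curl above.
blocked_body = <<~HTML
  <html style="height:100%">
  <head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head>
  <body><iframe src="/_Incapsula_Resource?SWUDNSAI=9" frameborder=0>Request unsuccessful.</iframe></body>
  </html>
HTML

# An ordinary page with a title, for comparison.
normal_body = "<html><head><title>Event</title></head><body></body></html>"

puts incapsula_challenge?(blocked_body)
puts incapsula_challenge?(normal_body)
```

If the raw HTML that the gem fetched is available as a string, a check like this could be run on it before trusting an empty `title`.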
jaimeiniesta commented 5 years ago

Thanks for the report @macmartine and for the explanation @jschwindt

macmartine commented 5 years ago

@jschwindt That's not what I see when I view source.

jschwindt commented 5 years ago

> @jschwindt That's not what I see when I view source.

That's because the browser is sending the cookie that the site expects. For example, this request responds with the whole HTML because I copied the cookie from Chrome:

curl 'http://www.ama-pdx.org/event/virtual-reality-marketing-strategy/' \
  -H 'Cookie: incap_ses_1241_1673620=NSFZCcLSdCLIIrse9uo4ER3xh1wAAAAA+xcMOna9Ra7KLStuHXuFWA=='
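
The same request can be sketched in Ruby with the standard library's `Net::HTTP`. This is only an illustration of sending a `Cookie` header, not how the gem does it internally; the helper names `request_with_cookie` and `fetch_with_cookie` and the cookie value are hypothetical placeholders, and a real session cookie would have to be copied from a browser and will expire.

```ruby
require "net/http"
require "uri"

# Build (but do not send) a GET request that carries a Cookie header.
def request_with_cookie(url, cookie)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request["Cookie"] = cookie
  request
end

# Send the request and return the response body as a String.
# Requires network access, so it is not called in this sketch.
def fetch_with_cookie(url, cookie)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request_with_cookie(url, cookie)).body
  end
end

url = "http://www.ama-pdx.org/event/virtual-reality-marketing-strategy/"
# Placeholder value (an assumption); substitute a cookie copied from the browser.
cookie = "incap_ses_example=PLACEHOLDER"

request = request_with_cookie(url, cookie)
puts request["Cookie"]
puts request.path
```

MetaInspector's README also documents a `headers:` option on `MetaInspector.new`, so passing `headers: { "Cookie" => cookie }` there may work as well, though I haven't verified that against this site.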

macmartine commented 5 years ago

Okay, thanks. How does Slack unfurl it then?

jschwindt commented 5 years ago

The site has some kind of "protection" to keep robots out (I'm guessing), and perhaps there is a way around it that Slack knows about... This is what I saw the first time I tried to browse the page:

[Screenshot: Screen Shot 2019-03-12 at 14 49 37]