jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...
https://github.com/metainspector/metainspector
MIT License
1.03k stars, 165 forks

A URL that can't be scraped (mobile version of the website) #194

Closed Abdelhady closed 5 years ago

Abdelhady commented 7 years ago

This URL can't be scraped and gives no results at all (maybe because it is the mobile version of that website!)

jaimeiniesta commented 7 years ago

Hello!

I can't reproduce this. What version of MetaInspector are you using? I've tried with 5.3.1 and it works fine, and the demo also works:

https://metainspectordemo.herokuapp.com/scrape?url=http%3A%2F%2Fmobile.nytimes.com%2F2016%2F11%2F15%2Fopinion%2Fmark-zuckerberg-is-in-denial.html

Abdelhady commented 7 years ago

I'm currently using 5.0.1, and actually the demo is not working: it is not giving any title, description, images, or anything else!

Try the non-mobile version of the same URL; it will give you the correct results.

jaimeiniesta commented 7 years ago

It looks like the NY Times server is returning different content, probably based on the request IP.

You're right that the demo doesn't show anything now, but it did when I tried before.

Also, when I try from my development machine, I can get all the information:

2.3.1 :008 > p.title
 => "Mark Zuckerberg Is in Denial - NYTimes.com"
2.3.1 :009 > p.description
 => "CHAPEL HILL, N.C. — Donald J. Trump’s supporters were probably heartened in September, when, according to an article shared nearly a million times on Facebook, the candidate received an endorsement from Pope Francis. Their opinions on Hillary Clinton may have soured even further after reading a Denver Guardian article that also spread widely on Facebook, which reported days before the election that an F.B.I. agent suspected of involvement in leaking Mrs. Clinton’s emails was found dead in an apparent murder-suicide."

Are you trying from a server or from your dev machine? I suggest trying from a different computer to see if it works there. If it does, I'm afraid we can't fix it in the code: if the remote server returns different content based on the location the request is made from, then that's the only HTML we can parse.

You could also try setting a different User-Agent string; maybe the server returns different content based on that.

https://github.com/jaimeiniesta/metainspector#headers
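To check whether the User-Agent is what changes the response, one option is to request the same URL with two different User-Agent strings and compare the results. The README section linked above covers passing headers to MetaInspector itself; the sketch below uses only Ruby's standard library so it can be run independently of the gem. The helper names `build_request` and `fetch_with_user_agent` and the User-Agent values are illustrative assumptions, not part of MetaInspector.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of MetaInspector): builds a GET request
# that carries the given User-Agent header, without sending it.
def build_request(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent
  request
end

# Hypothetical helper: sends the request and returns the raw response,
# so status, headers and body can be compared between User-Agents.
def fetch_with_user_agent(url, user_agent)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_request(url, user_agent))
  end
end
```

For example, calling `fetch_with_user_agent(url, 'Mozilla/5.0')` and `fetch_with_user_agent(url, 'MyScraper/1.0')` and comparing each response's `code`, `['Content-Type']` and `body.length` would show whether the server varies its output by User-Agent.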

Abdelhady commented 7 years ago

Well, I first noticed it in our production environment, located in the "US East (N. Virginia)" region, but then my development machine gave the same empty results. At first I suspected the older version I'm using (5.0.1), which is why I tried the demo, and it gave me the same results.

I think it is somehow related to the NY Times' mobile version, because their normal version works fine for me in both the production environment and on my dev machine.

jaimeiniesta commented 7 years ago

There's definitely something weird with that URL; now it's failing on my dev machine as well.

It still depends on what the server returns, which seems to change, as it sometimes worked fine for me.

Now what I see is a lot of scrambled text instead of a document:

2.2.4 :020 > p.url
 => "http://mobile.nytimes.com/2016/11/15/opinion/mark-zuckerberg-is-in-denial.html"
2.2.4 :021 > p.title
 => ""
2.2.4 :022 > p.to_s
 => "\u001F\x8B\b\u0000\u0000\u0000\u0000\u0000\u0000\u0003\xEC\xBDݒ\xDBH\xB6.v\xEF\xA7@s\xDCRqD\x80\u0000H\xF0\xA7JT\x9Fj\xB5zZ\xFBHݲ\xA4\xEE\x9E\xD9\xDA\u001A\u0005H\x82E\xB4@\x82\u0003\x80U*\x95jb\xDF\xF9\u0005\u001C\x8Ep\xC4\xF1#\xF8\xC2w\xBE\xF7\x9B\xECp\xF89\xBC\xBE\x95\t \xF1G\xB2J\xA59\xB3\u001D\xBD{O\t\u00042W\xAE\\\xB9\xFE\xF3\xEF\xE1W\xDF\xFD\xF4\xF8\xF5_^<і\xC9*x\xF4\u0010\u007F\xB5\xC0]\x9FMZ\u07BA\xA5\xCD\u00027\x8E'\xAD\u0016}\xF0\xDC\xF9\xA3\x87+/q\xA9d\xB2ѽ\

(...)
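The scrambled output above begins with the bytes 0x1F 0x8B 0x08, which is the gzip magic number followed by the deflate method byte. That suggests, though the thread does not confirm it, that the server sent a gzip-compressed body that was parsed without being decompressed. A minimal sketch of how one might detect and inflate such a body using only Ruby's standard library follows; the helper names `gzipped?` and `inflate_if_gzipped` are hypothetical and not part of MetaInspector.

```ruby
require 'zlib'
require 'stringio'

# First two bytes of every gzip stream.
GZIP_MAGIC = "\x1F\x8B".b

# Hypothetical helper: true if the raw body starts with the gzip magic number.
def gzipped?(body)
  body.b.start_with?(GZIP_MAGIC)
end

# Hypothetical helper: returns the decompressed body if it looks gzipped,
# otherwise returns the body unchanged.
def inflate_if_gzipped(body)
  return body unless gzipped?(body)

  Zlib::GzipReader.new(StringIO.new(body.b)).read
end
```

Running `gzipped?(p.to_s)` on the page from the session above would be one way to test this hypothesis. If it returns true, the inflated string is what an HTML parser would need to receive.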

jaimeiniesta commented 7 years ago

Proof that it's an intermittent error:

Screenshot 2016-11-21 at 22 20 14