My guess (after a few tests) is that the HTML from CNN is now a real piece of shit, with way too many styles & scripts inlined (just check the source yourself, it's really ugly), and the parser can't properly parse the HTML, which means we then can't extract data from it.
Hmm, their source is super ugly, but I don't remember it looking very different from a couple of months ago, when I was looking at the issue of it redirecting to an unsupported browser page. What's weird is that when I run it with f43.me, it seems to get the title and other <meta> tags OK, but then they don't show up in the final result. Based on that I was wondering if it might be some kind of redirect again (possibly to the unsupported browser page like before); a quick check for that is sketched after the OG dump below.
{
"og_pubdate": "2018-08-26T18:20:59Z",
"og_url": "https://www.cnn.com/2018/08/26/us/jacksonville-madden-shooting/index.html",
"og_title": "Mass shooting at video game tournament in Jacksonville leaves multiple dead",
"og_description": "Multiple people were killed in a shooting during a video game tournament at a shopping and dining complex in downtown Jacksonville, Florida, the Jacksonville Sheriff's Office said Sunday afternoon.",
"og_site_name": "CNN",
"og_type": "article",
"og_image": "https://cdn.cnn.com/cnnnext/dam/assets/180826145832-01-jacksonville-shooting-0826-super-tease.jpg",
"og_image_width": "1100",
"og_image_height": "619"
}
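One way to check the redirect theory is to look at the status and effective URL that graby reports alongside the extracted HTML. A minimal sketch, assuming the result array exposes 'status', 'url' and 'html' keys (that matches the graby versions I've looked at, but it's worth verifying against your install):

```php
<?php

use Graby\Graby;

require __DIR__ . '/vendor/autoload.php';

$graby = new Graby();
$result = $graby->fetchContent('https://www.cnn.com/2018/08/26/us/jacksonville-madden-shooting/index.html');

// If graby got bounced to the "unsupported browser" page again, the
// effective URL and/or status code should give it away.
echo 'status:      ' . $result['status'] . PHP_EOL;
echo 'final URL:   ' . $result['url'] . PHP_EOL;
echo 'html length: ' . strlen($result['html']) . PHP_EOL;
```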
Actually, I just noticed something: CNN links seem to start working with f43.me when I switch the parser to 'External'. It's too bad there's no source code for the Mercury parser; it would be interesting to see what it's doing differently.
Is there any way to make graby dump what it actually parsed on a failure? I've spent far too much time on this (I don't even really read CNN except for breaking news lol), but I'm frustrated that it went back to not working after getting fixed in 15aa9c6519124c748aaa1f5f8c7fd569f1823c73. Before, I know I could see the unsupported browser page URL in the debug logs; I really want to know what happens between the point where it appears to parse the OpenGraph data correctly and the point where it fails to come up with anything.
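In case it helps: the closest thing I've found is attaching a PSR-3 logger and reading graby's debug output. A rough sketch, assuming this graby release exposes setLogger() and that Monolog is installed; treat both as assumptions to verify against the version you're running:

```php
<?php

use Graby\Graby;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

require __DIR__ . '/vendor/autoload.php';

// Assumption: this graby release accepts a PSR-3 logger via setLogger().
// The debug log should show which URLs were actually fetched and what the
// extractor did between the OpenGraph parse and the empty final result.
$logger = new Logger('graby');
$logger->pushHandler(new StreamHandler(__DIR__ . '/graby-debug.log', Logger::DEBUG));

$graby = new Graby();
$graby->setLogger($logger);

$result = $graby->fetchContent('https://www.cnn.com/2018/08/26/us/jacksonville-madden-shooting/index.html');
```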
The problem seems to come from Readability. There are pre filters there that hard-remove code from the HTML page, like style & script tags: https://github.com/j0k3r/php-readability/blob/master/src/Readability.php#L122
And it seems that removing the style tags (which are way too heavy on CNN) removes the whole page. That's why nothing comes out of graby.
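A rough way to sanity-check that outside of Readability is to run a style/script strip over the raw CNN HTML yourself and compare sizes. The regexes below are only illustrative, not the exact pre_filters the library ships (see Readability.php#L122 for the real list):

```php
<?php

// Rough check: how much of the raw CNN page survives a style/script strip?
// The patterns here are illustrative only, NOT the exact pre_filters from
// php-readability.
$context = stream_context_create([
    'http' => ['header' => "User-Agent: Mozilla/5.0\r\n"],
]);
$html = file_get_contents(
    'https://www.cnn.com/2018/08/26/us/jacksonville-madden-shooting/index.html',
    false,
    $context
);

$stripped = preg_replace('!<style[^>]*>.*?</style>!is', '', $html);
$stripped = preg_replace('!<script[^>]*>.*?</script>!is', '', $stripped);

printf("before: %d bytes, after: %d bytes\n", strlen($html), strlen($stripped));
// If "after" collapses to almost nothing, the filters really are taking the
// article content down along with the inline styles.
```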
The pre_filters parameters are defined globally, not on a per-site_config basis.
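For completeness, this is roughly how extra pre_filters would be passed through graby's global config, keeping in mind they then apply to every site graby fetches. A sketch, assuming a readability.pre_filters config key; double-check it against the config array your graby version actually ships:

```php
<?php

use Graby\Graby;

require __DIR__ . '/vendor/autoload.php';

// Assumption: graby exposes readability.pre_filters in its config array.
// Note this is global: the extra filter below would run on EVERY site,
// there is no per-site_config equivalent.
$graby = new Graby([
    'readability' => [
        'pre_filters' => [
            // strip CNN's huge inline <style> blocks before parsing
            '!<style[^>]*>.*?</style>!is' => '',
        ],
    ],
]);

$result = $graby->fetchContent('https://www.cnn.com/2018/08/26/us/jacksonville-madden-shooting/index.html');
```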
Argh, I just remembered that they have m.cnn.com; the source on that is way cleaner and is parsable. Instead of messing with the Readability filters I can just use those URLs.
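For anyone else who lands here, the workaround is just a URL rewrite before handing the link to graby. A quick sketch; the str_replace is naive and assumes m.cnn.com mirrors the same article paths as www.cnn.com, which it seemed to at the time:

```php
<?php

use Graby\Graby;

require __DIR__ . '/vendor/autoload.php';

// Naive workaround: point graby at the mobile site, whose markup is light
// enough for Readability to cope with. Assumes the mobile host serves the
// same article paths as www.cnn.com.
$url = 'https://www.cnn.com/2018/08/26/us/jacksonville-madden-shooting/index.html';
$mobileUrl = str_replace('://www.cnn.com/', '://m.cnn.com/', $url);

$graby = new Graby();
$result = $graby->fetchContent($mobileUrl);
```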
Thank you for taking the time to look!
I know that the issue with the IE conditional was recently fixed, but I just discovered that it's no longer working. It's not getting redirected to the "Unsupported browser" page like before, however. From poking around their site in dev tools, the layout hasn't changed at all. By messing with it on f43.me, the only thing I'm able to see is that when it tries to grab the exact same div as before, the content length is way smaller than it should be.