j0k3r / graby

Graby helps you extract article content from web pages
MIT License

cnn.com/edition.cnn.com no longer working #159

Closed 4oo4 closed 6 years ago

4oo4 commented 6 years ago

I know that the issue with the IE conditional was recently fixed, but I just discovered that extraction is no longer working. This time it's not getting redirected to the "Unsupported browser" page like before. From poking around their site in dev tools, the layout hasn't changed at all. From messing with it on f43.me, the only thing I can see is that when it tries to grab the exact same div as before, the content length is way smaller than it should be.

[2018-07-09 12:02:23] graby.DEBUG: Graby is ready to fetch [] []
[2018-07-09 12:02:23] graby.DEBUG: . looking for site config for cnn.com in primary folder {"host":"cnn.com"} []
[2018-07-09 12:02:23] graby.DEBUG: ... found site config cnn.com.txt {"host":"cnn.com.txt"} []
[2018-07-09 12:02:23] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 12:02:23] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 12:02:23] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-07-09 12:02:23] graby.DEBUG: Cached site config with key: cnn.com {"key":"cnn.com"} []
[2018-07-09 12:02:23] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 12:02:23] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-07-09 12:02:23] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 12:02:23] graby.DEBUG: Cached site config with key: global {"key":"global"} []
[2018-07-09 12:02:23] graby.DEBUG: Cached site config with key: cnn.com.merged {"key":"cnn.com.merged"} []
[2018-07-09 12:02:23] graby.DEBUG: Fetching url: https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/ {"url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:23] graby.DEBUG: Trying using method "get" on url "https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/" {"method":"get","url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:23] graby.DEBUG: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:23] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/"} []
[2018-07-09 12:02:24] graby.DEBUG: Data fetched: [array] {"data":{"effective_url":"https://edition.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/","body":"(only length for debug): 1987844","headers":"text/html; charset=utf-8","all_headers":{"content-type":"text/html; charset=utf-8","x-servedbyhost":"::ffff:172.17.3.30","access-control-allow-origin":"*","cache-control":"max-age=60","content-security-policy":"default-src 'self' blob: https://*.cnn.com:* http://*.cnn.com:* *.cnn.io:* *.cnn.net:* *.turner.com:* *.turner.io:* *.ugdturner.com:* courageousstudio.com *.vgtf.net:*; script-src 'unsafe-eval' 'unsafe-inline' 'self' *; style-src 'unsafe-inline' 'self' blob: *; child-src 'self' blob: *; frame-src 'self' *; object-src 'self' *; img-src 'self' data: blob: *; media-src 'self' data: blob: *; font-src 'self' data: *; connect-src 'self' *; frame-ancestors 'self' https://*.cnn.com:* http://*.cnn.com https://*.cnn.io:* http://*.cnn.io:* *.turner.com:* courageousstudio.com;","x-content-type-options":"nosniff","x-xss-protection":"1; mode=block","via":"1.1 varnish, 1.1 varnish","content-length":"1988345","accept-ranges":"bytes","date":"Mon, 09 Jul 2018 17:02:24 GMT","age":"0","connection":"keep-alive","set-cookie":"countryCode=US; Domain=.cnn.com; Path=/, geoData=middletown|NY|10941|US|NA; Domain=.cnn.com; Path=/","x-served-by":"cache-iad2149-IAD, cache-msp9223-MSP","x-cache":"MISS, MISS","x-cache-hits":"0, 0","x-timer":"S1531155744.985921,VS0,VE732","vary":"Accept-Encoding, Fastly-SSL"},"status":200}} []
[2018-07-09 12:02:24] graby.DEBUG: Treating as UTF-8 {"encoding":"utf-8"} []
[2018-07-09 12:02:25] graby.DEBUG: Opengraph data: [array] {"ogData":{"og_pubdate":"2018-07-09T13:35:34Z","og_url":"https://www.cnn.com/2018/07/09/politics/steve-bannon-bookstore-harassment/index.html","og_title":"Steve Bannon called 'piece of trash' by heckler at bookstore","og_description":"Former White House chief strategist Steve Bannon became the latest figure from President Donald Trump's world to be targeted with public harassment while browsing books in Richmond, Virginia, on Saturday afternoon. ","og_site_name":"CNN","og_type":"article","og_image":"https://cdn.cnn.com/cnnnext/dam/assets/180523145816-steve-bannon-05-22-2018-super-tease.jpg","og_image_width":"1100","og_image_height":"619"}} []
[2018-07-09 12:02:25] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-07-09 12:02:25] graby.DEBUG: . looking for site config for edition.cnn.com in primary folder {"host":"edition.cnn.com"} []
[2018-07-09 12:02:25] graby.DEBUG: ... found site config edition.cnn.com.txt {"host":"edition.cnn.com.txt"} []
[2018-07-09 12:02:25] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 12:02:25] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 12:02:25] graby.DEBUG: ... site config for global already loaded in this request {"host":"global"} []
[2018-07-09 12:02:25] graby.DEBUG: Cached site config with key: edition.cnn.com {"key":"edition.cnn.com"} []
[2018-07-09 12:02:25] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 12:02:25] graby.DEBUG: ... site config for global already loaded in this request {"host":"global"} []
[2018-07-09 12:02:25] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 12:02:25] graby.DEBUG: Cached site config with key: edition.cnn.com.merged {"key":"edition.cnn.com.merged"} []
[2018-07-09 12:02:25] graby.DEBUG: No "single_page_link" config found [] []
[2018-07-09 12:02:25] graby.DEBUG: Attempting to extract content [] []
[2018-07-09 12:02:25] graby.DEBUG: Returning cached and merged site config for edition.cnn.com {"host":"edition.cnn.com"} []
[2018-07-09 12:02:25] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-07-09 12:02:25] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying highlights to strip element {"string":"highlights"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //section[contains(@class, 'body-text')] for body (content length: 232) {"pattern":"//section[contains(@class, 'body-text')]","content_length":232} []
[2018-07-09 12:02:25] graby.DEBUG: Using Readability [] []
[2018-07-09 12:02:25] graby.DEBUG: Detected title:  {"title":""} []
[2018-07-09 12:02:25] graby.DEBUG: Trying again without tidy [] []
[2018-07-09 12:02:25] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-07-09 12:02:25] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying highlights to strip element {"string":"highlights"} []
[2018-07-09 12:02:25] graby.DEBUG: Trying //section[contains(@class, 'body-text')] for body (content length: 154) {"pattern":"//section[contains(@class, 'body-text')]","content_length":154} []
[2018-07-09 12:02:25] graby.DEBUG: Using Readability [] []
[2018-07-09 12:02:25] graby.DEBUG: Detected title:  {"title":""} []
[2018-07-09 12:02:25] graby.DEBUG: Success ?  {"is_success":false} []
[2018-07-09 12:02:25] graby.DEBUG: Extract failed [] []
[2018-07-09 12:02:25] app.DEBUG: DownloadImagesSubscriber: disabled. [] []
[2018-07-09 12:02:25] security.DEBUG: Stored the security token in the session. {"key":"_security_secured_area"} []
----------------------
[2018-07-09 11:59:11] app.DEBUG: Restricted access config enabled? {"enabled":0} []
[2018-07-09 11:59:11] graby.DEBUG: Graby is ready to fetch [] []
[2018-07-09 11:59:11] graby.DEBUG: . looking for site config for cnn.com in primary folder {"host":"cnn.com"} []
[2018-07-09 11:59:11] graby.DEBUG: ... found site config cnn.com.txt {"host":"cnn.com.txt"} []
[2018-07-09 11:59:11] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 11:59:11] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 11:59:11] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-07-09 11:59:11] graby.DEBUG: Cached site config with key: cnn.com {"key":"cnn.com"} []
[2018-07-09 11:59:11] graby.DEBUG: . looking for site config for global in primary folder {"host":"global"} []
[2018-07-09 11:59:11] graby.DEBUG: ... found site config global.txt {"host":"global.txt"} []
[2018-07-09 11:59:11] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-07-09 11:59:11] graby.DEBUG: Cached site config with key: global {"key":"global"} []
[2018-07-09 11:59:11] graby.DEBUG: Cached site config with key: cnn.com.merged {"key":"cnn.com.merged"} []
[2018-07-09 11:59:11] graby.DEBUG: Fetching url: https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html {"url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"} []
[2018-07-09 11:59:11] graby.DEBUG: Trying using method "get" on url "https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html" {"method":"get","url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"} []
[2018-07-09 11:59:11] graby.DEBUG: Use default user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2" for url "https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"} []
[2018-07-09 11:59:11] graby.DEBUG: Use default referer "http://www.google.co.uk/url?sa=t&source=web&cd=1" for url "https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"} []
[2018-07-09 11:59:12] graby.DEBUG: Data fetched: [array] {"data":{"effective_url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html","body":"(only length for debug): 2144783","headers":"text/html; charset=utf-8","all_headers":{"content-type":"text/html; charset=utf-8","x-servedbyhost":"::ffff:172.17.93.27","access-control-allow-origin":"*","cache-control":"max-age=60","content-security-policy":"default-src 'self' blob: https://*.cnn.com:* http://*.cnn.com:* *.cnn.io:* *.cnn.net:* *.turner.com:* *.turner.io:* *.ugdturner.com:* courageousstudio.com *.vgtf.net:*; script-src 'unsafe-eval' 'unsafe-inline' 'self' *; style-src 'unsafe-inline' 'self' blob: *; child-src 'self' blob: *; frame-src 'self' *; object-src 'self' *; img-src 'self' data: blob: *; media-src 'self' data: blob: *; font-src 'self' data: *; connect-src 'self' *; frame-ancestors 'self' https://*.cnn.com:* http://*.cnn.com https://*.cnn.io:* http://*.cnn.io:* *.turner.com:* courageousstudio.com;","x-content-type-options":"nosniff","x-xss-protection":"1; mode=block","via":"1.1 varnish, 1.1 varnish","content-length":"2145284","accept-ranges":"bytes","date":"Mon, 09 Jul 2018 16:59:12 GMT","age":"188","connection":"keep-alive","set-cookie":"countryCode=US; Domain=.cnn.com; Path=/, geoData=middletown|NY|10941|US|NA; Domain=.cnn.com; Path=/, tryThing00=0732; Domain=.cnn.com; Path=/; Expires=Mon Jul 01 2019 00:00:00 GMT, tryThing01=3094; Domain=.cnn.com; Path=/; Expires=Fri Mar 01 2019 00:00:00 GMT, tryThing02=2413; Domain=.cnn.com; Path=/; Expires=Wed Jan 01 2020 00:00:00 GMT","x-served-by":"cache-iad2150-IAD, cache-jfk8148-JFK","x-cache":"HIT, HIT","x-cache-hits":"2, 1","x-timer":"S1531155552.081472,VS0,VE4","vary":"Accept-Encoding, Fastly-SSL"},"status":200}} []
[2018-07-09 11:59:12] graby.DEBUG: Treating as UTF-8 {"encoding":"utf-8"} []
[2018-07-09 11:59:13] graby.DEBUG: Opengraph data: [array] {"ogData":{"og_pubdate":"2018-07-09T05:56:08Z","og_url":"https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html","og_title":"Thai cave rescue suspended for the day after four more boys freed","og_description":"The second day of rescue operations at the cave site in northern Thailand has ended after four more boys were brought out of the flooded cave system Monday.","og_site_name":"CNN","og_type":"article","og_image":"https://cdn.cnn.com/cnnnext/dam/assets/180709063655-01-thai-cave-fifth-boy-rescue-0709-super-tease.jpg","og_image_width":"1100","og_image_height":"619"}} []
[2018-07-09 11:59:13] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-07-09 11:59:13] graby.DEBUG: Returning cached and merged site config for cnn.com {"host":"cnn.com"} []
[2018-07-09 11:59:13] graby.DEBUG: No "single_page_link" config found [] []
[2018-07-09 11:59:13] graby.DEBUG: Attempting to extract content [] []
[2018-07-09 11:59:13] graby.DEBUG: Returning cached and merged site config for cnn.com {"host":"cnn.com"} []
[2018-07-09 11:59:13] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-07-09 11:59:13] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying highlights to strip element {"string":"highlights"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //section[contains(@class, ' body-text ')] for body (content length: 232) {"pattern":"//section[contains(@class, ' body-text ')]","content_length":232} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //section[contains(@class, ' l-container ')] for body (content length: 232) {"pattern":"//section[contains(@class, ' l-container ')]","content_length":232} []
[2018-07-09 11:59:13] graby.DEBUG: Using Readability [] []
[2018-07-09 11:59:13] graby.DEBUG: Detected title:  {"title":""} []
[2018-07-09 11:59:13] graby.DEBUG: Trying again without tidy [] []
[2018-07-09 11:59:13] graby.DEBUG: Strings replaced: 0 (find_string and/or replace_string) {"count":0} []
[2018-07-09 11:59:13] graby.DEBUG: Attempting to parse HTML with libxml {"parser":"libxml"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@property="og:title"]/@content for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@property="article:published_time"]/@content for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //html[@lang]/@lang for language {"pattern":"//html[@lang]/@lang"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //meta[@name="DC.language"]/@content for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying highlights to strip element {"string":"highlights"} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //section[contains(@class, ' body-text ')] for body (content length: 154) {"pattern":"//section[contains(@class, ' body-text ')]","content_length":154} []
[2018-07-09 11:59:13] graby.DEBUG: Trying //section[contains(@class, ' l-container ')] for body (content length: 154) {"pattern":"//section[contains(@class, ' l-container ')]","content_length":154} []
[2018-07-09 11:59:13] graby.DEBUG: Using Readability [] []
[2018-07-09 11:59:13] graby.DEBUG: Detected title:  {"title":""} []
[2018-07-09 11:59:13] graby.DEBUG: Success ?  {"is_success":false} []
[2018-07-09 11:59:13] graby.DEBUG: Extract failed [] []
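
To compare the body attempts across runs, the content lengths can be pulled out of the debug output. This is a minimal Python sketch (graby itself is PHP); the regex assumes the message format shown in the logs above, the sample lines are trimmed copies of those logs, and `body_attempts` is a made-up helper name, not part of graby.

```python
import re

# Matches graby's "Trying <xpath> for body (content length: N)" debug message.
# The format is assumed from the log lines pasted in this issue.
BODY_ATTEMPT = re.compile(
    r"Trying (?P<xpath>.+?) for body \(content length: (?P<length>\d+)\)"
)

# Two lines from the logs above, with the trailing JSON context trimmed.
SAMPLE_LOG = """\
[2018-07-09 12:02:25] graby.DEBUG: Trying highlights to strip element
[2018-07-09 12:02:25] graby.DEBUG: Trying //section[contains(@class, 'body-text')] for body (content length: 232)
[2018-07-09 12:02:25] graby.DEBUG: Trying //section[contains(@class, 'body-text')] for body (content length: 154)
"""


def body_attempts(log_text):
    """Return (xpath, content_length) pairs for every body extraction attempt."""
    attempts = []
    for line in log_text.splitlines():
        match = BODY_ATTEMPT.search(line)
        if match:
            attempts.append((match.group("xpath"), int(match.group("length"))))
    return attempts


attempts = body_attempts(SAMPLE_LOG)
for xpath, length in attempts:
    print(f"{length:>6}  {xpath}")
```

A body of a couple of hundred characters out of a roughly 2 MB response is the symptom described at the top of this issue.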

j0k3r commented 6 years ago

My guess (after a few tests) is that the HTML from CNN is now a real piece of shit with too many styles & scripts inlined (just check the source yourself, it's really ugly). The parser can't properly parse the HTML, which means we then can't extract data from it.

4oo4 commented 6 years ago

Hmm, their source is super ugly but I don't remember it looking very different from a couple of months ago when looking at the issue of it redirecting to an unsupported browser page. What's weird is that when I run it with f43.me, it seems to get the title and other <meta> tags OK, but then they don't show up in the final result. Based on that I was wondering if it might be some kind of redirect again (possibly to the unsupported browser page like before).

{
    "og_pubdate": "2018-08-26T18:20:59Z",
    "og_url": "https://www.cnn.com/2018/08/26/us/jacksonville-madden-shooting/index.html",
    "og_title": "Mass shooting at video game tournament in Jacksonville leaves multiple dead",
    "og_description": "Multiple people were killed in a shooting during a video game tournament at a shopping and dining complex in downtown Jacksonville, Florida, the Jacksonville Sheriff's Office said Sunday afternoon.",
    "og_site_name": "CNN",
    "og_type": "article",
    "og_image": "https://cdn.cnn.com/cnnnext/dam/assets/180826145832-01-jacksonville-shooting-0826-super-tease.jpg",
    "og_image_width": "1100",
    "og_image_height": "619"
}

4oo4 commented 6 years ago

Actually just noticed something, they seem to start working with f43.me when I switch the parser to 'External'. It's too bad there's no source code for the Mercury parser, it would be interesting to see what that's doing differently.

Is there any way to make graby dump what it actually parsed on a failure? I've spent far too much time on this (don't even really read CNN except for breaking news lol), but I'm frustrated that it went back to not working after getting fixed with 15aa9c6519124c748aaa1f5f8c7fd569f1823c73. Before, I could see the unsupported browser page URL in the debug logs. I really want to know what happens between the point where it appears to parse the OpenGraph data correctly and the point where it fails to come up with anything.

j0k3r commented 6 years ago

The problem seems to come from Readability. It has pre-filters that hard-remove code from the HTML page, like style & script tags: https://github.com/j0k3r/php-readability/blob/master/src/Readability.php#L122

And it seems that removing the style tags (which are way too heavy on CNN) removes the whole page. That's why nothing comes out of graby. The pre_filters parameters are defined globally, not on a per-site_config basis.
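
As an illustration only, here is one way a regex pre-filter can take the article with it. The sketch is in Python rather than PHP, and both the patterns and the sample HTML are made up for the example; they are not copied from php-readability, and I have not verified that this is the failure mode CNN triggers. In PHP another possibility is `preg_replace` returning `null` when PCRE hits its backtrack limit on a very large page, which would also leave nothing to extract.

```python
import re

# Hypothetical patterns for illustration; not the ones used by php-readability.
# The greedy version spans from the first <style> to the LAST </style>.
GREEDY_STYLE = re.compile(r"<style[^>]*>.*</style>", re.IGNORECASE | re.DOTALL)
# The lazy version stops at the nearest </style>, removing each block separately.
LAZY_STYLE = re.compile(r"<style[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)

# Made-up page with style blocks both before and after the article body.
SAMPLE_HTML = (
    "<html><head><style>.a{color:red}</style></head>"
    "<body><section class='body-text'>Article text</section>"
    "<style>.b{color:blue}</style></body></html>"
)


def strip_styles(html, pattern):
    """Remove everything the given style pattern matches."""
    return pattern.sub("", html)


greedy_result = strip_styles(SAMPLE_HTML, GREEDY_STYLE)
lazy_result = strip_styles(SAMPLE_HTML, LAZY_STYLE)

print(len(SAMPLE_HTML), len(greedy_result), len(lazy_result))
print("article survives greedy filter:", "Article text" in greedy_result)
print("article survives lazy filter:", "Article text" in lazy_result)
```

Comparing the length of the document before and after each filter is a cheap way to spot a filter that removes far more than it should.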

4oo4 commented 6 years ago

Argh, I just remembered that they have m.cnn.com. The source there is way cleaner and is parsable. Instead of messing with Readability filters I can just use those URLs.

Thank you for taking the time to look!
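
For anyone wanting to try the same workaround, the host swap could be sketched like this in Python. It assumes that m.cnn.com serves articles under the same paths as www.cnn.com and edition.cnn.com, which I have not checked for every URL; `to_mobile_url` and the host list are hypothetical names for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts to rewrite; assumed from the URLs seen in the logs in this issue.
CNN_HOSTS = {"cnn.com", "www.cnn.com", "edition.cnn.com"}
MOBILE_HOST = "m.cnn.com"


def to_mobile_url(url):
    """Point a CNN article URL at the mobile host, leaving other URLs untouched."""
    parts = urlsplit(url)
    if parts.hostname in CNN_HOSTS:
        # Assumption: the path and query are valid unchanged on the mobile host.
        parts = parts._replace(netloc=MOBILE_HOST)
    return urlunsplit(parts)


print(to_mobile_url("https://www.cnn.com/2018/07/09/asia/thai-cave-rescue-intl/index.html"))
```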