j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Error grabbing content - science.org - 302 Status #266

techexo closed this issue 2 years ago

techexo commented 2 years ago

Hello all, I am running into an issue with the new science.org website. They recently changed their website engine and, unfortunately, grabbing articles has not worked since then. I have created a small site config file:

title: //h1[@class='news-article__hero__title']
date: //span[@class='news-article__hero__date']
body: //article[@class='news-article-content']

test_url: https://www.science.org/content/blog-post/hiring-away

But when I try to fetch the article with my "graby tester", I cannot retrieve the content, and the result comes back with status 302 (redirection):

array (
  'status' => 302,
  'html' => '[unable to retrieve full-text content]',
  'title' => 'No title found',
  'language' => NULL,
  'date' => NULL,
  'authors' => 
  array (
  ),
  'url' => 'https://www.science.org/content/blog-post/hiring-away?cookieSet=1',
  'image' => NULL,
  'native_ad' => false,
  'headers' => 
  array (
    'date' => 'Mon, 20 Sep 2021 07:50:41 GMT',
    'content-type' => 'text/html; charset=utf-8',
    'transfer-encoding' => 'chunked',
    'connection' => 'keep-alive',
    'cache-control' => 'private',
    'x-xss-protection' => '1; mode=block',
    'x-content-type-options' => 'nosniff',
    'strict-transport-security' => 'max-age=0; includeSubDomains',
    'x-frame-options' => 'SAMEORIGIN',
    'location' => 'https://www.science.org/content/blog-post/hiring-away',
    'cf-cache-status' => 'DYNAMIC',
    'expect-ct' => 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
    'server' => 'cloudflare',
    'cf-ray' => '691976db6afc3312-CDG',
    'alt-svc' => 'h3=":443"; ma=86400, h3-29=":443"; ma=86400, h3-28=":443"; ma=86400, h3-27=":443"; ma=86400',
  ),
  'summary' => '[unable to retrieve full-text content]',
)

What's weird is that I have a "graby tester" that has not been updated since 2018, and with that one I can retrieve the content of the article with status 200. Has the code perhaps been changed so that it no longer follows redirections? Thanks in advance. I am available to discuss this further in French if needed, in a chat room of your choice.

techexo commented 2 years ago

I looked at a diff between the two versions of Graby.php, which I'm not competent to interpret any further: [image]

Maybe the "old" one was able, or configured, to follow redirections while the new one is not?

j0k3r commented 2 years ago

The problem was related to a cookie required by science.org
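
For anyone hitting the same symptom, here is a minimal Python sketch of how a cookie requirement can end in a 302. It is neither Graby's code nor science.org's actual logic: the local test server, the `seen=1` cookie, and the names `CookieGateHandler`, `fetch_status`, and `demo` are all hypothetical. The server sends a cookieless client to `?cookieSet=1` while setting a cookie, then bounces it back to the clean URL if the cookie is not returned, which is consistent with the `url` and `location` values in the dump above. A client that discards cookies gives up on a 302, while one that keeps a cookie jar should get the page.

```python
import http.cookiejar
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that only serves clients returning a cookie."""

    def do_GET(self):
        clean_path = self.path.split("?")[0]
        if "seen=1" in self.headers.get("Cookie", ""):
            # Cookie came back: serve the article.
            body = b"<article>full text</article>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        elif self.path.endswith("?cookieSet=1"):
            # Cookie was offered but not returned: bounce to the clean URL.
            self._redirect(clean_path, set_cookie=False)
        else:
            # First visit: offer the cookie and redirect.
            self._redirect(clean_path + "?cookieSet=1", set_cookie=True)

    def _redirect(self, location, set_cookie):
        self.send_response(302)
        self.send_header("Location", location)
        if set_cookie:
            self.send_header("Set-Cookie", "seen=1; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_status(url, keep_cookies):
    """Return the final HTTP status, following redirects with or without a cookie jar."""
    handlers = [urllib.request.ProxyHandler({})]  # ignore any proxy settings
    if keep_cookies:
        jar = http.cookiejar.CookieJar()
        handlers.append(urllib.request.HTTPCookieProcessor(jar))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises once it detects the redirect loop; the code is the last status seen.
        return error.code


def demo():
    """Start the test server and fetch the same URL without and with cookies."""
    server = HTTPServer(("127.0.0.1", 0), CookieGateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/article"
    try:
        return fetch_status(url, keep_cookies=False), fetch_status(url, keep_cookies=True)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

`demo()` returns the two status codes in order, first without and then with a cookie jar. This would explain why the symptom looks like redirects not being followed: the redirects are followed, but without the cookie they never reach the article.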