gottfrois / link_thumbnailer

Ruby gem that fetches images and metadata from a given URL, much like the link previews on popular social websites.
MIT License

Redirects on nytimes.com bouncing back and forth to nytimes.com/glogin ? #114

Open clairity opened 7 years ago

clairity commented 7 years ago

in trying to get nytimes.com links working with link_thumbnailer, i stumbled onto a redirect issue that seems to have been addressed before but doesn't have enough context for me to understand how to work around now. if i go to an article via link_thumbnailer:

pry> thumb = LinkThumbnailer.generate('https://www.nytimes.com/2017/03/29/us/politics/senate-intelligence-committee-burr-warner-russia-investigation.html')

i get a LinkThumbnailer::RedirectLimit: LinkThumbnailer::RedirectLimit error. if i raise the redirect_limit to 15, i get a different (but still unwanted) response:

pry> thumb = LinkThumbnailer.generate('https://www.nytimes.com/2017/03/29/us/politics/senate-intelligence-committee-burr-warner-russia-investigation.html', redirect_limit: 15)

ETHON: started MULTI
ETHON:         performed EASY effective_url=https://myaccount.nytimes.com/img/nyt-logo-379x64.svg response_code=200 return_code=write_error total_time=0.213284
ETHON: performed MULTI
#<LinkThumbnailer::Models::Website:0x007fc35a603130 @images=[#<LinkThumbnailer::Models::Image:0x007fc3608608a0 @src=#<URI::HTTPS https://myaccount.nytimes.com/img/nyt-logo-379x64.svg>, @size=[], @type=:svg>], @videos=[], @url=#<URI::HTTPS https://myaccount.nytimes.com/auth/login?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F03%2F29%2Fus%2Fpolitics%2Fsenate-intelligence-committee-burr-warner-russia-investigation.html%3F_r%3D5&REFUSE_COOKIE_ERROR=SHOW_ERROR>, @title="Log In - New York Times", @description="Don't have an account? Sign up here »", @favicon="">

so it seems that nytimes.com catches the redirect loop and breaks out of it by going to myaccount.nytimes.com. if i curl the same url, i get

$ curl -v https://www.nytimes.com/2017/03/29/us/politics/senate-intelligence-committee-burr-warner-russia-investigation.html

*   Trying 151.101.25.164...
* Connected to www.nytimes.com (151.101.25.164) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: nytimes.com
* Server certificate: COMODO RSA Organization Validation Secure Server CA
* Server certificate: COMODO RSA Certification Authority
* Server certificate: AddTrust External CA Root
> GET /2017/03/29/us/politics/senate-intelligence-committee-burr-warner-russia-investigation.html HTTP/1.1
> Host: www.nytimes.com
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 303 See Other
< Server: Varnish
< Retry-After: 0
< Content-Length: 0
< Location: https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F03%2F29%2Fus%2Fpolitics%2Fsenate-intelligence-committee-burr-warner-russia-investigation.html%3F_r%3D0
< Accept-Ranges: bytes
< Date: Wed, 29 Mar 2017 22:25:24 GMT
< X-Frame-Options: DENY
< Set-Cookie: nyt-a=3916f8c6471bf6e1447af0833e4a599a663ff20969239265f8f5aaa3fd7ad255; Expires=Thu, 29 Mar 2018 22:25:24 GMT; Path=/; Domain=.nytimes.com
< Connection: close
< X-API-Version: F-0
< X-PageType: article
< Content-Security-Policy: default-src data: 'unsafe-inline' 'unsafe-eval' https:; script-src data: 'unsafe-inline' 'unsafe-eval' https: blob:; style-src data: 'unsafe-inline' https:; img-src data: https: blob:; font-src data: https:; connect-src https: wss:; media-src https: blob:; object-src https:; child-src https: data: blob:; form-action https:; block-all-mixed-content;
< X-Served-By: cache-lax8634-LAX
< X-Cache: HIT
< X-Cache-Hits: 0
< X-Timer: S1490826324.077466,VS0,VE0
<
* Closing connection 0

which implies i should follow the provided nytimes.com/glogin url, so if i curl that, i get

$ curl -v https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F03%2F29%2Fus%2Fpolitics%2Fsenate-intelligence-committee-burr-warner-russia-investigation.html%3F_r%3D0

*   Trying 151.101.25.164...
* Connected to www.nytimes.com (151.101.25.164) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: nytimes.com
* Server certificate: COMODO RSA Organization Validation Secure Server CA
* Server certificate: COMODO RSA Certification Authority
* Server certificate: AddTrust External CA Root
> GET /glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F03%2F29%2Fus%2Fpolitics%2Fsenate-intelligence-committee-burr-warner-russia-investigation.html%3F_r%3D0 HTTP/1.1
> Host: www.nytimes.com
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 302 Found
< Server: Apache
< Set-Cookie: NYT-S=0MlYPbVt86VMLDXrmvxADeHyIs.xRJwQzddeFz9JchiAIUFL2BEX5FWcV.Ynx4rkFI; expires=Fri, 28-Apr-2017 22:25:35 GMT; path=/; domain=.nytimes.com
< Set-Cookie: NYT-BCET=1493418335%7CUD9ePFkvdXpfhpUrb1QP%2FzMtX0g%3D%7CN%3B_%7CDoPsx4k67W1RHGHAV3rEK3dJPGSAeQmPWIjVFZUFZ%2Fg%3D; expires=Mon, 25-Sep-2017 22:25:35 GMT; path=/; domain=.nytimes.com; httponly
< Location: https://www.nytimes.com/2017/03/29/us/politics/senate-intelligence-committee-burr-warner-russia-investigation.html?_r=0
< Content-Type: text/html; charset=UTF-8
< X-Origin-Time: 2017-03-29 18:25:35 EDT
< Content-Length: 0
< Accept-Ranges: bytes
< Date: Wed, 29 Mar 2017 22:25:35 GMT
< X-Frame-Options: DENY
< Connection: close
< X-API-Version: F-X
< X-PageType: legacy
< Content-Security-Policy: default-src data: 'unsafe-inline' 'unsafe-eval' https:; script-src data: 'unsafe-inline' 'unsafe-eval' https: blob:; style-src data: 'unsafe-inline' https:; img-src data: https: blob:; font-src data: https:; connect-src https: wss:; media-src https: blob:; object-src https:; child-src https: data: blob:; form-action https:; block-all-mixed-content;
< X-Served-By: cache-lax8621-LAX
< X-Cache: MISS
< X-Cache-Hits: 0
< X-Timer: S1490826335.186540,VS0,VE38
< Vary: Fastly-SSL
<
* Closing connection 0

it leads me back to the original url, so i try curling that, but the response is the same 303 that the first curl returned. this seems like the redirect loop that link_thumbnailer is getting caught in. from reading previous issues here, a missing cookie might be the problem, but i'm not familiar enough to know if that's true or, if it is, how to fix it.
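The two curl steps above can be reproduced from Ruby to see the bounce in one place. This is a minimal sketch using only the standard library's Net::HTTP, not anything from link_thumbnailer. The names `plain_get` and `trace_redirects` and the `fetch:` parameter are made up for illustration, and no cookies are sent, so it should show the same behaviour as plain curl.

```ruby
require "net/http"
require "uri"

# Issue one GET without following redirects; returns a Net::HTTPResponse.
def plain_get(uri)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(Net::HTTP::Get.new(uri))
  end
end

# Record each hop as [status_code, url] until a non-redirect response
# arrives or `limit` requests have been made. `fetch` is a hypothetical
# injection point so the loop can be exercised without network access.
def trace_redirects(url, limit: 6, fetch: method(:plain_get))
  uri = URI(url)
  hops = []
  limit.times do
    response = fetch.call(uri)
    status = response.code.to_i
    hops << [status, uri.to_s]
    location = response["location"]
    break unless status.between?(300, 399) && location
    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, location)
  end
  hops
end
```

Calling `trace_redirects(article_url).each { |status, url| puts "#{status} #{url}" }` against the article should print alternating article and /glogin lines if the loop is still there. If the list ends in a 200 instead, the loop is not reproducible without cookies and the cause lies elsewhere.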

is this a bug you are already addressing? and are there any workarounds i can use now to get it working?

thanks!

gottfrois commented 7 years ago

This issue should be fixed in the latest gem version (anything later than 2.5.1, at least). The site was telling the agent to write a cookie, but the gem was not following that directive. I would need to double-check that this mechanism still works as expected.
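To illustrate the mechanism described here, the sketch below follows redirects while storing each Set-Cookie value and sending it back on the next request. It uses plain Net::HTTP and is not the gem's implementation. The names `get_with_cookies` and `follow_with_cookies` and the `fetch:` parameter are hypothetical. Cookie attributes such as Domain, Path and Expires are ignored, which a real cookie jar would not do.

```ruby
require "net/http"
require "uri"

# One GET that sends the given Cookie header (if any) and does not
# follow redirects itself.
def get_with_cookies(uri, cookie_header)
  request = Net::HTTP::Get.new(uri)
  request["Cookie"] = cookie_header unless cookie_header.empty?
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
end

# Follow redirects, remembering every Set-Cookie name/value pair and
# replaying them on the next request. Returns [final_uri, response].
# `fetch` is a hypothetical injection point for testing without a network.
def follow_with_cookies(url, limit: 5, fetch: method(:get_with_cookies))
  uri = URI(url)
  cookies = {}
  limit.times do
    header = cookies.map { |name, value| "#{name}=#{value}" }.join("; ")
    response = fetch.call(uri, header)
    (response.get_fields("set-cookie") || []).each do |line|
      next if line.strip.empty?
      # Keep only "name=value"; attributes after the first ";" are dropped.
      name, value = line.split(";", 2).first.split("=", 2)
      cookies[name.strip] = value.to_s.strip
    end
    location = response["location"]
    return [uri, response] unless response.code.to_i.between?(300, 399) && location
    uri = URI.join(uri.to_s, location)
  end
  raise "redirect limit of #{limit} reached"
end
```

If the site only needs the cookies from /glogin to be sent back, `follow_with_cookies(article_url)` should end on the article rather than hitting the limit. If it still loops, the missing cookie is probably not the whole story.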

clairity commented 7 years ago

thanks!

i was using the latest v3.3.0 and even tried tracking master directly in my Gemfile, but got the same results. i'm also on the latest ruby 2.4.1 and rails 5.1.0.rc1 if that matters.

i just tried downgrading to ruby 2.3.3 and rails 5.0.2, but that didn't help either.