laurentj / slimerjs

A scriptable browser like PhantomJS, based on Firefox
http://slimerjs.org

Extraction of headers and status code using onResourceReceived #144

Closed CMCDragonkai closed 10 years ago

CMCDragonkai commented 10 years ago

While working with onResourceReceived on both Windows and Linux, I discovered this issue.

Basically if you use onResourceReceived to get the headers and status code of the resource, there's a discrepancy between Linux and Windows when the resource being retrieved redirects you to another resource.

For example requesting http://google.com redirects you to http://www.google.com then to http://www.google.com?queryparam....

Once the redirection is resolved, the status code of that resource is obviously 200. You can try this by using the HTTPie client and just requesting http://google.com and then following the redirections.

Using onResourceReceived is the main way to get the status code and headers inside SlimerJS.

I was logging out the headers and status code inside onResourceReceived. Each resource object has an ID.

page.onResourceReceived = function (resource) {
    // Log the id, status code and headers of every received resource
    console.log(resource.id);
    console.log(resource.status);
    console.log(JSON.stringify(resource.headers));
};
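A minimal sketch of how the same handler could keep one record per resource id instead of only logging, assuming the callback can fire more than once per resource (the resource objects logged in this thread carry a "stage" field). The names recordResponse and log are hypothetical and not part of SlimerJS.

```javascript
// Hypothetical helper: keeps only the final ("end") event for each resource id.
function recordResponse(log, resource) {
    // Assumption: a stage other than "end" is an intermediate event, so skip it.
    if (resource.stage && resource.stage !== "end") {
        return log;
    }
    log[resource.id] = {
        url: resource.url,
        status: resource.status,
        headerCount: (resource.headers || []).length
    };
    return log;
}

// Sample events shaped like the resource objects logged in this thread.
var log = {};
recordResponse(log, { id: 1, url: "http://google.com/", stage: "end", status: 301, headers: [] });
recordResponse(log, { id: 2, url: "http://www.google.com/", stage: "start", status: 302, headers: [] });
recordResponse(log, { id: 2, url: "http://www.google.com/", stage: "end", status: 302, headers: [] });
console.log(JSON.stringify(log));
```

Inside SlimerJS this would presumably be wired up as page.onResourceReceived = function (resource) { recordResponse(log, resource); };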

For the chain of 3 requests (http://google.com, then http://www.google.com, then http://www.google.com/?queryparm...):

On Windows, it shows:

1
200
[{"name":"Date","value":"Tue, 21 Jan 2014 12:14:39 GMT"},{"name":"Expires","value":"-1"},{"name":"Cache-Control","value":"private, max-age=0"},{"name":"Content-Type","value":"text/html; charset=ISO-8859-1"},{"name":"Set-Cookie","value":"PREF=ID=32b5fac854df7f9d:FF=0:TM=1390306479:LM=1390306479:S=Hp6SNbeFqx-EInap; expires=Thu, 21-Jan-2016 12:14:39 GMT; path=/; domain=.google.com.au"},{"name":"Set-Cookie","value":"NID=67=LvjZbLCWTSAuRqXDIbxvsnlx3YW82ehp0J1OSfGSHcwlr5r3lufTmSnqlGUZADZKx3ELJr-FWynx0TvGCumnMPQqMHqsoQe76fpHbIIjLmqVN6pQFnldg_LoR4FZXHgF; expires=Wed, 23-Jul-2014 12:14:39 GMT; path=/; domain=.google.com.au; HttpOnly"},{"name":"P3P","value":"CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\""},{"name":"Server","value":"gws"},{"name":"X-XSS-Protection","value":"1; mode=block"},{"name":"X-Frame-Options","value":"SAMEORIGIN"},{"name":"Alternate-Protocol","value":"80:quic"},{"name":"Transfer-Encoding","value":"chunked"}]

2
200
[{"name":"Date","value":"Tue, 21 Jan 2014 12:14:39 GMT"},{"name":"Expires","value":"-1"},{"name":"Cache-Control","value":"private, max-age=0"},{"name":"Content-Type","value":"text/html; charset=ISO-8859-1"},{"name":"Set-Cookie","value":"PREF=ID=32b5fac854df7f9d:FF=0:TM=1390306479:LM=1390306479:S=Hp6SNbeFqx-EInap; expires=Thu, 21-Jan-2016 12:14:39 GMT; path=/; domain=.google.com.au"},{"name":"Set-Cookie","value":"NID=67=LvjZbLCWTSAuRqXDIbxvsnlx3YW82ehp0J1OSfGSHcwlr5r3lufTmSnqlGUZADZKx3ELJr-FWynx0TvGCumnMPQqMHqsoQe76fpHbIIjLmqVN6pQFnldg_LoR4FZXHgF; expires=Wed, 23-Jul-2014 12:14:39 GMT; path=/; domain=.google.com.au; HttpOnly"},{"name":"P3P","value":"CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\""},{"name":"Server","value":"gws"},{"name":"X-XSS-Protection","value":"1; mode=block"},{"name":"X-Frame-Options","value":"SAMEORIGIN"},{"name":"Alternate-Protocol","value":"80:quic"},{"name":"Transfer-Encoding","value":"chunked"}]

3
200
[{"name":"Date","value":"Tue, 21 Jan 2014 12:14:39 GMT"},{"name":"Expires","value":"-1"},{"name":"Cache-Control","value":"private, max-age=0"},{"name":"Content-Type","value":"text/html; charset=ISO-8859-1"},{"name":"Set-Cookie","value":"PREF=ID=32b5fac854df7f9d:FF=0:TM=1390306479:LM=1390306479:S=Hp6SNbeFqx-EInap; expires=Thu, 21-Jan-2016 12:14:39 GMT; path=/; domain=.google.com.au"},{"name":"Set-Cookie","value":"NID=67=LvjZbLCWTSAuRqXDIbxvsnlx3YW82ehp0J1OSfGSHcwlr5r3lufTmSnqlGUZADZKx3ELJr-FWynx0TvGCumnMPQqMHqsoQe76fpHbIIjLmqVN6pQFnldg_LoR4FZXHgF; expires=Wed, 23-Jul-2014 12:14:39 GMT; path=/; domain=.google.com.au; HttpOnly"},{"name":"P3P","value":"CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\""},{"name":"Server","value":"gws"},{"name":"X-XSS-Protection","value":"1; mode=block"},{"name":"X-Frame-Options","value":"SAMEORIGIN"},{"name":"Alternate-Protocol","value":"80:quic"},{"name":"Transfer-Encoding","value":"chunked"}]

On Ubuntu it shows:

1
301
[]

2
302
[]

3
200
[{"name":"Date","value":"Tue, 21 Jan 2014 12:18:41 GMT"},{"name":"Expires","value":"-1"},{"name":"Cache-Control","value":"private, max-age=0"},{"name":"Content-Type","value":"text/html; charset=ISO-8859-1"},{"name":"Set-Cookie","value":"PREF=ID=43c2671276bd1e1a:FF=0:TM=1390306721:LM=1390306721:S=OccCjFtSX7fRx37w; expires=Thu, 21-Jan-2016 12:18:41 GMT; path=/; domain=.google.com.au"},{"name":"Set-Cookie","value":"NID=67=HyB2ghFuIWv5dLYnvYOMiTeKGSfkOrh-Z8JDZaxuvax6xnEjqvrthLHy9-PlYXH7e69ucRywJTuVq0ORocHn5GU6DRI0sxWpIEYPz_2fsumLlT7GzUCrnDyvdlgpVwDE; expires=Wed, 23-Jul-2014 12:18:41 GMT; path=/; domain=.google.com.au; HttpOnly"},{"name":"P3P","value":"CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\""},{"name":"Server","value":"gws"},{"name":"X-XSS-Protection","value":"1; mode=block"},{"name":"X-Frame-Options","value":"SAMEORIGIN"},{"name":"Alternate-Protocol","value":"80:quic"},{"name":"Transfer-Encoding","value":"chunked"}]

As you can see:

  1. Windows uses the last resolved redirected headers and status code for all 3 requests.
  2. Ubuntu does not extract any headers for the first 2 requests.
  3. Ubuntu correctly extracts the status code for all 3 requests.
  4. Ubuntu only extracts both the headers and status code for the last request.

I think both implementations are buggy.

On one hand, Windows does not show the "truth" regarding the headers and status code.

On the other hand, Ubuntu will not give you the headers of requests being redirected.

I think in both cases, the onResourceReceived should show the correct and relevant headers and status code for each resource that is being requested.

laurentj commented 10 years ago

Which version of SlimerJS are you using? Which edition? If it is the standalone edition, which version of Gecko/Firefox are you using? (You can use --debug=true to see all of this information.) Please give me this information for each platform.

CMCDragonkai commented 10 years ago

For Windows it's: SlimerJS, Version=0.9.0rc1, BuildID=20131104

For Ubuntu it's: Version=0.9.0 BuildID=20131211

Both are the standalone edition. I downloaded them from http://download.slimerjs.org/v0.9/

But remember that on Ubuntu I had to install Firefox in case it didn't work. The Firefox version on Windows is 26, but the maximum version for SlimerJS is:

[App]
Vendor=Innophi
Name=SlimerJS
Version=0.9.0rc1
BuildID=20131104
ID=slimerjs@slimerjs.org
Copyright=Copyright 2012-2013 Laurent Jouanneau & Innophi

[Gecko]
MinVersion=17.0.0
MaxVersion=25.*

CMCDragonkai commented 10 years ago

I upgraded my Windows version to 0.9.0. The output has now changed to be exactly the same as the Ubuntu version.

This however is still a problem since neither Windows nor Ubuntu shows the headers for redirected URLs, only the final resolved one.

Basically both Windows and Ubuntu are now like this:

1
301
[]

2
302
[]

3
200
[{"name":"Date","value":"Tue, 21 Jan 2014 12:18:41 GMT"},{"name":"Expires","value":"-1"},{"name":"Cache-Control","value":"private, max-age=0"},{"name":"Content-Type","value":"text/html; charset=ISO-8859-1"},{"name":"Set-Cookie","value":"PREF=ID=43c2671276bd1e1a:FF=0:TM=1390306721:LM=1390306721:S=OccCjFtSX7fRx37w; expires=Thu, 21-Jan-2016 12:18:41 GMT; path=/; domain=.google.com.au"},{"name":"Set-Cookie","value":"NID=67=HyB2ghFuIWv5dLYnvYOMiTeKGSfkOrh-Z8JDZaxuvax6xnEjqvrthLHy9-PlYXH7e69ucRywJTuVq0ORocHn5GU6DRI0sxWpIEYPz_2fsumLlT7GzUCrnDyvdlgpVwDE; expires=Wed, 23-Jul-2014 12:18:41 GMT; path=/; domain=.google.com.au; HttpOnly"},{"name":"P3P","value":"CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\""},{"name":"Server","value":"gws"},{"name":"X-XSS-Protection","value":"1; mode=block"},{"name":"X-Frame-Options","value":"SAMEORIGIN"},{"name":"Alternate-Protocol","value":"80:quic"},{"name":"Transfer-Encoding","value":"chunked"}]

I also logged the resource object during redirects, and this is what you get:

{
    "id":1, 
    "url":"http://google.com/", 
    "time":"2014-01-25T05:32:56.353Z", 
    "headers":[], 
    "bodySize":0, 
    "contentType":null, 
    "contentCharset":"UTF-8", 
    "redirectURL":null, 
    "stage":"end", 
    "status":301, 
    "statusText":"Moved Permanently", 
    "referrer":"", 
    "body":""
}

{
    "id":2, 
    "url":"http://www.google.com/", 
    "time":"2014-01-25T05:32:56.353Z", 
    "headers":[], 
    "bodySize":0, 
    "contentType":null, 
    "contentCharset":"UTF-8", 
    "redirectURL":null, 
    "stage":"end", 
    "status":302, 
    "statusText":"Found", 
    "referrer":"", 
    "body":""
}

There are no headers, no body, and no redirectURL for any response whose status code represents a redirect.
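To make the gap concrete, here is a small sketch of how the redirect target could be read from a resource object if the headers were populated. The helpers getHeader, isRedirect and redirectTarget are hypothetical names, not SlimerJS APIs; the Location value in the second sample is the one a plain curl request to the same URL reports in this comment.

```javascript
// Hypothetical helpers for the header format used by onResourceReceived:
// an array of {name, value} pairs.
function getHeader(headers, name) {
    var list = headers || [];
    var wanted = name.toLowerCase();
    for (var i = 0; i < list.length; i++) {
        if (list[i].name.toLowerCase() === wanted) {
            return list[i].value;
        }
    }
    return null; // header not present
}

function isRedirect(status) {
    // Status codes that normally come with a Location header.
    return [301, 302, 303, 307, 308].indexOf(status) !== -1;
}

function redirectTarget(resource) {
    if (!isRedirect(resource.status)) {
        return null;
    }
    // Prefer redirectURL when it is filled, otherwise fall back to the header.
    return resource.redirectURL || getHeader(resource.headers, "Location");
}

// What is logged today: nothing to read the target from.
var logged = { id: 1, url: "http://google.com/", headers: [], redirectURL: null, status: 301 };
// What the same response would look like with its headers attached.
var withHeaders = {
    id: 1,
    url: "http://google.com/",
    headers: [{ name: "Location", value: "http://www.google.com/" }],
    redirectURL: null,
    status: 301
};

console.log(redirectTarget(logged));      // null
console.log(redirectTarget(withHeaders)); // http://www.google.com/
```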

These are the headers and content returned by a request to http://google.com using plain curl:

HTTP/1.1 301 Moved Permanently
Alternate-Protocol: 80:quic
Cache-Control: public, max-age=2592000
Content-Length: 219
Content-Type: text/html; charset=UTF-8
Date: Sat, 25 Jan 2014 06:20:59 GMT
Expires: Mon, 24 Feb 2014 06:20:59 GMT
Location: http://www.google.com/
Server: gws
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

CMCDragonkai commented 10 years ago

In the documentation of onResourceReceived it says:

Note about the ``body`` property: by default, the body property is filled only for the resource that corresponds to the main html page. For other resources, it will be empty.

If you want to have it filled for resources used in the page, you have to indicate their content type into captureContent property. This limitation exists to avoid to take memory uselessly (in the case where you don’t need the body property), since resources like images or videos could take many memory.

I understand that the body content is not filled for resources other than the "main" html page, and that it's possible to change that by changing the regular expressions in captureContent.

However, there is no way to make captureContent capture the content of redirected resources. Is there any way to get the content of a redirected resource?

This doesn't work, the redirected responses still have an empty body.

    page.captureContent = [ /.*/ ];

There seems to be some discussion but no further work here: https://github.com/ariya/phantomjs/issues/10158 and https://github.com/laurentj/slimerjs/issues/110

Allowing the capture of HTTP headers and body content (just the plaintext) on redirected responses would allow me to close https://github.com/laurentj/slimerjs/issues/127, as this functionality would remove the need for navigationLocked to work.

hallvors commented 10 years ago

Looks like this will be important for some stuff @seiflotfy and I are working on

CMCDragonkai commented 10 years ago

@hallvors Can you elaborate on your solution?

hallvors commented 10 years ago

We have a script for site compatibility testing. One of its features is detecting backend browser sniffing by comparing the redirects the server generates for different User-Agent headers, so we need to track which redirects the browser is told to follow.
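As a rough illustration of that comparison (not the actual script, which is not shown in this thread), the recorded responses for two User-Agent values could be reduced to their redirect hops and compared. The function names and the example.com data are made up for the sketch.

```javascript
// Status codes treated as redirects in this sketch.
var REDIRECT_CODES = [301, 302, 303, 307, 308];

// Hypothetical helper: reduce recorded responses to "status url" redirect hops.
function redirectChain(resources) {
    return resources
        .filter(function (r) { return REDIRECT_CODES.indexOf(r.status) !== -1; })
        .map(function (r) { return r.status + " " + r.url; });
}

// True when the two runs were redirected differently.
function chainsDiffer(a, b) {
    var chainA = redirectChain(a);
    var chainB = redirectChain(b);
    if (chainA.length !== chainB.length) {
        return true;
    }
    return chainA.some(function (hop, i) { return hop !== chainB[i]; });
}

// Made-up sample data; example.com is a placeholder, not taken from the thread.
var desktopRun = [
    { url: "http://example.com/", status: 301 },
    { url: "http://www.example.com/", status: 200 }
];
var mobileRun = [
    { url: "http://example.com/", status: 302 },
    { url: "http://m.example.com/", status: 200 }
];

console.log(chainsDiffer(desktopRun, mobileRun));  // true
console.log(chainsDiffer(desktopRun, desktopRun)); // false
```

With only the status code available on redirected responses, a comparison like this cannot see where each hop points, which is why the missing headers matter here.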

CMCDragonkai commented 10 years ago

Did you fix the SlimerJS core or write a workaround?

hallvors commented 10 years ago

Neither - the script currently runs with a GTKWebKit backend :-1:, but we want to port it to SlimerJS.

CMCDragonkai commented 9 years ago

@laurentj Has this been resolved? As a workaround, whenever a scrape turned out to be a redirection, I had to curl the URL directly to obtain the correct content and headers.
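A minimal sketch of the detection half of that workaround, assuming the responses have been collected into an array first. The names isRedirectStatus and curlFallbacks are hypothetical; the sketch only builds the curl arguments (-s for silent, -i to include response headers) and leaves out how the command is actually run.

```javascript
function isRedirectStatus(status) {
    return [301, 302, 303, 307, 308].indexOf(status) !== -1;
}

// Hypothetical helper: list the redirected URLs that need a direct curl request.
// curl does not follow redirects unless asked to, so each call returns the
// headers and body of the redirect response itself.
function curlFallbacks(resources) {
    return resources
        .filter(function (r) { return isRedirectStatus(r.status); })
        .map(function (r) {
            return { id: r.id, url: r.url, args: ["-s", "-i", r.url] };
        });
}

// Responses as seen in this thread; the last URL is elided there as well.
var seen = [
    { id: 1, url: "http://google.com/", status: 301 },
    { id: 2, url: "http://www.google.com/", status: 302 },
    { id: 3, url: "http://www.google.com/?queryparm...", status: 200 }
];

var fallbacks = curlFallbacks(seen);
fallbacks.forEach(function (f) {
    console.log("curl " + f.args.join(" "));
});
```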

laurentj commented 9 years ago

@CMCDragonkai Yes, the original issue is resolved (did you see the commit above with unit tests?). I don't understand what you are trying to do, though. If there is a new issue, please open a new issue.

CMCDragonkai commented 9 years ago

Oh, that commit looks like it gets the headers, but I also want the body of the HTTP response during a redirection. Is that possible?