CMCDragonkai closed this issue 10 years ago
Which version of SlimerJS are you using, and which edition? If it is the standalone edition, which version of Gecko/Firefox are you using? (You can use --debug=true to see all this information.) Please give me this information for each platform.
For Windows it's: SlimerJS, Version=0.9.0rc1, BuildID=20131104
For Ubuntu it's: Version=0.9.0 BuildID=20131211
Both are the standalone edition. I downloaded them from http://download.slimerjs.org/v0.9/
On Ubuntu, remember that I also had to install Firefox in case the standalone build didn't work. The Firefox version on Windows is 26, but the maximum Gecko version this SlimerJS build declares is lower:
[App]
Vendor=Innophi
Name=SlimerJS
Version=0.9.0rc1
BuildID=20131104
ID=slimerjs@slimerjs.org
Copyright=Copyright 2012-2013 Laurent Jouanneau & Innophi
[Gecko]
MinVersion=17.0.0
MaxVersion=25.*
I upgraded my Windows version to 0.9.0. The output has now changed to be exactly the same as the Ubuntu version.
However, this is still a problem: neither Windows nor Ubuntu shows the headers for the redirecting responses, only those of the final resolved URL.
Basically both Windows and Ubuntu are now like this:
1
301
[]
2
302
[]
3
200
[{"name":"Date","value":"Tue, 21 Jan 2014 12:18:41 GMT"},{"name":"Expires","value":"-1"},{"name":"Cache-Control","value":"private, max-age=0"},{"name":"Content-Type","value":"text/html; charset=ISO-8859-1"},{"name":"Set-Cookie","value":"PREF=ID=43c2671276bd1e1a:FF=0:TM=1390306721:LM=1390306721:S=OccCjFtSX7fRx37w; expires=Thu, 21-Jan-2016 12:18:41 GMT; path=/; domain=.google.com.au"},{"name":"Set-Cookie","value":"NID=67=HyB2ghFuIWv5dLYnvYOMiTeKGSfkOrh-Z8JDZaxuvax6xnEjqvrthLHy9-PlYXH7e69ucRywJTuVq0ORocHn5GU6DRI0sxWpIEYPz_2fsumLlT7GzUCrnDyvdlgpVwDE; expires=Wed, 23-Jul-2014 12:18:41 GMT; path=/; domain=.google.com.au; HttpOnly"},{"name":"P3P","value":"CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\""},{"name":"Server","value":"gws"},{"name":"X-XSS-Protection","value":"1; mode=block"},{"name":"X-Frame-Options","value":"SAMEORIGIN"},{"name":"Alternate-Protocol","value":"80:quic"},{"name":"Transfer-Encoding","value":"chunked"}]
I also logged the resource object during redirects, and this is what you get:
{
"id":1,
"url":"http://google.com/",
"time":"2014-01-25T05:32:56.353Z",
"headers":[],
"bodySize":0,
"contentType":null,
"contentCharset":"UTF-8",
"redirectURL":null,
"stage":"end",
"status":301,
"statusText":"Moved Permanently",
"referrer":"",
"body":""
}
{
"id":2,
"url":"http://www.google.com/",
"time":"2014-01-25T05:32:56.353Z",
"headers":[],
"bodySize":0,
"contentType":null,
"contentCharset":"UTF-8",
"redirectURL":null,
"stage":"end",
"status":302,
"statusText":"Found",
"referrer":"",
"body":""
}
There are no headers, no body, and no redirectURL for any status code that represents a redirect.
These are the headers and content returned by a request to http://google.com using plain curl:
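To make the missing information concrete, here is a minimal sketch of how a script could look for the redirect target on a resource object shaped like the ones logged above. The helper names (isRedirect, findHeader, redirectTarget) are hypothetical and not part of SlimerJS; the sketch only assumes the `status`, `headers` and `redirectURL` properties that appear in the dump.

```javascript
// Hypothetical helpers, not part of the SlimerJS API.

// Returns true for the status codes that normally carry a Location header.
function isRedirect(status) {
    return [301, 302, 303, 307, 308].indexOf(status) !== -1;
}

// Looks up a header by name (case-insensitive) in a [{name, value}] array,
// which is the shape of the `headers` property in the logged objects.
function findHeader(headers, name) {
    var wanted = name.toLowerCase();
    for (var i = 0; i < headers.length; i++) {
        if (headers[i].name.toLowerCase() === wanted) {
            return headers[i].value;
        }
    }
    return null;
}

// Returns the redirect target if the response exposes one, otherwise null.
function redirectTarget(response) {
    if (!isRedirect(response.status)) {
        return null;
    }
    return response.redirectURL ||
        findHeader(response.headers || [], 'Location');
}
```

With the objects logged above (empty `headers`, `redirectURL` set to null), redirectTarget returns null for both the 301 and the 302, which is exactly the problem: the script has nothing to read the target from.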
HTTP/1.1 301 Moved Permanently
Alternate-Protocol: 80:quic
Cache-Control: public, max-age=2592000
Content-Length: 219
Content-Type: text/html; charset=UTF-8
Date: Sat, 25 Jan 2014 06:20:59 GMT
Expires: Mon, 24 Feb 2014 06:20:59 GMT
Location: http://www.google.com/
Server: gws
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
The documentation of onResourceReceived says:
Note about the ``body`` property: by default, the body property is filled only for the resource that corresponds to the main html page. For other resources, it will be empty.
If you want to have it filled for resources used in the page, you have to indicate their content type into captureContent property. This limitation exists to avoid to take memory uselessly (in the case where you don’t need the body property), since resources like images or videos could take many memory.
I understand that the body content is not filled for resources other than the "main" HTML page, and that this can be changed by adjusting the regular expressions in captureContent.
However, there seems to be no way to make it capture the content of redirected resources. Is there any way to get the content of the redirected resource?
This doesn't work; the redirected responses still have an empty body:
page.captureContent = [ /.*/ ];
There seems to be some discussion but no further work here: https://github.com/ariya/phantomjs/issues/10158 and https://github.com/laurentj/slimerjs/issues/110
Allowing the capture of HTTP headers and body content (just the plaintext) on redirected responses would allow me to close https://github.com/laurentj/slimerjs/issues/127, as this functionality would remove the need for navigationLocked to work.
Looks like this will be important for some stuff @seiflotfy and I are working on
@hallvors Can you elaborate on your solution?
We have a script for site compatibility testing; one of its features is that it detects backend browser sniffing by comparing the redirects the server generates for different User-Agent headers. So we need to track what redirects the browser is told to follow.
Did you fix SlimerJS core or use a workaround?
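As a rough illustration of that comparison (not the actual script, which is not shown in this thread), the sketch below reduces two lists of logged responses to their redirect hops and checks whether they differ. The function names and the example.com data are hypothetical; it assumes each logged response carries at least `status` and `url`.

```javascript
// Status codes treated as redirects here (304 is deliberately excluded).
var REDIRECT_CODES = [301, 302, 303, 307, 308];

// Reduces a list of logged responses to the redirect hops only,
// each rendered as "status url".
function redirectHops(responses) {
    return responses.filter(function (r) {
        return REDIRECT_CODES.indexOf(r.status) !== -1;
    }).map(function (r) {
        return r.status + ' ' + r.url;
    });
}

// True when two runs (for example, two User-Agent strings) saw
// different redirect hops.
function redirectsDiffer(runA, runB) {
    return JSON.stringify(redirectHops(runA)) !==
        JSON.stringify(redirectHops(runB));
}

// Hypothetical data for two runs against the same start URL.
var desktopRun = [
    { status: 301, url: 'http://example.com/' },
    { status: 200, url: 'http://www.example.com/' }
];
var mobileRun = [
    { status: 301, url: 'http://example.com/' },
    { status: 302, url: 'http://www.example.com/' },
    { status: 200, url: 'http://m.example.com/' }
];

console.log(redirectsDiffer(desktopRun, mobileRun)); // true
```

This only works if every redirecting response is reported with its real status code, which is why the behaviour discussed in this issue matters for that kind of script.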
Neither - the script currently runs with a GTKWebKit backend :-1:, but we want to port it to SlimerJS.
@laurentj Has this been resolved? I resorted to detecting whether the scrape hit a redirection and, if so, curling the URL directly to obtain the correct content and headers.
@CMCDragonkai Yes, the original issue is resolved (did you see the commit above with unit tests?). And I don't understand what you are trying to do. If there is a new issue, please open a new issue.
Oh, that commit looks like it gets the headers, but I also want the body of the HTTP response during a redirection. Is that possible?
While working with onResourceReceived on both Windows and Linux, I discovered this issue.
Basically if you use onResourceReceived to get the headers and status code of the resource, there's a discrepancy between Linux and Windows when the resource being retrieved redirects you to another resource.
For example requesting http://google.com redirects you to http://www.google.com then to http://www.google.com?queryparam....
Once the redirection is resolved, the status code of that resource is obviously 200. You can try this by using the HTTPie client and just requesting http://google.com and then following the redirections.
Using onResourceReceived is the main way to get the status code and headers inside SlimerJS.
I was logging out the headers and status code inside onResourceReceived. Each resource object has an ID.
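For reference, a minimal sketch of that kind of logging script follows. It is not the exact script used for this report. It assumes the PhantomJS-compatible `webpage` module and `phantom` global that SlimerJS provides, and describeResponse is a hypothetical helper name; the SlimerJS part is guarded so the helper can also be loaded outside SlimerJS.

```javascript
// Hypothetical helper: the three values logged per response,
// namely its id, its status code, and its headers serialized as JSON.
function describeResponse(response) {
    return [response.id, response.status, JSON.stringify(response.headers)];
}

// Only wire up the page when running under SlimerJS/PhantomJS,
// where the `phantom` global and the `webpage` module exist.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();

    page.onResourceReceived = function (response) {
        // A response is reported in several stages; log only the final one.
        if (response.stage !== 'end') {
            return;
        }
        describeResponse(response).forEach(function (line) {
            console.log(line);
        });
    };

    page.open('http://google.com/', function (status) {
        phantom.exit();
    });
}
```

Each completed response then produces three lines of output: the resource ID, the status code, and the header array.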
For the 3 redirects: http://google.com http://www.google.com http://www.google.com/?queryparm...
On Windows, it shows:
On Ubuntu it shows:
As you can see:
I think both implementations are buggy.
On one hand, Windows does not show the "truth" regarding the headers and status code.
On the other hand, Ubuntu will not give you the headers of requests being redirected.
I think in both cases, the onResourceReceived should show the correct and relevant headers and status code for each resource that is being requested.