psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

Incomplete HTML Request #2628

Closed Blizz8975 closed 9 years ago

Blizz8975 commented 9 years ago

Hey guys! I'm currently working on some content migration and I can't seem to pull the entire HTML source code using Python.

Here is one of the pages I'm working on: http://holytrinityhs.echalk.com/site_res_view_photoalbum.aspx?resourceId=0b744865-ad8a-4e76-8d42-15966cd7c4e2

So using html = requests.get("http://holytrinityhs.echalk.com/site_res_view_photoalbum.aspx?resourceId=0b744865-ad8a-4e76-8d42-15966cd7c4e2") and then calling html.text gives me:

'<HTML><HEAD><TITLE>Holy Trinity Diocesan High School</TITLE></HEAD><BODY>Holy Trinity Diocesan High School<BR>98 Cherry Lane<BR>Hicksville, NY 11801<BR><BR>516-433-2900</BODY></HTML><!--\r\nWeb Server: W04ECNJ\r\nTotal Time: 31.2492ms\r\nCache Key: site_res_view_photoalbum_aspx_Down_en_2107878629\r\n//-->           \r\n'

which is not the full html source code.

Any help would be very much appreciated!

Thanks!

Lukasa commented 9 years ago

Hmm, I can't test this because I keep getting 504s. Are you sure they sent a complete response?

Blizz8975 commented 9 years ago

I'm not exactly sure what you mean by a complete response. Can you tell me how I can verify this? :)

Lukasa commented 9 years ago

Yeah, that's a bit tricky to verify. It would help to see the response headers if you can print them out. That way I can check whether this is prone to truncated responses, at the very least.

Blizz8975 commented 9 years ago

Does this help?

    import urllib3
    http = urllib3.PoolManager()
    r = http.request('GET', 'http://example.com/')
    r.headers['server']  # ==> 'ECS (mdw/1275)'

Blizz8975 commented 9 years ago

My site gives this:

    import urllib3
    http = urllib3.PoolManager()
    r = http.request('GET', 'http://holytrinityhs.echalk.com/site_res_view_photoalbum.aspx?resourceId=78224c68-7155-4b2e-999c-cc9abf549f2b')
    r.status             # ==> 200
    r.headers['server']  # ==> 'Microsoft-IIS/6.0'

Lukasa commented 9 years ago

Sorry, I'd like to see all the headers.
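
For anyone following along, here is a minimal sketch of one way to dump every response header rather than a single one. The helper name `format_headers` and the truncation of long values are assumptions made for readability, not part of requests or urllib3; it only relies on the headers object behaving like a mapping, which `r.headers` does in both libraries.

```python
def format_headers(headers, max_value_len=80):
    """Return one 'Name: value' line per header, sorted by name.

    `headers` is any mapping of header names to values (hypothetical helper).
    Values longer than `max_value_len` are cut short and marked with '...'.
    """
    lines = []
    # Header names are case-insensitive, so sort without regard to case.
    for name in sorted(headers, key=str.lower):
        value = str(headers[name])
        if len(value) > max_value_len:
            value = value[:max_value_len] + "..."
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Sample values taken from the headers posted in this thread.
sample = {
    "Server": "Microsoft-IIS/6.0",
    "Content-Length": "304",
    "Cache-Control": "private",
}
print(format_headers(sample))
```

With a real response, something like `print(format_headers(r.headers))` should work the same way; raise `max_value_len` if the full value of a long header such as `PICS-Label` is needed.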

Blizz8975 commented 9 years ago

How about this? (from urllib3) HTTPHeaderDict({'Server': 'Microsoft-IIS/6.0', 'X-Powered-By': 'ASP.NET', 'Date': 'Wed, 03 Jun 2015 16:10:14 GMT', 'X-AspNet-Version': '4.0.30319', 'PICS-Label': '(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0)), (PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0))(PICS-1.0 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (v 0 s 0 n 0 l 0)), (PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0))(PICS-1.0 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (v 0 s 0 n 0 l 0))(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (l 0 s 0 v 0 o 0))', 'Cache-Control': 'private', 'Content-Type': 'text/html; charset=Windows-1252', 'Content-Length': '304', 'Set-Cookie': 'WebHostServer=W09ECNJ; path=/'})

Blizz8975 commented 9 years ago

This is the header I get from using requests: {'cache-control': 'private', 'x-aspnet-version': '4.0.30319', 'set-cookie': 'WebHostServer=W07ECNJ; path=/', 'date': 'Wed, 03 Jun 2015 16:16:16 GMT', 'x-powered-by': 'ASP.NET', 'content-type': 'text/html; charset=Windows-1252', 'content-encoding': 'gzip', 'content-length': '370', 'server': 'Microsoft-IIS/6.0', 'pics-label': '(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0)), (PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0))(PICS-1.0 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (v 0 s 0 n 0 l 0)), (PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0))(PICS-1.0 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (v 0 s 0 n 0 l 0))(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (l 0 s 0 v 0 o 0))'}

Lukasa commented 9 years ago

The cleaned up version:

{'Cache-Control': 'private',
 'Content-Length': '304',
 'Content-Type': 'text/html; charset=Windows-1252',
 'Date': 'Wed, 03 Jun 2015 16:10:14 GMT',
 'PICS-Label': '(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0)), (PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0))(PICS-1.0 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (v 0 s 0 n 0 l 0)), (PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0))(PICS-1.0 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (v 0 s 0 n 0 l 0))(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l by "support@echalk.com" on "2005.04.14T14:34-0400" exp "2008.04.18T12:00-0400" r (l 0 s 0 v 0 o 0))',
 'Server': 'Microsoft-IIS/6.0',
 'Set-Cookie': 'WebHostServer=W09ECNJ; path=/',
 'X-AspNet-Version': '4.0.30319',
 'X-Powered-By': 'ASP.NET'}

So the content-length header there is 304 bytes. That seems about right, so we haven't missed any HTML. It suggests that you're not making quite the same request your browser is. Do you know how to use your browser development tools?
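
The check described above can be sketched roughly as follows. The function name `check_content_length` is hypothetical, and it assumes `body` holds the bytes exactly as they came over the wire. That matters here: the headers returned through requests include `content-encoding: gzip` with a `content-length` of 370, and in that case the header counts compressed bytes, so it will not match the length of the decoded body.

```python
def check_content_length(headers, body):
    """Compare the declared Content-Length with the bytes actually received.

    Returns (declared, received, complete). When the header is missing
    (for example with chunked transfer encoding), `declared` and `complete`
    are both None because there is nothing to compare against.
    """
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("content-length")
    if raw is None:
        return None, len(body), None
    declared = int(raw)
    return declared, len(body), declared == len(body)


# Illustrative values only: a 304-byte body against the header from the thread.
print(check_content_length({"Content-Length": "304"}, b"x" * 304))
# A body shorter than declared would point to a truncated response.
print(check_content_length({"Content-Length": "304"}, b"x" * 100))
```

If the lengths agree, as they appear to in this case, the server really did send a short page, and the difference from the browser is more likely in the request itself.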

Blizz8975 commented 9 years ago

I think so, the entire HTML should give this:

"Holy Trinity Diocesan High School - Art Show

<!--[if lt IE 8]> <![endif]-->

Art Show"

Lukasa commented 9 years ago

Sorry, what I want you to do is use your developer tools to see what web request your browser is making. I suspect you need some cookies you don't have.
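
As an illustration of the cookie idea, here is a small sketch that turns a `Set-Cookie` header value into a plain dict using the standard library. The helper name `cookies_from_set_cookie` is hypothetical, and the thread never confirms which cookie (if any) the site actually requires, so the `WebHostServer` value below is only the one visible in the headers posted earlier.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie(set_cookie_value):
    """Parse a Set-Cookie header value into a {name: value} dict.

    Attributes such as `path` are dropped; only the cookie name and value
    are kept, since that is what gets sent back to the server.
    """
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    return {name: morsel.value for name, morsel in jar.items()}


cookies = cookies_from_set_cookie("WebHostServer=W09ECNJ; path=/")
print(cookies)
```

A dict like this can be passed through the `cookies=` parameter of `requests.get`. Alternatively, a `requests.Session` keeps cookies from earlier responses and sends them on later requests, which may be simpler if the site sets them on a first visit.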

Blizz8975 commented 9 years ago

My bad :) Do you mean the information under cookies in the resources tab?

Blizz8975 commented 9 years ago

Nvm found the problem

Thanks for everything!