jumaris / indyproject

Automatically exported from code.google.com/p/indyproject

TIdHTTP.Get method doesn't accept new redirect location resulting 404 exception #249

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Indy10 embedded in Delphi XE3

When TIdHTTP.Get follows a redirect, it does not accept a newly received 
location that contains ANSI (non-ASCII) characters. Conversion errors between 
the binary AnsiString and the Unicode String turn those characters into "?" in 
the location, so the request ends with ResponseCode 404 where it should be 200.

Calling the method with this URL: 
http://www.email-brokers.com/cz/b2c-a-b2b/profily-b2c will probably end with a 
404 response. When you open the same URL in Chrome, Firefox, or IE, you are 
redirected correctly to 
http://www.email-brokers.com/cz/emailov%C3%BD-seznam-B2C with result 200. 
In the TIdHTTP.Get method, the "%C3%BD" part changes to "??" characters, 
which produces a bad new location URL and throws a 404 error where it 
shouldn't.

Original issue reported on code.google.com by gambit47 on 5 Apr 2013 at 11:52

GoogleCodeExporter commented 9 years ago
The problem is on the HTTP server's end, not in TIdHTTP.

The server's "Location" reply header contains an illegally formatted URL.  HTTP 
headers are not allowed to contain non-ASCII characters in them, and the "ý" 
character is not allowed to be unencoded in a URL anyway.  It just happens that 
the web browsers are allowing the illegal character, interpreting it as-is in 
the local OS charset, converting it to UTF-8 (which in and of itself may result 
in a bad conversion depending on what the local OS charset actually is), and then 
url-encoding the UTF-8 data when sending it back to the HTTP server.  But that 
behavior is NOT PART OF THE HTTP SPEC!  The "Location" header is meant to be 
used as-is when redirecting.  The browsers are just doing extra work that 
TIdHTTP is not.

This particular server fails in TIdHTTP because TIdHTTP reads HTTP headers as 
7-bit ASCII, as it should per RFCs 822 and 2616, which define HTTP headers as 
ASCII only. As a result, the "ý" character gets lost before TIdHTTP even sees 
it.
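
The loss described above can be reproduced outside of Indy. The following is 
a minimal Python sketch, not TIdHTTP's actual code. It assumes the server sent 
"ý" as the two unencoded UTF-8 bytes 0xC3 0xBD (the issue does not confirm the 
exact bytes), and read_as_7bit_ascii is a hypothetical helper that stands in 
for a strict 7-bit header reader.

```python
# Hypothetical raw "Location" value. Assumption: the server sent "ý" as the
# two unencoded UTF-8 bytes 0xC3 0xBD; the issue does not show the raw bytes.
raw_location = b"http://www.email-brokers.com/cz/emailov\xc3\xbd-seznam-B2C"


def read_as_7bit_ascii(raw: bytes) -> str:
    """Replace every byte above 0x7F with '?', as a strict 7-bit reader might."""
    return bytes(b if b < 0x80 else ord("?") for b in raw).decode("ascii")


lossy_location = read_as_7bit_ascii(raw_location)
print(lossy_location)
# prints: http://www.email-brokers.com/cz/emailov??-seznam-B2C
```

Under that assumption, each of the two non-ASCII bytes becomes one "?", which 
would match the "??" seen in the redirect URL reported above.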

This is not a bug in TIdHTTP.  It is a bug in the HTTP server sending a 
malformed Location in reply to the original GET request.

To mimic what the web browsers are doing, you can set TIdHTTP's 
DefStringEncoding property to Indy8BitEncoding, then use the TIdHTTP.OnRedirect 
event to manually decode and re-encode the URL that TIdHTTP then redirects to.
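
The decode and re-encode step can be sketched independently of Indy. The 
Python below is only an illustration of the byte handling, not Delphi code for 
the OnRedirect event. reencode_location and assumed_charset are hypothetical 
names, and the raw bytes are again assumed to be unencoded UTF-8. Decoding 
with "latin-1" plays the role of an 8-bit encoding here, because it maps every 
byte to one character without loss.

```python
from urllib.parse import quote

# Characters left untouched by quote(): the URL delimiters plus "%", so that
# escapes already present in the URL are not encoded a second time.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def reencode_location(eight_bit: str, assumed_charset: str = "utf-8") -> str:
    """Turn an 8-bit-decoded Location value into a percent-encoded URL."""
    raw = eight_bit.encode("latin-1")   # recover the original header bytes
    text = raw.decode(assumed_charset)  # interpret them in the assumed charset
    return quote(text, safe=SAFE_CHARS)  # percent-encode non-ASCII as UTF-8


# Hypothetical header bytes, read 8-bit clean instead of as 7-bit ASCII.
raw_location = b"http://www.email-brokers.com/cz/emailov\xc3\xbd-seznam-B2C"
fixed_location = reencode_location(raw_location.decode("latin-1"))
print(fixed_location)
# prints: http://www.email-brokers.com/cz/emailov%C3%BD-seznam-B2C
```

The charset is a guess in this sketch. If the server used a single-byte 
charset instead, a different assumed_charset value would be needed, which is 
the same conversion risk noted above for the browsers.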

Original comment by gambit47 on 6 Apr 2013 at 12:12

GoogleCodeExporter commented 9 years ago
Also, non-ASCII characters are not allowed in URLs per RFC 2396.

Original comment by gambit47 on 6 Apr 2013 at 12:20