rkd77 / elinks

Fork of elinks

URIs with non-ASCII characters aren't handled correctly #221

Closed aelmahmoudy closed 1 year ago

aelmahmoudy commented 1 year ago

For the following HTML file:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">

<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Test of Accented Characters in URLs (ISO-8859-1 Encoding)</title>
</head>

<body>
<h1>Test of Accented Characters in URLs (ISO-8859-1 Encoding)</h1>

<ul>
<li><a href="http://localhost/1é.html">http://localhost/1é.html</a></li>
<li><a href="http://localhost/2&#xe9;.html">http://localhost/2&#xe9;.html</a></li>
<li><a href="http://localhost/3%C3%A9.html">http://localhost/3%C3%A9.html</a></li>
</ul>
</body>
</html>

the first two URIs are not handled correctly: the "é" character is sent as the raw ISO-8859-1 byte (logged as \xe9) instead of %C3%A9:

::1 - - [20/Apr/2023:05:18:23 +0200] "GET /1\xe9.html HTTP/1.1" 404 488 "-" "ELinks/0.13.2 (textmode; Linux 6.1.0-7-amd64 x86_64; 96x60-2)"
::1 - - [20/Apr/2023:05:18:25 +0200] "GET /2\xe9.html HTTP/1.1" 404 487 "-" "ELinks/0.13.2 (textmode; Linux 6.1.0-7-amd64 x86_64; 96x60-2)"
::1 - - [20/Apr/2023:05:18:27 +0200] "GET /3%C3%A9.html HTTP/1.1" 404 487 "-" "ELinks/0.13.2 (textmode; Linux 6.1.0-7-amd64 x86_64; 96x60-2)"

As you can see, the first two links are sent incorrectly by ELinks.

As a comparison, here's what lynx gives:

::1 - - [20/Apr/2023:05:21:02 +0200] "GET /1%C3%A9.html HTTP/1.0" 404 451 "-" "Lynx/2.9.0dev.12 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/3.7.8"
::1 - - [20/Apr/2023:05:21:07 +0200] "GET /2%C3%A9.html HTTP/1.0" 404 451 "-" "Lynx/2.9.0dev.12 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/3.7.8"
::1 - - [20/Apr/2023:05:21:12 +0200] "GET /3%C3%A9.html HTTP/1.0" 404 451 "-" "Lynx/2.9.0dev.12 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/3.7.8"

The é is encoded as %C3%A9 in the 3 cases, which is correct.
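The difference between the two sets of request lines comes down to which byte sequence represents é before it goes on the wire. A minimal Python sketch (standard library only, not ELinks code) shows both encodings side by side:

```python
from urllib.parse import quote

char = "\u00e9"  # é, U+00E9

# ISO-8859-1 encodes é as the single byte 0xE9 (logged as \xe9 in the ELinks requests).
latin1_bytes = char.encode("iso-8859-1")

# UTF-8 encodes é as the two bytes 0xC3 0xA9 (seen percent-encoded in the lynx requests).
utf8_bytes = char.encode("utf-8")

print(latin1_bytes.hex())  # e9
print(utf8_bytes.hex())    # c3a9

# Percent-encoding the same path with each charset:
path = "/1" + char + ".html"
print(quote(path, encoding="iso-8859-1"))  # /1%E9.html
print(quote(path, encoding="utf-8"))       # /1%C3%A9.html
```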

Ditto with Firefox:

127.0.0.1 - - [20/Apr/2023:05:25:35 +0200] "GET /1%C3%A9.html HTTP/1.1" 404 488 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"
127.0.0.1 - - [20/Apr/2023:05:25:38 +0200] "GET /2%C3%A9.html HTTP/1.1" 404 487 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0"
127.0.0.1 - - [20/Apr/2023:05:25:41 +0200] "

rkd77 commented 1 year ago

Now, for A elements only, the href is converted to the terminal codepage, which is usually UTF-8 nowadays. At least for the above test case, ELinks now behaves like the other browsers.
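As a rough illustration of that conversion (a Python sketch, not the actual ELinks C code), the three hrefs from the test page can be normalized as below. The helper name `normalize_href` and its parameters are hypothetical, and the target charset is assumed to be UTF-8:

```python
import html
from urllib.parse import quote

def normalize_href(raw_href: bytes, doc_charset: str, target_charset: str = "utf-8") -> str:
    """Hypothetical helper: decode an href from the document charset,
    resolve character references, then percent-encode non-ASCII
    characters using the target charset (assumed terminal codepage)."""
    text = raw_href.decode(doc_charset)
    text = html.unescape(text)  # turns &#xe9; into é
    # Leave URL delimiters and existing %XX escapes untouched.
    return quote(text, safe=":/?#[]@!$&'()*+,;=%", encoding=target_charset)

hrefs = [
    b"http://localhost/1\xe9.html",    # raw ISO-8859-1 byte
    b"http://localhost/2&#xe9;.html",  # numeric character reference
    b"http://localhost/3%C3%A9.html",  # already percent-encoded
]
for h in hrefs:
    print(normalize_href(h, "iso-8859-1"))
```

Under these assumptions, all three hrefs end in `%C3%A9.html`, matching the lynx log lines above.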

balducci commented 1 year ago

hello

commit 7a665d8de49236bf946efb25277bcd3612fc6242 breaks all my smart rewrite rules.

Eg, in elinks.conf I have:

set  protocol.rewrite.smart.dd  =  "https://duckduckgo.com/?t=ouk&q=%s"

If I type:

dd:felinks software

in the Go to URL window, I get an error message from the DuckDuckGo site:

                  Oops, there was an error.  Please try again.                  

                If it persists, please email ops@duckduckgo.com                 

OTOH, I get the usual correct behavior with any version prior to the breaking commit mentioned above.
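For reference, here is a rough Python model of what the rule above is expected to produce. It is only a sketch: `expand_smart_rule` is a hypothetical name, and the way the arguments are escaped (percent-encoded UTF-8) is an assumption, so the escaping ELinks actually applies may differ:

```python
from urllib.parse import quote

def expand_smart_rule(template: str, typed: str) -> str:
    """Hypothetical model of a smart rewrite rule: split 'prefix:arguments'
    and substitute the escaped arguments for %s in the template."""
    _prefix, _, arguments = typed.partition(":")
    # Assumption: arguments are percent-encoded as UTF-8 before substitution.
    return template.replace("%s", quote(arguments, safe=""))

rule = "https://duckduckgo.com/?t=ouk&q=%s"
print(expand_smart_rule(rule, "dd:felinks software"))
# https://duckduckgo.com/?t=ouk&q=felinks%20software
```

The point of the sketch is that the template's own `?`, `&` and `=` must reach the server unchanged, with only the substituted arguments escaped.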

I don't know if this is a problem introduced by 7a665d8de49236bf946efb25277bcd3612fc6242, or if I should change something in my smart rewrite rules to make them work again.

many thanks in advance for any help -gabriele

rkd77 commented 1 year ago

And commit from #226 ?

balducci commented 1 year ago

> And commit from #226 ?

ah, that apparently fixed everything! I thank you very much indeed