rkd77 / elinks


S with caron rendered as tofu by elinks, rendered correctly by links #249

Open 0-issue opened 11 months ago

0-issue commented 11 months ago

S with caron (Š) rendered as tofu by elinks, rendered correctly by links

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<head>
<title>Test UTF-8</title>
</head>
<body>
<p class="indent">Jaro Šnajdrov</p>
</body>
</html>

links output:

Screenshot 2023-07-23 at 7 41 26 PM

elinks output:

Screenshot 2023-07-23 at 7 42 14 PM
0-issue commented 11 months ago

Similar problem with unicode non breaking space... it is rendered as tofu by elinks, and not by links.

% printf "a 8" | xxd
00000000: 61c2 a038                                a..8
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<head>
<title>Test UTF-8 NBSP</title>
</head>
<body>
<p class="indent">Chapter 8</p>
</body>
</html>

links output:

Screenshot 2023-07-23 at 9 04 35 PM

elinks output:

Screenshot 2023-07-23 at 9 04 21 PM
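
For reference, the byte sequences involved in the two reports can be derived by hand. The following is a minimal sketch of a UTF-8 encoder; utf8_encode is a hypothetical helper written for illustration and is not taken from the ELinks sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 for surrogates and for values above U+10FFFF.
 * Hypothetical helper for illustration, not an ELinks function. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx + three continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this encoder, U+00A0 (non-breaking space) comes out as C2 A0, which matches the xxd output above, and U+0160 (Š) comes out as C5 A0. Both sequences end in the same continuation byte, 0xA0.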
rkd77 commented 11 months ago

Did you compile elinks with utf-8 enabled? -Dutf-8=true

0-issue commented 11 months ago

@rkd77 I have never used meson, so I am not sure if the build options in meson_options.txt are picked up by the script. Here are the steps I followed:

./autogen.sh
./configure
make
sudo make install

I am not sure if meson_options.txt is picked up by this flow; I would guess not. I couldn't find any meson installation instructions in your INSTALL file. Below is an excerpt from ./configure. As far as I can see, it did not pick up the true values for the 256-colors/true-color options from meson_options.txt. Are there two different build flows here? If yes, how can I pass those options to configure? (It doesn't accept options like -Dutf-8=true.)

The following feature summary has been saved to features.log
Feature summary:
Documentation Tools ............. AsciiDoc, XmlTo, Pod2HTML
Manual Formats .................. HTML (one file), HTML (multiple files)
Man Page Formats ................ HTML, man (groff)
API Documentation ............... no
gpm ............................. no
terminfo ........................ no
zlib ............................ yes
bzlib ........................... yes
zstd ............................ no
brotli .......................... no
lzma ............................ no
idn2 ............................ no
Bookmarks ....................... yes
XBEL bookmarks .................. yes
ECMAScript (JavaScript) ......... no
Browser scripting ............... no
libev ........................... no
libevent ........................ no
SSL ............................. GNUTLS
Native Language Support ......... yes
System gettext .................. no
Cookies ......................... yes
Form history .................... yes
Global history .................. yes
Mailcap ......................... yes
Mimetypes files ................. yes
IPv6 ............................ yes
BitTorrent protocol ............. no
Data protocol ................... yes
URI rewriting ................... yes
Local CGI ....................... no
DOS Gateway Interface ........... no
Finger protocol ................. no
FSP protocol .................... no
FTP protocol .................... yes
Gemini protocol ................. no
Gopher protocol ................. no
NNTP protocol ................... no
Samba protocol .................. no
Mouse handling .................. yes
BSD sysmouse .................... no
88 colors ....................... no
256 colors ...................... no
true color ...................... no
Exmode interface ................ no
LEDs ............................ yes
Marks ........................... yes
Cascading Style Sheets .......... yes
HTML highlighting ............... no
DOM engine ...................... no
Backtrace ....................... yes
No root exec .................... no
Debug mode ...................... no
Fast mode ....................... no
Own libc stubs .................. no
Small binary .................... no
UTF-8 ........................... yes
Combining characters ............ no
Reproducible builds ............. no
Check codepoints ................ no
Regexp searching ................ no (TRE not found)
rkd77 commented 11 months ago

Here is a simple build script for meson:


rm -rf /dev/shm/builddir

meson setup /dev/shm/builddir \
-D88-colors=false \
-D256-colors=true \
-Dapidoc=false \
-Dpdfdoc=false
...
and so on

meson compile -C /dev/shm/builddir

and cd /dev/shm/builddir && ninja install

It seems the configure script also built the binary with UTF-8 support. What is your locale (LANG, LC_ALL)? Which terminal? Which distribution?

On Debian 12 with konsole and LANG=pl_PL.UTF-8, it is displayed fine.

0-issue commented 11 months ago

@rkd77 On macOS aarch64. macOS doesn't have pl_PL.UTF-8; it is en_US.UTF-8.

% locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

Terminal: tested on several: iTerm2, Alacritty, WezTerm, Kitty. They all have different fonts too... I wonder if it is the Combining characters feature? The meson config stage passed after disabling a whole bunch of options like gpm, libcss, etc. But then the compile stage fails with errors I don't understand. I am adding the error information below for your reference. My experience with make has been much smoother (no errors). I will try changing its options and get back to you.

% meson compile -C ~/.build/elinks
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /opt/homebrew/bin/ninja -C /Users/amanmehra/.build/elinks
ninja: Entering directory `/Users/amanmehra/.build/elinks'
[6/185] Compiling C object src/elinks.p/config_cmdline.c.o
FAILED: src/elinks.p/config_cmdline.c.o
cc -Isrc/elinks.p -Isrc -I../../packages/elinks/src -I. -I../../packages/elinks -I/opt/homebrew/Cellar/zlib/1.2.13/include -I/opt/homebrew/Cellar/tre/0.8.0/include -I/opt/homebrew/Cellar/openssl@3/3.1.1_1/include -I/opt/homebrew/Cellar/libidn2/2.3.4_1/include -I/opt/homebrew/opt/icu4c/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/ncurses/include -I/opt/homebrew/opt/ncurses/include/ncursesw -I/opt/homebrew/opt/gdk-pixbuf/include/gdk-pixbuf-2.0 -I/opt/homebrew/opt/zlib/include -fcolor-diagnostics -Wall -Winvalid-pch -O0 -g '-DGETTEXT_PACKAGE="elinks"' '-DBUILD_ID="c09b5da405dab8e900b6c42a1d0b1dfd06a86f27-dirty"' -DHAVE_CONFIG_H -fno-strict-aliasing -Wno-address -MD -MQ src/elinks.p/config_cmdline.c.o -MF src/elinks.p/config_cmdline.c.o.d -o src/elinks.p/config_cmdline.c.o -c ../../packages/elinks/src/config/cmdline.c
../../packages/elinks/src/config/cmdline.c:173:14: error: call to undeclared function 'idn2_to_ascii_lz'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                int code = idn2_to_ascii_lz(idname, &idname2, 0);
                           ^
../../packages/elinks/src/config/cmdline.c:175:15: error: use of undeclared identifier 'IDN2_OK'
                if (code == IDN2_OK) {
                            ^
2 errors generated.
[17/185] Compiling C object src/elinks.p/document_html_parser_forms.c.o
ninja: build stopped: subcommand failed.

UPDATE: configure with --enable-combining doesn't change anything.

0-issue commented 11 months ago

UPDATE: I installed elinks in an Arch VM and opened it in the same tmux session on macOS. The locale, font, terminal, tmux, and terminfo are the same, but it renders correctly in the Arch VM and not on macOS in the adjacent pane of the same tmux session. For some reason --version does not produce anything on macOS. Here's the --version output:

macOS (+/- Fastmem doesn't matter):

ELinks 0.17.GIT c09b5da405dab8e900b6c42a1d0b1dfd06a86f27-dirty
Built on Jul 24 2023 15:07:24

Features:
Standard, Fastmem, IPv6, gzip(1.2.13), bzip2(1.0.8), UTF-8, Periodic
Saving, Viewer (Search History, Timer, Marks), gettext (ELinks),
Cascading Style Sheets, Protocol (Authentication, File, FTP, HTTP, URI
rewrite, User protocols), SSL (GnuTLS), MIME (Option system, Mailcap,
Mimetypes files), LED indicators, Bookmarks, Cookies, Form History,
Global History, Goto URL History

Arch Linux:

ELinks 0.16.1.1
Built on Jul 24 2023 19:33:32

Features:
Standard, IPv6, gzip(1.2.13), bzip2(1.0.8), zstd(1.5.5), gpm(2.1.0),
UTF-8, Periodic Saving, Viewer (Search History, Timer, Marks), gettext
(ELinks), Cascading Style Sheets, Protocol (Authentication, File, CGI,
FTP, Gemini, HTTP, URI rewrite, User protocols), SSL (OpenSSL), MIME
(Option system, Mailcap, Mimetypes files), LED indicators, Bookmarks,
Cookies, Form History, Global History, Scripting (Lua), Goto URL History
rkd77 commented 11 months ago

@amanvm Could you confirm that the same bug (wrong UTF-8 letter) occurs on a FreeBSD VM? I have no access to such hardware, but I guess FreeBSD is similar to macOS in this case.

0-issue commented 11 months ago

@rkd77 Just tested, it does not happen in a FreeBSD VM! I mean it renders correctly in FreeBSD and Linux. Both were tested in the same tmux session on macOS with defaults (no config). I also tried this with the default config (no config) on macOS, but the problem still persists. So it is not a config problem either... Mine is an aarch64 macOS machine, not sure if that affects anything. Searching "macos virtual machine on linux" shows a whole bunch of videos and guides...

0-issue commented 11 months ago

@rkd77 One observation: Unlike most other systems, where libs/include files are in the standard directories /usr/local or /usr/, Homebrew on aarch64 macOS recommends /opt/homebrew. I ran otool -L (the ldd command's equivalent on macOS) and ldd on Linux, and found that my macOS elinks build had a bunch of missing libs (it still has fewer). It didn't even link to libssl or libiconv. Updating the configure paths for those libs does link it to the respective libs, but things are still the same. Can you eyeball the linked libraries to see if anything more is needed?

On macOS I used this:

./configure --with-openssl=/opt/homebrew/Cellar/openssl@3/3.1.1_1/ --withlibiconv=/opt/homebrew/Cellar/libiconv/1.17/

macOS (otool -L /path/to/elinks):

% otool -L /usr/local/bin/elinks
/usr/local/bin/elinks:
        /opt/homebrew/opt/tre/lib/libtre.5.dylib (compatibility version 6.0.0, current version 6.0.0)
        /opt/X11/lib/libX11.6.dylib (compatibility version 11.0.0, current version 11.0.0)
        /opt/homebrew/opt/openssl@3/lib/libssl.3.dylib (compatibility version 3.0.0, current version 3.0.0)
        /opt/homebrew/opt/openssl@3/lib/libcrypto.3.dylib (compatibility version 3.0.0, current version 3.0.0)
        /usr/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.8)
        /opt/homebrew/opt/zlib/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.13)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
        /usr/lib/libexpat.1.dylib (compatibility version 7.0.0, current version 8.0.0)
        /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1500.65.0)

ArchLinux (ldd /path/to/elinks):

% ldd /usr/bin/elinks
        linux-vdso.so.1 (0x0000ffffaaf45000)
        libtre.so.5 => /usr/lib/libtre.so.5 (0x0000ffffaacf0000)
        libssl.so.3 => /usr/lib/libssl.so.3 (0x0000ffffaac20000)
        libcrypto.so.3 => /usr/lib/libcrypto.so.3 (0x0000ffffaa780000)
        liblua.so.5.4 => /usr/lib/liblua.so.5.4 (0x0000ffffaa720000)
        libidn.so.12 => /usr/lib/libidn.so.12 (0x0000ffffaa6d0000)
        libzstd.so.1 => /usr/lib/libzstd.so.1 (0x0000ffffaa600000)
        libbz2.so.1.0 => /usr/lib/libbz2.so.1.0 (0x0000ffffaa5d0000)
        libz.so.1 => /usr/lib/libz.so.1 (0x0000ffffaa5a0000)
        libgpm.so.2 => /usr/lib/libgpm.so.2 (0x0000ffffaa580000)
        libexpat.so.1 => /usr/lib/libexpat.so.1 (0x0000ffffaa540000)
        libc.so.6 => /usr/lib/libc.so.6 (0x0000ffffaa380000)
        /lib/ld-linux-aarch64.so.1 => /usr/lib/ld-linux-aarch64.so.1 (0x0000ffffaaf0c000)
        libm.so.6 => /usr/lib/libm.so.6 (0x0000ffffaa2d0000)
        libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x0000ffffaa240000)
rkd77 commented 11 months ago

An unsigned char -> char conversion made in the past is suspected. There are not too many released tarballs, so checking them one by one or by "bisection" could prove this hypothesis. If 0.13.0 fails, I have no idea.
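
To illustrate why such a conversion is a plausible suspect: where plain char is signed, a UTF-8 byte such as 0xC5 reads as a negative value, so a comparison like c >= 0x80 never succeeds unless the byte is first converted to unsigned char. The sketch below shows the safe pattern; count_high_bytes is a hypothetical helper, not code from ELinks.

```c
#include <assert.h>
#include <stddef.h>

/* Count the bytes with the high bit set in a NUL-terminated string.
 * Every byte is read through unsigned char, so the result does not
 * depend on whether plain char is signed on the target platform.
 * Hypothetical helper for illustration, not an ELinks function. */
size_t count_high_bytes(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        unsigned char byte = (unsigned char)*s;  /* 0..255, never negative */

        if (byte >= 0x80)
            n++;
    }
    return n;
}
```

For the two bytes of Š (C5 A0) this returns 2 on any platform. The same cast is what the ctype.h functions require for their argument when it comes from a char.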

0-issue commented 11 months ago

@rkd77 Tested on 0.13.0, it's the same there! I wonder if there is any other Unicode shaping library that you could use, or how links is able to display these glyphs.

rkd77 commented 11 months ago

What if charset is added in meta?


<!DOCTYPE html>
<head>
<meta charset="utf-8"/>
<title>Test UTF-8 NBSP</title>
</head>
<body>
<p class="indent">Jaro Šnajdrov</p>
</body>
</html>

Also, what does the dump look like:
elinks -dump file.html
and
links -dump file.html
?
0-issue commented 11 months ago

@rkd77 For 1) I get an error "Bad url syntax" with elinks, but not with links.

For 2) (the -dump option): same problem. The output from links has the correct S with caron; the dump from elinks has tofu there.

It's a little late here, so any other followup will be after a while. Thanks!

Screenshot 2023-07-27 at 12 43 11 AM
rkd77 commented 11 months ago

I guess it has something in common with the detection of encoding. If this ^ commit did not resolve it, I have no idea.

0-issue commented 11 months ago

@rkd77 Didn't resolve it. One comment I have: a lot of Unicode seems to render just fine. It's only a subset that doesn't. If you could think of a patch that logs the arguments and return values of the interesting functions as text for an input test document like this, I can volunteer to run it for sure.

rkd77 commented 11 months ago

@amanvm you can prepare test cases and save dumps (elinks --dump) . And show hex view of these dumps.

0-issue commented 11 months ago

@rkd77 I am not a Unicode/UTF-8 expert, and we might end up doing a lot of back and forth that way. Don't you want to add an fprintf or two to some important functions that shape/process Unicode data so there is faster convergence? A branch or patch with some fprintfs would help.

rkd77 commented 11 months ago

@amanvm There are many places where it can break. First I want to know what it "looks" like. In elinks: F9 -> File -> Save formatted document (save with a .txt extension). Please create a tarball with a few cases, containing the original files and the formatted documents. BTW, one of the previous messages showed: ELinks 0.17.GIT c09b5da405dab8e900b6c42a1d0b1dfd06a86f27-dirty. "dirty" means that you modified the sources. What was changed?

rkd77 commented 11 months ago

Another question. How is plain text with this character rendered? Also as "tofu", or OK?

0-issue commented 11 months ago

@rkd77 OK, I have something for you! I used the test document from pragmatapro's repository, as that is one of the most comprehensive terminal fonts and the test file lists all of its glyphs with their Unicode code points. There is a free clone called pragmasevka; you are welcome to try viewing the documents with that.

There are 3 files, and 2 screenshots attached here:

  1. All_chars.txt: This is unmodified test document mentioned above.
  2. All_chars_elinks.txt: This is the output of elinks -dump All_chars.txt.
  3. All_chars_links.txt: This is the output of links -dump All_chars.txt.

Files: All_chars.txt All_chars_elinks.txt All_chars_links.txt

Screenshots: 1) Compares All_chars.txt with All_chars_elinks.txt. As you can see the first tofu appears at U+00C5. Compare the glyph near the cursor in right window to the glyph in left window.

Screenshot 2023-07-28 at 5 24 35 PM

2) Compares All_chars.txt with All_chars_links.txt. As you can see no tofu appears at U+00C5, links doesn't have the problem. Compare the glyph near the cursor in right window to the glyph in left window.

Screenshot 2023-07-28 at 5 26 19 PM

More tofu can be seen on the screen; you can also download the txt files to see it for yourself.


Regarding your other question about "ELinks 0.17.GIT c09b5da-dirty": I just tried installing the Homebrew version from the head branch using brew install -s --HEAD, and --version on Homebrew's head build also had -dirty at the end. So something other than manual changes to files is causing it.

rkd77 commented 11 months ago

@amanvm on branch utf I added debug statements and test/chars.txt. Please compile, and check elinks -dump chars.txt 2> log and show log.

0-issue commented 11 months ago

@rkd77 Here we go, this is what the log file has:

goto charsets.c:758:utf8_to_unicode
charsets.c:751:utf8_to_unicode
rkd77 commented 11 months ago

Added more debug statements. Could you rerun test? Which compiler is it?

0-issue commented 11 months ago

@rkd77 Here's the output:

goto charsets.c:759:utf8_to_unicode
str[0]=195
str[1]=32
str[1] & 0xc0 = 0
charsets.c:751:utf8_to_unicode
str[0]=195

Clang, as gcc/g++ is actually clang/clang++ on Apple macOS. This is also the aarch64 (ARM64) version.

% /usr/bin/gcc --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

% /usr/bin/g++ --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
rkd77 commented 11 months ago

Added another commit to the utf branch. I disabled maybe_preformat_hook in dump to rule it out as a suspect. The second issue is str[1]=32 (space). In chars.txt I replaced spaces with digits, so this time I expect a digit instead of 32. If there is no error, then the preformat hook is the culprit. Please git pull, compile and rerun elinks -dump chars.txt
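
The log lines str[0]=195, str[1]=32 and str[1] & 0xc0 = 0 can be read against the usual validity check for a two-byte UTF-8 sequence. The following is a minimal sketch of that check under the assumption that the decoder falls back to U+FFFD on failure; decode_two_byte is a hypothetical function and not the actual utf8_to_unicode from charsets.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Decode a two-byte UTF-8 sequence starting at str.
 * Returns the code point, or U+FFFD when fewer than two bytes are
 * available, when the lead byte is outside 0xC2..0xDF, or when the
 * second byte is not a continuation byte (10xxxxxx).
 * Hypothetical sketch, not the ELinks implementation. */
uint32_t decode_two_byte(const unsigned char *str, size_t len)
{
    assert(str != NULL);

    if (len < 2)
        return REPLACEMENT_CHAR;
    if (str[0] < 0xC2 || str[0] > 0xDF)
        return REPLACEMENT_CHAR;
    if ((str[1] & 0xC0) != 0x80)          /* 32 & 0xC0 == 0, so a space fails here */
        return REPLACEMENT_CHAR;

    return ((uint32_t)(str[0] & 0x1F) << 6) | (uint32_t)(str[1] & 0x3F);
}
```

The intact pair C3 85 decodes to U+00C5 (Å). If the second byte has been turned into a space (C3 20), the continuation check fails, which is consistent with a lead byte of 195 followed by 32 in the log.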

0-issue commented 11 months ago

@rkd77 It's the same stderr output as before:

goto charsets.c:759:utf8_to_unicode
str[0]=195
str[1]=32
str[1] & 0xc0 = 0
charsets.c:751:utf8_to_unicode
str[0]=195

stdout has this:

   U+00C0 1À2Á3Â4Ã5Ä6�7Æ8Ç9È
0-issue commented 11 months ago

@rkd77 One observation: If I open the document without -dump, the character renders correctly (no error is seen). Are you sure the path taken by application for -dump is the same?

Actually, the error wasn't seen on a previous version of elinks either for this Å when not opened with -dump. But the error is still always seen with S with caron (Š). So there is some difference between -dump and non-dump behavior. Let's try the S with caron (Š) and its neighbors perhaps?

rkd77 commented 11 months ago

dump interprets the document as HTML, the normal view as plain text. You can check the latest commits and show the log. I'm slowly running out of ideas.

0-issue commented 11 months ago

@rkd77 Here's the log: chars.log. Not sure if it matters, here's the entire build log: build.log.

rkd77 commented 11 months ago

@amanvm, thanks, could you continue? I made a mistake in the commit log, but we are closer. git pull, compile and send the same log.

0-issue commented 11 months ago

@rkd77 Sure, I am here to help. Here's the output chars.log. There are new compiler warnings, not sure if they matter: build.log.

rkd77 commented 11 months ago

I guess isspace returns different results than on Linux. The warnings don't matter here, at least not yet. Please rerun the test.
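
One way to test this guess outside of elinks is a small probe. The C standard makes the result of isspace() depend on the current locale, so bytes above 0x7F may be classified differently from one C library to another. The sketch below is an assumption about how such a check could look; report_isspace is a hypothetical helper and is not part of the debug commits mentioned in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Log to stderr what isspace() reports for each of the n bytes, using
 * the locale taken from the environment (LANG, LC_ALL, ...).
 * Returns how many bytes >= 0x80 were classified as whitespace.
 * Hypothetical diagnostic helper, not an ELinks function. */
size_t report_isspace(const unsigned char *bytes, size_t n)
{
    size_t high_spaces = 0;

    assert(bytes != NULL);
    setlocale(LC_ALL, "");                /* adopt the user's locale */

    for (size_t i = 0; i < n; i++) {
        int is_space = isspace(bytes[i]) != 0;

        fprintf(stderr, "byte 0x%02X isspace=%d\n",
                (unsigned)bytes[i], is_space);
        if (is_space && bytes[i] >= 0x80)
            high_spaces++;
    }
    return high_spaces;
}
```

The interesting inputs are the continuation bytes seen in this thread: 0x85 (second byte of Å) and 0xA0 (second byte of Š and of the non-breaking space). A nonzero return value for those bytes on one system and zero on another would support the guess.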

0-issue commented 11 months ago

@rkd77 Here it is: chars.log, and build.log.

rkd77 commented 11 months ago

I added code for isspace. Could you check whether it works? You can redirect stderr to /dev/null
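
A locale-independent whitespace test is one way to avoid the problem. The sketch below accepts only the six ASCII whitespace characters; the names ascii_isspace and skip_ascii_space are hypothetical, and the actual code added to elinks may cover a different set of characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Locale-independent whitespace test: true only for the six ASCII
 * whitespace characters that isspace() accepts in the "C" locale.
 * Hypothetical name, not the function added to ELinks. */
bool ascii_isspace(unsigned char c)
{
    switch (c) {
    case ' ':
    case '\t':
    case '\n':
    case '\v':
    case '\f':
    case '\r':
        return true;
    default:
        return false;
    }
}

/* Skip leading ASCII whitespace and return a pointer to the first
 * other byte. UTF-8 lead and continuation bytes are never skipped,
 * so multi-byte sequences stay intact. */
const unsigned char *skip_ascii_space(const unsigned char *s)
{
    assert(s != NULL);
    while (ascii_isspace(*s))
        s++;
    return s;
}
```

Because every byte of a multi-byte UTF-8 sequence is 0x80 or above, none of them can match this predicate, regardless of the locale in effect.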

0-issue commented 11 months ago

@rkd77 It works well everywhere now! No tofus!

Btw, I would recommend mentioning somewhere that users should close other instances of elinks before trying out a new version. If an old version is open, the old behavior persists for some reason even with the new binary. When I closed all old instances, the new binary's behavior kicked in. I know you have some socket file for communication between elinks instances, but I am not sure how it is used and couldn't find much info in the documentation.

rkd77 commented 11 months ago

This commit was added to the master branch. Likely more characters must be added to isspace.

In docs there is info about sessions and elinks instances.