S with caron (Š) rendered as tofu by elinks, rendered correctly by links
Open · opened 1 year ago
Similar problem with a Unicode non-breaking space... it is rendered as tofu by elinks, but not by links.
% printf "a 8" | xxd
00000000: 61c2 a038 a..8
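For context, a minimal sketch (illustration only, not ELinks code) showing that U+00A0 NO-BREAK SPACE encodes to exactly the c2 a0 byte pair seen in the hexdump above:

```c
#include <stdio.h>

int main(void)
{
    /* Encode U+00A0 with the standard two-byte UTF-8 rule: 110xxxxx 10xxxxxx. */
    unsigned int cp = 0x00A0;                 /* NO-BREAK SPACE */
    unsigned char b0 = 0xC0 | (cp >> 6);      /* 0xC2 */
    unsigned char b1 = 0x80 | (cp & 0x3F);    /* 0xA0 */
    printf("%02x %02x\n", b0, b1);            /* prints: c2 a0 */
    return 0;
}
```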
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<head>
<title>Test UTF-8 NBSP</title>
</head>
<body>
<p class="indent">Chapter 8</p>
</body>
</html>
links output:
elinks output:
Did you compile elinks with utf-8 enabled? -Dutf-8=true
@rkd77 I have never used meson, so I am not sure if the build options in meson_options.txt are picked up by the script. Here are the steps I followed:
./autogen.sh
./configure
make
sudo make install
I am not sure if meson_options.txt is picked up by this flow. I would guess not? I couldn't find any installation instructions using meson in your install instructions file INSTALL. Excerpt from ./configure below. As I can see, it did not pick up the true value for the 256-colors / true-color options from meson_options.txt. Are there two different build flows here? If yes, how can I pass those options to configure? (It doesn't like options like -Dutf-8=true.)
The following feature summary has been saved to features.log
Feature summary:
Documentation Tools ............. AsciiDoc, XmlTo, Pod2HTML
Manual Formats .................. HTML (one file), HTML (multiple files)
Man Page Formats ................ HTML, man (groff)
API Documentation ............... no
gpm ............................. no
terminfo ........................ no
zlib ............................ yes
bzlib ........................... yes
zstd ............................ no
brotli .......................... no
lzma ............................ no
idn2 ............................ no
Bookmarks ....................... yes
XBEL bookmarks .................. yes
ECMAScript (JavaScript) ......... no
Browser scripting ............... no
libev ........................... no
libevent ........................ no
SSL ............................. GNUTLS
Native Language Support ......... yes
System gettext .................. no
Cookies ......................... yes
Form history .................... yes
Global history .................. yes
Mailcap ......................... yes
Mimetypes files ................. yes
IPv6 ............................ yes
BitTorrent protocol ............. no
Data protocol ................... yes
URI rewriting ................... yes
Local CGI ....................... no
DOS Gateway Interface ........... no
Finger protocol ................. no
FSP protocol .................... no
FTP protocol .................... yes
Gemini protocol ................. no
Gopher protocol ................. no
NNTP protocol ................... no
Samba protocol .................. no
Mouse handling .................. yes
BSD sysmouse .................... no
88 colors ....................... no
256 colors ...................... no
true color ...................... no
Exmode interface ................ no
LEDs ............................ yes
Marks ........................... yes
Cascading Style Sheets .......... yes
HTML highlighting ............... no
DOM engine ...................... no
Backtrace ....................... yes
No root exec .................... no
Debug mode ...................... no
Fast mode ....................... no
Own libc stubs .................. no
Small binary .................... no
UTF-8 ........................... yes
Combining characters ............ no
Reproducible builds ............. no
Check codepoints ................ no
Regexp searching ................ no (TRE not found)
Here is a simple build script for meson:
rm -rf /dev/shm/builddir
meson setup /dev/shm/builddir \
-D88-colors=false \
-D256-colors=true \
-Dapidoc=false \
-Dpdfdoc=false
...
and so on
meson compile -C /dev/shm/builddir
and cd /dev/shm/builddir && ninja install
Seems the configure script also built the binary with UTF-8 support. What is your locale (LANG, LC_ALL)? Which terminal? Which distribution?
On Debian 12 with konsole and LANG=pl_PL.UTF-8 it is displayed fine.
@rkd77 On macOS aarch64. macOS doesn't have pl_PL.UTF-8; it is en_US.UTF-8.
% locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Terminal: tested on multiple terminals: iTerm2, Alacritty, WezTerm, Kitty. They all have different fonts too... I wonder if it is the Combining characters feature? The meson config stage passed after disabling a whole bunch of options like gpm, libcss, etc., but then the compile stage fails with errors I don't understand. I am adding the error information below for your reference. My experience with make has been way smoother (no errors). Will try manipulating options for it and get back.
% meson compile -C ~/.build/elinks
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /opt/homebrew/bin/ninja -C /Users/amanmehra/.build/elinks
ninja: Entering directory `/Users/amanmehra/.build/elinks'
[6/185] Compiling C object src/elinks.p/config_cmdline.c.o
FAILED: src/elinks.p/config_cmdline.c.o
cc -Isrc/elinks.p -Isrc -I../../packages/elinks/src -I. -I../../packages/elinks -I/opt/homebrew/Cellar/zlib/1.2.13/include -I/opt/homebrew/Cellar/tre/0.8.0/include -I/opt/homebrew/Cellar/openssl@3/3.1.1_1/include -I/opt/homebrew/Cellar/libidn2/2.3.4_1/include -I/opt/homebrew/opt/icu4c/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/ncurses/include -I/opt/homebrew/opt/ncurses/include/ncursesw -I/opt/homebrew/opt/gdk-pixbuf/include/gdk-pixbuf-2.0 -I/opt/homebrew/opt/zlib/include -fcolor-diagnostics -Wall -Winvalid-pch -O0 -g '-DGETTEXT_PACKAGE="elinks"' '-DBUILD_ID="c09b5da405dab8e900b6c42a1d0b1dfd06a86f27-dirty"' -DHAVE_CONFIG_H -fno-strict-aliasing -Wno-address -MD -MQ src/elinks.p/config_cmdline.c.o -MF src/elinks.p/config_cmdline.c.o.d -o src/elinks.p/config_cmdline.c.o -c ../../packages/elinks/src/config/cmdline.c
../../packages/elinks/src/config/cmdline.c:173:14: error: call to undeclared function 'idn2_to_ascii_lz'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
int code = idn2_to_ascii_lz(idname, &idname2, 0);
^
../../packages/elinks/src/config/cmdline.c:175:15: error: use of undeclared identifier 'IDN2_OK'
if (code == IDN2_OK) {
^
2 errors generated.
[17/185] Compiling C object src/elinks.p/document_html_parser_forms.c.o
ninja: build stopped: subcommand failed.
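For reference, those two errors just mean the compiler never saw the libidn2 prototypes in that translation unit. A minimal sketch of the include that provides them (the CONFIG_IDN2 guard name here is a placeholder for illustration, not necessarily what cmdline.c really uses):

```c
/* Sketch only: <idn2.h> from libidn2 declares idn2_to_ascii_lz() and the
 * IDN2_OK status code used at cmdline.c:173-175.
 * CONFIG_IDN2 is a hypothetical guard name, not the actual ELinks macro. */
#ifdef CONFIG_IDN2
#include <idn2.h>
#endif
```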
UPDATE: configure with --enable-combining doesn't change anything.
UPDATE: I installed elinks in an Arch VM and opened it in the same tmux session on macOS. The locale, font, terminal, tmux, and terminfo are all the same, yet it renders correctly in the Arch VM pane but not in the adjacent macOS pane of the same tmux session. For some reason the --version output does not show anything on macOS that would explain it. Here's the --version output:
macOS (+/- Fastmem doesn't matter):
ELinks 0.17.GIT c09b5da405dab8e900b6c42a1d0b1dfd06a86f27-dirty
Built on Jul 24 2023 15:07:24
Features:
Standard, Fastmem, IPv6, gzip(1.2.13), bzip2(1.0.8), UTF-8, Periodic
Saving, Viewer (Search History, Timer, Marks), gettext (ELinks),
Cascading Style Sheets, Protocol (Authentication, File, FTP, HTTP, URI
rewrite, User protocols), SSL (GnuTLS), MIME (Option system, Mailcap,
Mimetypes files), LED indicators, Bookmarks, Cookies, Form History,
Global History, Goto URL History
Arch Linux:
ELinks 0.16.1.1
Built on Jul 24 2023 19:33:32
Features:
Standard, IPv6, gzip(1.2.13), bzip2(1.0.8), zstd(1.5.5), gpm(2.1.0),
UTF-8, Periodic Saving, Viewer (Search History, Timer, Marks), gettext
(ELinks), Cascading Style Sheets, Protocol (Authentication, File, CGI,
FTP, Gemini, HTTP, URI rewrite, User protocols), SSL (OpenSSL), MIME
(Option system, Mailcap, Mimetypes files), LED indicators, Bookmarks,
Cookies, Form History, Global History, Scripting (Lua), Goto URL History
@amanvm Could you confirm that the same bug (wrong UTF-8 letter) occurs in a FreeBSD VM? I have no access to such hardware, but I guess FreeBSD is similar to macOS in this case.
@rkd77 Just tested, it does not happen in a FreeBSD VM! I mean it renders correctly in FreeBSD and Linux, both tested in the same tmux session on macOS with defaults (no config). I also tried this with the default config (no config) on macOS, but the problem still persists, so it is not a config problem either... Mine is an aarch64 macOS machine, not sure if that affects anything. Searching "macos virtual machine on linux" shows a whole bunch of videos and guides...
@rkd77 One observation: unlike most other systems, where libs/include files are in standard directories like /usr/local or /usr, Homebrew on aarch64 macOS installs into /opt/homebrew. I ran otool -L (the macOS equivalent of the ldd command) and ldd on Linux and found that my macOS elinks build was missing a bunch of libs (it still links fewer). It didn't even link to libssl or libiconv. Updating the configure paths for those libs does link it to the respective libs, but things are still the same. Can you eyeball the linked libraries to see if anything more is needed?
On macOS I used this:
./configure --with-openssl=/opt/homebrew/Cellar/openssl@3/3.1.1_1/ --with-libiconv=/opt/homebrew/Cellar/libiconv/1.17/
macOS (otool -L /path/to/elinks):
% otool -L /usr/local/bin/elinks
/usr/local/bin/elinks:
/opt/homebrew/opt/tre/lib/libtre.5.dylib (compatibility version 6.0.0, current version 6.0.0)
/opt/X11/lib/libX11.6.dylib (compatibility version 11.0.0, current version 11.0.0)
/opt/homebrew/opt/openssl@3/lib/libssl.3.dylib (compatibility version 3.0.0, current version 3.0.0)
/opt/homebrew/opt/openssl@3/lib/libcrypto.3.dylib (compatibility version 3.0.0, current version 3.0.0)
/usr/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.8)
/opt/homebrew/opt/zlib/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.13)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
/usr/lib/libexpat.1.dylib (compatibility version 7.0.0, current version 8.0.0)
/usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1500.65.0)
Arch Linux (ldd /path/to/elinks):
% ldd /usr/bin/elinks
linux-vdso.so.1 (0x0000ffffaaf45000)
libtre.so.5 => /usr/lib/libtre.so.5 (0x0000ffffaacf0000)
libssl.so.3 => /usr/lib/libssl.so.3 (0x0000ffffaac20000)
libcrypto.so.3 => /usr/lib/libcrypto.so.3 (0x0000ffffaa780000)
liblua.so.5.4 => /usr/lib/liblua.so.5.4 (0x0000ffffaa720000)
libidn.so.12 => /usr/lib/libidn.so.12 (0x0000ffffaa6d0000)
libzstd.so.1 => /usr/lib/libzstd.so.1 (0x0000ffffaa600000)
libbz2.so.1.0 => /usr/lib/libbz2.so.1.0 (0x0000ffffaa5d0000)
libz.so.1 => /usr/lib/libz.so.1 (0x0000ffffaa5a0000)
libgpm.so.2 => /usr/lib/libgpm.so.2 (0x0000ffffaa580000)
libexpat.so.1 => /usr/lib/libexpat.so.1 (0x0000ffffaa540000)
libc.so.6 => /usr/lib/libc.so.6 (0x0000ffffaa380000)
/lib/ld-linux-aarch64.so.1 => /usr/lib/ld-linux-aarch64.so.1 (0x0000ffffaaf0c000)
libm.so.6 => /usr/lib/libm.so.6 (0x0000ffffaa2d0000)
libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x0000ffffaa240000)
An unsigned char -> char conversion introduced somewhere in the past is suspected. There are not too many released tarballs, so checking them one by one or by "bisection" could prove this hypothesis. If 0.13.0 fails, I have no idea.
@rkd77 Tested on 0.13.0, it's the same there! I wonder if there is any other Unicode shaping library that you could use, or how links is able to display these glyphs.
What if the charset is added in a meta tag?
<!DOCTYPE html>
<head>
<meta charset="utf-8"/>
<title>Test UTF-8 NBSP</title>
</head>
<body>
<p class="indent">Jaro Šnajdrov</p>
</body>
</html>
Also, what do the dumps look like: elinks -dump file.html and links -dump file.html ?
@rkd77 For 1) I get an error "Bad url syntax" with elinks, but not with links.
For 2) (the -dump option): same problem. The output from links has the correct S with caron; the dump from elinks has tofu there.
It's a little late here, so any other followup will be after a while. Thanks!
I guess it has something in common with the detection of encoding. If this ^ commit did not resolve it, I have no idea.
@rkd77 Didn't resolve it. One comment I have: a lot of Unicode seems to render just fine; it's only a subset that doesn't. If you could think of a patch that generates some kind of text log of interesting function arguments and return values for an input test document like this, I can volunteer to run it for sure.
@amanvm You can prepare test cases and save dumps (elinks --dump), and show a hex view of these dumps.
@rkd77 I'm not a Unicode/UTF-8 expert and we might end up doing a lot of back and forth that way. Don't you want to add an fprintf or two to some important functions that shape/process Unicode data, so there is faster convergence? A branch or patch with some fprintfs would help.
@amanvm There are many places where it can break. First I want to know what it "looks" like. In elinks: F9 -> File -> Save formatted document (save with a .txt extension). Please create a tarball with a few cases, original files and formatted documents. BTW, in one of the previous messages there was: ELinks 0.17.GIT c09b5da405dab8e900b6c42a1d0b1dfd06a86f27-dirty. "dirty" means that you modified the sources. What was changed?
Another question: how is plain text with this character rendered? Also "tofu" or OK?
@rkd77 OK, I have something for you! I used the test document from PragmataPro's repository, as that is one of the most comprehensive terminal fonts and the test file lists all of its glyphs with Unicode code points. There is a free clone called Pragmasevka; you are welcome to try viewing the documents with that.
There are 3 files and 2 screenshots attached here:
elinks -dump All_chars.txt
links -dump All_chars.txt
Files: All_chars.txt, All_chars_elinks.txt, All_chars_links.txt
Screenshots: 1) Compares All_chars.txt with All_chars_elinks.txt. As you can see, the first tofu appears at U+00C5. Compare the glyph near the cursor in the right window to the glyph in the left window.
2) Compares All_chars.txt with All_chars_links.txt. As you can see, no tofu appears at U+00C5; links doesn't have the problem. Compare the glyph near the cursor in the right window to the glyph in the left window.
More tofus can be seen on the screen, and by downloading the txt files you can see it for yourself.
Regarding your other question about "ELinks 0.17.GIT c09b5da-dirty": I just tried installing the Homebrew version from the head branch using brew install -s --HEAD, and --version on Homebrew's head build also has -dirty at the end. So something else is causing it, other than manual changes to files.
@amanvm On branch utf I added debug statements and test/chars.txt. Please compile, run elinks -dump chars.txt 2> log, and show the log.
@rkd77 Here we go, this is what the log file has:
goto charsets.c:758:utf8_to_unicode
charsets.c:751:utf8_to_unicode
Added more debug statements. Could you rerun the test? Which compiler is it?
@rkd77 Here's the output:
goto charsets.c:759:utf8_to_unicode
str[0]=195
str[1]=32
str[1] & 0xc0 = 0
charsets.c:751:utf8_to_unicode
str[0]=195
Clang, as gcc/g++ is actually clang/clang++ on Apple macOS. This is also the aarch64 (ARM64) version.
% /usr/bin/gcc --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
% /usr/bin/g++ --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
Added another commit to the utf branch. I disabled maybe_preformat_hook in dump to exclude it from the suspects. The second issue is str[1]=32 (space). In chars.txt I replaced the spaces with digits, so this time I guess instead of 32 there will be some digit. If there is no error, then the preformat hook is guilty. Please git pull, compile and rerun elinks -dump chars.txt
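To connect that to the log: a minimal sketch (illustration only, not the actual charsets.c code) of the two-byte UTF-8 validity check that str[0]=195, str[1]=32 fails:

```c
#include <stdio.h>

int main(void)
{
    /* 195 (0xC3) is the lead byte of a two-byte sequence; the next byte must
     * be a continuation byte of the form 10xxxxxx (0x80-0xBF). */
    unsigned char str[] = { 195, 32 };   /* the values from the log above */

    if ((str[0] & 0xE0) == 0xC0 && (str[1] & 0xC0) == 0x80) {
        unsigned int cp = ((str[0] & 0x1Fu) << 6) | (str[1] & 0x3Fu);
        printf("decoded U+%04X\n", cp);
    } else {
        /* For a space, str[1] & 0xC0 == 0, so the decoder gives up and a
         * replacement glyph (tofu) is shown: the real continuation byte was
         * turned into a space somewhere before decoding. */
        printf("invalid continuation byte, str[1] & 0xc0 = %d\n", str[1] & 0xC0);
    }
    return 0;
}
```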
@rkd77 It's the same stderr output as before:
goto charsets.c:759:utf8_to_unicode
str[0]=195
str[1]=32
str[1] & 0xc0 = 0
charsets.c:751:utf8_to_unicode
str[0]=195
stdout has this:
U+00C0 1À2Á3Â4Ã5Ä6�7Æ8Ç9È
@rkd77 One observation: if I open the document without -dump, the character renders correctly (no error is seen). Are you sure the path taken by the application for -dump is the same?
Actually, the error wasn't seen on a previous version of elinks either for this Å when not opened with -dump. But the error is still always seen with S with caron (Š). So there is some difference between -dump and non-dump behavior. Let's try the S with caron (Š) and its neighbors perhaps?
dump interprets the document as HTML, the normal view as plain text. You can check the latest commits and show the log. I'm slowly running out of ideas.
@amanvm, thanks, could you continue? I made a mistake in the commit log, but we are closer. git pull, compile, and the same log.
I guess isspace returns different results than on Linux. Warnings don't matter here, at least not yet. Please rerun the test.
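That hypothesis is plausible: isspace() on byte values above 127 is locale-dependent, and calling it with a negative plain char is outright undefined behavior, so different libc implementations can legitimately disagree. A minimal sketch of the kind of ASCII-only check that sidesteps the problem (a hypothetical helper for illustration, not the actual ELinks patch):

```c
#include <stdio.h>

/* Hypothetical helper, not the actual ELinks change: classify only the ASCII
 * whitespace characters, so bytes >= 0x80 (such as UTF-8 continuation bytes)
 * are never treated as spaces on any platform or locale. */
static int ascii_isspace(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n'
        || c == '\r' || c == '\f' || c == '\v';
}

int main(void)
{
    /* 0xA0 is the second byte of the UTF-8 NBSP (c2 a0); whether isspace()
     * counts it as whitespace varies by libc and locale, this helper never does. */
    printf("ascii_isspace(0xA0) = %d\n", ascii_isspace(0xA0));   /* prints 0 */
    return 0;
}
```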
I added code for isspace. Could you check whether it works? You can redirect stderr to /dev/null
@rkd77 It works well everywhere now! No tofus!
Btw, I would recommend mentioning somewhere that users should close other instances of elinks before they try a new version. If an old version is open, the old behavior persists for some reason even with the new binary; when I closed all old instances, the new binary's behavior kicked in. I know you have some socket file to communicate between elinks instances, but I'm not sure how it is being used; I couldn't find much info in the documentation.
This commit was added to the master branch. Likely more characters must be added to isspace.
In the docs there is info about sessions and elinks instances.