jgm / pandoc

Universal markup converter
https://pandoc.org

Man writer doesn't use UTF-8 encoding but escapes all non-Latin letters. #8507

Closed (van-de-bugger closed 1 year ago)

van-de-bugger commented 1 year ago

Consider an example:

$ cat test.md
Ελληνικά
========

српски հայերեն

The source Markdown file contains Greek, Cyrillic, and Armenian letters.

$ pandoc -s -t man < test.md > test.man

$ man -P cat ./test.man
()                                                           ()

Ελληνικά
       српски հայերեն

Pandoc converted the Markdown to a man page, and the rendered result is fine. However, let's have a look at the content of the .man file:

$ cat test.man
.\" Automatically generated by Pandoc 2.14.0.3
.\"
.TH "" "" "" "" ""
.hy
.SH \[*E]\[*l]\[*l]\[*y]\[*n]\[*i]\[*k]\[u03AC]
.PP
\[u0441]\[u0440]\[u043F]\[u0441]\[u043A]\[u0438]
\[u0570]\[u0561]\[u0575]\[u0565]\[u0580]\[u0565]\[u0576]

Look, all the non-Latin characters are represented as escape sequences. It is not a showstopper, since the rendered man page looks good, but every non-Latin character takes 5 bytes when groff has a named glyph for it (most of the Greek letters here, e.g. \[*E]) or 8 bytes when it is written as a Unicode escape (the Cyrillic and Armenian letters, and the accented Greek ά, e.g. \[u03AC]). If the characters were not escaped, each would occupy only 2 bytes in UTF-8. It is just a waste of space.
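To make the byte counts concrete, here is a small Python sketch (not part of pandoc) that compares the UTF-8 size of each word with the size of its escaped form. The escaped strings are copied from the generated test.man above; the helper name `byte_sizes` is made up for this illustration.

```python
# Compare the size of raw UTF-8 text with the size of the groff-escaped
# form that the man writer produced in test.man.

def byte_sizes(raw: str, escaped: str) -> tuple:
    """Return (bytes as UTF-8, bytes as ASCII escape sequences)."""
    return len(raw.encode("utf-8")), len(escaped.encode("ascii"))

# (raw text, escaped text copied from test.man)
samples = [
    ("Ελληνικά", r"\[*E]\[*l]\[*l]\[*y]\[*n]\[*i]\[*k]\[u03AC]"),
    ("српски", r"\[u0441]\[u0440]\[u043F]\[u0441]\[u043A]\[u0438]"),
    ("հայերեն", r"\[u0570]\[u0561]\[u0575]\[u0565]\[u0580]\[u0565]\[u0576]"),
]

for raw, escaped in samples:
    utf8_bytes, escaped_bytes = byte_sizes(raw, escaped)
    print(f"{raw}: {utf8_bytes} bytes as UTF-8, {escaped_bytes} bytes escaped")
```

For this input the escaped forms come out roughly three to four times larger: 43 vs. 16 bytes for the Greek heading, 48 vs. 12 for the Serbian word, and 56 vs. 14 for the Armenian word.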

Modern groff allows using UTF-8 encoding in source files:

$ cat test.man
.\" Automatically generated by Pandoc 2.14.0.3
.\"
.TH "" "" "" "" ""
.hy
.SH Ελληνικά
.PP
српски հայերեն

$ groff -D utf8 -m man -T utf8 < test.man
()                                                           ()
Ελληνικά
       српски հայերեն
                                                             ()

Thus, I request that the man writer output non-Latin characters as-is, without converting them to escape sequences.

Pandoc version:

$ pandoc --version
pandoc 2.14.0.3
Compiled with pandoc-types 1.22.1, texmath 0.12.3.3, skylighting 0.10.5.2,
citeproc 0.4.0.1, ipynb 0.1.0.1
User data directory: /home/vdb/.local/share/pandoc
Copyright (C) 2006-2021 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

This is not the latest available version. However, I scanned the pandoc release notes for the releases after 2.14.0.3, and it seems there were no changes in the man writer.

BTW, in Fedora 37, man pages in languages with non-Latin writing systems do not use escape sequences. For example, Serbian:

$ cat /usr/share/man/sr/man1/cat.1.gz | gunzip | head -n20
.\" -*- coding: UTF-8 -*-
.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.48.5.
.\"*******************************************************************
.\"
.\" This file was generated with po4a. Translate the source file.
.\"
.\"*******************************************************************
.TH CAT 1 "Августа 2022" "ГНУ coreutils 9.1" "Корисничке наредбе"
.SH НАЗИВ
cat \- concatenate files and print on the standard output
.SH УВОД
\fBcat\fP [\fI\,ОПЦИЈА\/\fP]... [\fI\,ДАТОТЕКА\/\fP]...
.SH ОПИС
.\" Add any additional description here
.PP
Надовежите ДАТОТЕКУ(Е) на стандардни излаз.
.PP
Без ДАТОТЕКЕ, или када је ДАТОТЕКА \-, чита стандардни улаз.
.TP 
\fB\-A\fP, \fB\-\-show\-all\fP

Or Japanese:

$ cat /usr/share/man/ja/man1/cat.1.gz | gunzip | head -n20
.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.47.13.
.TH CAT "1" "2021年5月" "GNU coreutils" "ユーザーコマンド"
.SH 名前
cat \- ファイルの内容を連結して標準出力に出力する
.SH 書式
.B cat
[\fI\,オプション\/\fR]... [\fI\,ファイル\/\fR]...
.SH 説明
.\" Add any additional description here
.PP
ファイル (複数可) の内容を結合して標準出力に出力します。
.PP
ファイルの指定がない場合や FILE が \- の場合, 標準入力から読み込みを行います。
.HP
\fB\-A\fR, \fB\-\-show\-all\fR           \fB\-vET\fR と同じ
.TP
\fB\-b\fR, \fB\-\-number\-nonblank\fR
空行以外に行番号を付ける。\-n より優先される
.HP
\fB\-e\fR                       \fB\-vE\fR と同じ

I do not know about other distros, though.

jgm commented 1 year ago

It used to be that UTF-8 in man pages was not reliably supported. Perhaps that situation has changed and we can revisit this. In any case, we could keep the present behavior when the --ascii option is used.
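A rough sketch of the behavior discussed here, written in Python rather than pandoc's Haskell, so it is only an illustration and not the actual writer code. The names `render_text` and `ascii_only` are hypothetical; `ascii_only` stands in for the effect of the --ascii option. The sketch only emits \[uXXXX] escapes and does not reproduce the named glyphs (such as \[*E]) or any other roff escaping a real writer needs.

```python
# Illustrative only: pass non-ASCII text through unchanged by default,
# and fall back to groff Unicode escapes when ASCII-only output is requested.

def groff_unicode_escape(ch: str) -> str:
    # Build an escape such as \[u0441] from a single character.
    # Hex digits are uppercase, with at least four digits.
    return "\\[u{:04X}]".format(ord(ch))

def render_text(text: str, ascii_only: bool = False) -> str:
    # Hypothetical helper; ascii_only models the --ascii option.
    if not ascii_only:
        return text  # emit UTF-8 as-is
    return "".join(
        ch if ord(ch) < 128 else groff_unicode_escape(ch)
        for ch in text
    )

print(render_text("српски"))
print(render_text("српски", ascii_only=True))
```

The first call prints the word unchanged, and the second prints the same escape sequence that appears in the generated test.man for that word.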