sharkdp / bat

A cat(1) clone with wings.
Apache License 2.0

Some utf8 output from man appears as escaped bytes #2479

Open grodin opened 1 year ago

grodin commented 1 year ago

What steps will reproduce the bug?

  1. Execute man bat (or another man page) with MANPAGER="sh -c 'col -bx | bat -l man -p'" on a terminal narrow enough that man hyphenates some words.

What happens?

man output contains escaped bytes, such as:

It also communicates with git(1) to show  modifications  with  re\xe2\x80\x90
       spect  to  the  git  index

What did you expect to happen instead?

The output should be:

It also communicates with git(1) to show  modifications  with  re‐
spect  to  the  git  index
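For reference, the escaped bytes in the broken output are the UTF-8 encoding of the hyphen shown in the expected output (U+2010 HYPHEN). A minimal sketch to confirm this, assuming a POSIX shell with od and tr available; hyphen_hex is a hypothetical helper name, not part of bat or man:

```shell
# Print the bytes of U+2010 (HYPHEN) as hex.
# Octal 342 200 220 is the portable printf spelling of hex e2 80 90.
hyphen_hex() {
    printf '\342\200\220' | od -An -tx1 | tr -d ' \n'
    echo
}

hyphen_hex   # prints e28090, matching \xe2\x80\x90 in the broken output
```

So the character itself is intact; something in the pipeline is printing its bytes as text instead of passing them through.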

How did you install bat?

Occurs with bat v0.22.1 installed by brew on Ubuntu 22.04 and v0.19 installed on the same system via apt.

bat version and environment

> bat --diagnostic

Software version

bat 0.22.1

Operating system

Linux 5.15.0-60-generic

Command-line

bat --diagnostic 

Environment variables

SHELL=/usr/bin/zsh
PAGER=less
LESS=-R
LANG=en_GB.UTF-8
LC_ALL=<not set>
BAT_PAGER=<not set>
BAT_CACHE_PATH=<not set>
BAT_CONFIG_PATH=<not set>
BAT_OPTS=<not set>
BAT_STYLE=<not set>
BAT_TABS=<not set>
BAT_THEME=<not set>
XDG_CONFIG_HOME=<not set>
XDG_CACHE_HOME=<not set>
COLORTERM=truecolor
NO_COLOR=<not set>
MANPAGER='sh -c '\''col -bx | bat -l man -p'\'''

System Config file

Could not read contents of '/etc/bat/config': No such file or directory (os error 2).

Config file

Could not read contents of '/home/jscdev/.config/bat/config': No such file or directory (os error 2).

Custom assets metadata

Could not read contents of '/home/jscdev/.cache/bat/metadata.yaml': No such file or directory (os error 2).

Custom assets

'/home/jscdev/.cache/bat' not found

Compile time information

Less version

> less --version 
less 590 (GNU regular expressions)
Copyright (C) 1984-2021  Mark Nudelman

less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution,
see the file named README in the less distribution.
Home page: https://greenwoodsoftware.com/less

More details

A partial workaround I've discovered is to run man with --no-hyphenation (or --nh), but some Unicode code points still show up escaped in the output. Here's a snippet of MANPAGER="sh -c 'col -bx | bat -l man -p'" man --nh man

See the \xe2\x80\x9cWarnings\xe2\x80\x9d node in info groff

and then MANPAGER="bat -A" man --nh man

··See·the·\u{201c}Warnings\u{201d}·node·in·i␈in␈nf␈fo␈o·g␈gr␈ro␈of␈ff␈f·

I've checked that it's not caused by less by running with BAT_PAGER set and empty.

Terminal emulator is alacritty, but I can't see what difference that would make.
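One way to narrow down which stage introduces the escapes is to check each stage's output for literal \xNN text. This is only a sketch: has_escaped_bytes is a hypothetical helper, and the commented pipelines assume man honours MANPAGER as in the report above.

```shell
# Succeeds if stdin contains a literal backslash-x-hex-hex sequence,
# i.e. bytes that were printed as text instead of being passed through.
has_escaped_bytes() {
    grep -q '\\x[0-9a-f][0-9a-f]'
}

# Self-check with sample lines taken from the report:
printf 're\\xe2\\x80\\x90\n' | has_escaped_bytes && echo "escaped bytes found"
printf 'respect to the git index\n' | has_escaped_bytes || echo "clean"

# Possible use against real man output, one stage at a time:
#   MANPAGER="sh -c 'grep Warnings | od -c'" man --nh man            # raw man output
#   MANPAGER="sh -c 'col -bx | grep Warnings | od -c'" man --nh man  # after col
```

If the raw output is clean and the output after col is not, the escaping happens in col rather than in bat.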

keith-hall commented 1 year ago

Thanks for reporting. Interestingly, I haven't been able to replicate this. For a concrete example, I tried with xfce4-terminal v0.8.10 sized 52 x 24.

SaElAh commented 1 year ago

Similar problem for me on Kitty (Fedora 36 on Sway).

1mNAME0m
       ls - list directory contents

1mSYNOPSIS0m
       1mls 22m[4mOPTION24m]... [4mFILE24m]...

1mDESCRIPTION0m
       List  information about the FILEs (the current directory by default).  Sort entries alphabetically if none of 1m-cftuvSUX 22mnor 1m--sort 22mis
       specified.

sharkdp commented 1 year ago

@SaElAh Yours looks like a different problem with ANSI escape codes, not with unicode characters. Please search the issue tracker if this has been reported and open a new ticket otherwise.

sharkdp commented 1 year ago

I cannot reproduce this either.

What is your locale? Maybe it's related to that?

Seems like I don't even get Unicode characters in the first place:

▶ LANG=C MANPAGER="sh -c 'col -bx | grep Warnings | hexdump -C'" man man
00000000  20 20 20 20 20 20 20 20  20 20 20 20 20 20 64 65  |              de|
00000010  66 61 75 6c 74 20 69 73  20 22 6d 61 63 22 2e 20  |fault is "mac". |
00000020  20 53 65 65 20 74 68 65  20 22 57 61 72 6e 69 6e  | See the "Warnin|
00000030  67 73 22 20 6e 6f 64 65  20 69 6e 20 69 6e 66 6f  |gs" node in info|
00000040  20 67 72 6f 66 66 20 20  66 6f 72 20 20 61 20 20  | groff  for  a  |
00000050  6c 69 73 74 20 20 6f 66  20 20 61 76 61 69 6c 61  |list  of  availa|
00000060  62 6c 65 20 20 77 61 72  6e 69 6e 67 0a           |ble  warning.|
0000006d

I have groff 1.22.4

ChocolateOverflow commented 1 year ago

Currently also an issue on Alacritty on Arch Linux (running man ls)

christoph-heinrich commented 1 year ago

@ChocolateOverflow Have you tried MANROFFOPT="-c" as suggested in the readme? I had the same problem and this helped.

ChocolateOverflow commented 1 year ago

@christoph-heinrich Yeah MANROFFOPT="-c" seems to fix my issue.
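For anyone landing here, the setup being discussed looks roughly like this in a shell profile. A sketch only, combining the MANROFFOPT value suggested above with the MANPAGER from the original report; adjust to your own shell:

```shell
# Passed by man to the roff formatter; with groff, -c disables
# colour/SGR escape output so bold and underline use overstriking instead.
export MANROFFOPT="-c"

# Same pager pipeline as in the original report:
# col -bx strips the overstriking, bat highlights the result.
export MANPAGER="sh -c 'col -bx | bat -l man -p'"
```

Both variables only take effect for man invocations started after they are exported.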

grodin commented 1 year ago

Sorry for not replying to this for ages!

I have LANG=en_GB.UTF-8.

Running LANG="C" MANPAGER="sh -c 'col -bx | bat -l man -p'" man man displays the expected output, so it's clearly a locale-related issue. I'm not sure whether needing LANG="C" is expected, but aliasing man='LANG=C man' is a usable workaround.
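The alias can also be written as a shell function, which keeps the locale override scoped to man. A minimal sketch, assuming LC_ALL and LC_CTYPE are unset as in the diagnostics above (if either is set, it takes precedence over LANG and this will not change anything):

```shell
# Wrap man so that only man itself runs with LANG=C.
# "command" bypasses this function and runs the real man binary.
man() {
    LANG=C command man "$@"
}
```

Note that this also switches man's own messages and any localized pages to the C locale.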

DabeDotCom commented 6 months ago

FWIW, I stumbled across something similar today — and I can see how, in *my* case, the Problem Exists Between the Keyboard And Chair...

(I'm just documenting it here in case it helps anybody else, as well as for posterity — i.e., when I run into the same problem again in six months, this will show up when I Google it, hehe!)

Anyway, I'm used to doing, e.g.:

# Run a command and save its output:
bash% someCmd > /tmp/out.1

# Then making some changes and re-running:
bash% someCmd > /tmp/out.2

# So I can:
bash% diff /tmp/out.{1,2}

# Which was fine until I ran:
bash% less /tmp/out.1

As a Minimal Reproducible Example, say I have two files named, e.g., /tmp/one.1 and /tmp/two.2:

bash% printf '\033[31mRed\033[m\n' > /tmp/one.1
bash% od -c /tmp/one.1
0000000  033   [   3   1   m   R   e   d 033   [   m  \n
0000014

### Prepend a backslash...
bash% printf '\\\033[31mRed\033[m\n' > /tmp/two.2
bash% od -c /tmp/two.2
0000000    \ 033   [   3   1   m   R   e   d 033   [   m  \n
0000015

Note that /tmp/two.2 is the same as /tmp/one.1 except it has a preceding \ backslash before the escape character...

Now, if I run:

###  Sanitize environment...
bash% unset BAT_STYLE BAT_THEME; export BAT_CONFIG_PATH=/dev/null

bash% cat /tmp/one.1 | bat    # Works
───────┬──────────────────────────────────
       │ STDIN
───────┼──────────────────────────────────
   1   │ Red
───────┴──────────────────────────────────

bash% cat /tmp/two.2 | bat    # Works
───────┬──────────────────────────────────
       │ STDIN
───────┼──────────────────────────────────
   1   │ \Red
───────┴──────────────────────────────────

bash% bat /tmp/one.1          # Works
───────┬──────────────────────────────────
       │ File: /tmp/one.1
───────┼──────────────────────────────────
   1   │ Red
───────┴──────────────────────────────────

bash% bat /tmp/two.2          # Not What I Was Expecting!
───────┬──────────────────────────────────
       │ File: /tmp/two.2
───────┼──────────────────────────────────
   1   │ \[0m[31mRed
───────┴──────────────────────────────────

### However...
bash% bat -l txt /tmp/two.2   # Works
───────┬──────────────────────────────────
       │ File: /tmp/two.2
───────┼──────────────────────────────────
   1   │ \Red
───────┴──────────────────────────────────

### And...
bash% cat /tmp/one.1 | bat -l troff    # Gives The Funny Output 💡
───────┬──────────────────────────────────
       │ File: STDIN
───────┼──────────────────────────────────
   1   │ \[0m[31mRed
───────┴──────────────────────────────────

So my mistake was using .<digit> for something other than "nroff -man" files! «grin»
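One way to sidestep this is to avoid numeric extensions for scratch files, so bat has no reason to treat them as man/troff sources. A sketch under that assumption; save_out is a hypothetical helper and the out-LABEL.txt naming is just an illustration:

```shell
# Run a command and save its output under a .txt name instead of .<digit>.
# $1 is a label for the file; the remaining arguments are the command to run.
save_out() {
    label=$1
    shift
    "$@" > "${TMPDIR:-/tmp}/out-$label.txt"
}

# Example: the same red text as in /tmp/one.1 above.
save_out demo printf '\033[31mRed\033[m\n'
```

For files that already exist with a .<digit> name, passing -l txt as shown above has the same effect.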


PS — I will add that nroff -man /usr/share/man/man1/bash.1 | bat -l man gives me some funny:

SEE ALSO
       Bash Reference Manual, Brian Fox and Chet Ramey
       The Gnu Readline Library, Brian Fox and Chet Ramey
       The Gnu History Library, Brian Fox and Chet Ramey
       Portable Operating System Interface [0m4m(POSIX) Part 2: Shell and Utili‐
       ties, IEEE
       sh[0m24m(1), ksh[0m24m(1), csh[0m24m(1)
_______emacs[0m24m(1), vi[0m24m(1)
_______readline[0m24m(3)

output under macOS... mandoc does better there, but only colors "SEE" instead of "SEE ALSO" — and the latter does *not* like the x^Hx pseudo-bold hack! — but I don't trust my understanding of -l man to know whether or not *I'm* the one doing it wrong... again! :-}

eth-p commented 6 months ago

> PS — I will add that nroff -man /usr/share/man/man1/bash.1 | bat -l man gives me some funny:
>
> SEE ALSO
>        Bash Reference Manual, Brian Fox and Chet Ramey
>        The Gnu Readline Library, Brian Fox and Chet Ramey
>        The Gnu History Library, Brian Fox and Chet Ramey
>        Portable Operating System Interface [0m4m(POSIX) Part 2: Shell and Utili‐
>        ties, IEEE
>        sh[0m24m(1), ksh[0m24m(1), csh[0m24m(1)
> _______emacs[0m24m(1), vi[0m24m(1)
> _______readline[0m24m(3)

I did some investigating a week ago, and it appears the man/nroff/groff implementation used by most Linux distros has switched to emitting ANSI escape sequences by default instead of overtyping (the pseudo-bold hack).

The man syntax definition doesn't handle ANSI escape sequences and bat's ANSI parsing doesn't work across highlighting regions, which is likely why you're encountering broken sequences. You'll want to pass -c to nroff to have it revert back to using overtyping.

> output under macOS... mandoc does better there, but only colors "SEE" instead of "SEE ALSO" — and the latter does *not* like the x^Hx pseudo-bold hack! — but I don't trust my understanding of -l man to know whether or not *I'm* the one doing it wrong... again! :-}

macOS's mandoc still uses overtyping by default, which is why it behaves a bit better there. I'm on my phone, so I can't test this myself, but try piping into col -bx before piping to bat. That will remove the overtyping, which should help determine if your issue is caused by the backspace character.
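To illustrate what overtyping looks like and what removing it means: bold "NAME" is sent as each letter, a backspace, and the letter again. The sketch below imitates the clean-up with sed for this simple case; strip_overstrike is a hypothetical helper, and col -bx remains the tool to use in the real pipeline.

```shell
# Drop every "character + backspace" pair, keeping the final character
# written at each position.
strip_overstrike() {
    bs=$(printf '\b')
    sed "s/.$bs//g"
}

# Bold "NAME" as emitted with overtyping: N^HN A^HA M^HM E^HE
printf 'N\bNA\bAM\bME\bE\n' | strip_overstrike   # prints NAME

# The suggestion above as a full pipeline (needs nroff, col and bat installed):
#   nroff -c -man /usr/share/man/man1/bash.1 | col -bx | bat -l man -p
```

If the output is clean after this step, the remaining breakage comes from the backspaces or escape sequences rather than from the page content itself.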