jaspervdj / patat

Terminal-based presentations using Pandoc
GNU General Public License v2.0

Invalid byte sequence #127

Closed klarkc closed 9 months ago

klarkc commented 1 year ago

When opening a Markdown file that contains the character, patat throws:

hGetContents: invalid argument (invalid byte sequence)

This character is pretty common; for example, it is used in GHC errors:

hello-exe-hello> src/hello.hs:23:35: error:
hello-exe-hello>     • Couldn't match type ‘Password’ with ‘BuiltinData’
hello-exe-hello>       Expected type: template-haskell-2.16.0.0:Language.Haskell.TH.Syntax.Q
hello-exe-hello>                        (template-haskell-2.16.0.0:Language.Haskell.TH.Syntax.TExp
hello-exe-hello>                           (PlutusTx.Code.CompiledCode
hello-exe-hello>                              (BuiltinData -> BuiltinData -> BuiltinData -> ())))
hello-exe-hello>         Actual type: th-compat-0.1.4:Language.Haskell.TH.Syntax.Compat.SpliceQ
hello-exe-hello>                        (PlutusTx.Code.CompiledCode
hello-exe-hello>                           (Password -> Password -> BuiltinData -> ()))
hello-exe-hello>     • In the expression: compile [|| validator ||]
hello-exe-hello>       In the Template Haskell splice $$(compile [|| validator ||])
hello-exe-hello>       In the first argument of ‘mkValidatorScript’, namely
hello-exe-hello>         ‘$$(compile [|| validator ||])’

klarkc commented 1 year ago

It is worth mentioning that another character from the previous example also does not work and throws the same error.

jaspervdj commented 1 year ago

I think this is an encoding / system configuration error. If I copy the above to test.md, I get:

jasper@taiyaki ~/P/patat (master)> file test.md
test.md: UTF-8 Unicode text
jasper@taiyaki ~/P/patat (master)> patat --dump test.md
(works)
jasper@taiyaki ~/P/patat (master)> echo $LANG
en_US.UTF-8

However, if I set LANG to something else like C, or unset it, I get:

jasper@taiyaki ~/P/patat (master)> LANG=C patat --dump test.md
patat: test.md: hGetContents: invalid argument (invalid byte sequence)
jasper@taiyaki ~/P/patat (master)> LANG= patat --dump test.md
patat: test.md: hGetContents: invalid argument (invalid byte sequence)

If you are using UTF-8 in files, you should update your system locale to support this (or call patat with a compatible locale set).
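
To check which encoding a GHC-compiled program derives from the environment, a minimal sketch along these lines may help. It uses only `localeEncoding` and `hGetEncoding` from `System.IO` in base; the helper name `describeLocaleEncoding` is made up for illustration and is not part of patat.

```haskell
import System.IO

-- | Name of the text encoding GHC derived from the locale
-- (LANG, LC_ALL, ...). New text-mode handles use this by default.
-- Hypothetical helper, for illustration only.
describeLocaleEncoding :: String
describeLocaleEncoding = show localeEncoding

main :: IO ()
main = do
    -- Only ASCII is printed here, so this is safe under any locale.
    putStrLn ("locale encoding: " ++ describeLocaleEncoding)
    enc <- hGetEncoding stdin
    -- Nothing means the handle is in binary mode.
    putStrLn ("stdin encoding:  " ++ maybe "binary" show enc)
```

Running the compiled program once with `LANG=C` and once with `LANG=en_US.UTF-8` should report different encoding names, which would explain why the same file decodes in one case and fails in the other.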


That configuration error aside, in 2023 we can probably assume .md files are encoded in UTF-8, so I can make that the default.
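
Making UTF-8 the default could look roughly like the sketch below. This is not patat's actual code: `readFileUtf8` and the file name `sample.md` are hypothetical, and the only assumption is base's `hSetEncoding` and `utf8` from `System.IO`.

```haskell
import Control.Exception (evaluate)
import System.IO

-- | Read a whole file as UTF-8, whatever encoding the locale suggests.
-- Hypothetical helper, not taken from patat.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = withFile path ReadMode $ \h -> do
    hSetEncoding h utf8               -- override the locale-derived default
    contents <- hGetContents h        -- lazy read
    _ <- evaluate (length contents)   -- force decoding before the handle closes
    return contents

main :: IO ()
main = do
    -- Write the three UTF-8 bytes of U+2022 (bullet) in binary mode,
    -- so the demo itself does not depend on the locale.
    withBinaryFile "sample.md" WriteMode $ \h ->
        hPutStr h "\xE2\x80\xA2 bullet\n"
    s <- readFileUtf8 "sample.md"
    -- 'print' goes through 'show', which escapes non-ASCII characters,
    -- so the output is safe even when stdout is ASCII-only.
    print s
```

With the handle encoding set explicitly, the read should succeed even when run as `LANG=C`, because the locale is no longer consulted for this handle.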

klarkc commented 1 year ago

Hmm, weird: file reports my file as ASCII text, and my $LANG is en_US.utf-8. I wonder what this means. I am using vim to create the files; it is set to utf-8, but it still creates an ASCII text file. I even tried using iconv to convert from ASCII to UTF-8, but nothing changed.

klarkc commented 1 year ago

Actually, if I copy the example characters, file shows a different encoding. I believe it reports the closest encoding for the given characters. I tried with both nano and vim.

file reports: Unicode text, UTF-8 text

klarkc commented 1 year ago

I also tried changing my LANG to en_US.UTF-8 (uppercase), just in case, but nothing changed.

These are all my available locales:

$ localectl list-locales
C.UTF-8
en_GB.UTF-8
en_US.UTF-8
pt_BR.UTF-8
pt_PT.UTF-8

My desktop locale differs from my terminal locale, because I prefer to use en_US in terminal apps:

$ localectl status
System Locale: LANG=pt_BR.UTF-8
    VC Keymap: br-abnt2
   X11 Layout: br
    X11 Model: abnt2
$ echo $LANG
en_US.UTF-8

jaspervdj commented 9 months ago

I added a fallback to UTF-8 if file decoding fails in the latest release, v0.9.1.0. That should fix this issue, feel free to re-open if it doesn't.
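
A fallback of this kind can be sketched as follows. This is an illustration under stated assumptions, not the code shipped in v0.9.1.0: `readStrict`, `readWithUtf8Fallback`, and `sample.md` are hypothetical names, and the sketch assumes the decoding failure surfaces as an `IOException`, as the error message reported above suggests.

```haskell
import Control.Exception (IOException, evaluate, try)
import System.IO

-- | Strictly read a file, using the given encoding or, for Nothing,
-- the handle's locale-derived default. Hypothetical helper.
readStrict :: Maybe TextEncoding -> FilePath -> IO String
readStrict menc path = withFile path ReadMode $ \h -> do
    mapM_ (hSetEncoding h) menc
    contents <- hGetContents h
    _ <- evaluate (length contents)   -- force decoding inside this action
    return contents

-- | Try the locale encoding first; if reading fails, retry as UTF-8.
-- Any IOException triggers the retry, so a missing file simply fails
-- again on the second attempt with the original kind of error.
readWithUtf8Fallback :: FilePath -> IO String
readWithUtf8Fallback path = do
    firstTry <- try (readStrict Nothing path) :: IO (Either IOException String)
    case firstTry of
        Right s -> return s
        Left _  -> readStrict (Just utf8) path

main :: IO ()
main = do
    -- UTF-8 bytes of U+2022 (bullet), written in binary mode.
    withBinaryFile "sample.md" WriteMode $ \h ->
        hPutStr h "\xE2\x80\xA2 bullet\n"
    s <- readWithUtf8Fallback "sample.md"
    print s   -- 'show' escapes non-ASCII, so this is safe under LANG=C
```

Under a UTF-8 locale the first attempt succeeds; under `LANG=C` the first attempt should fail on the non-ASCII bytes and the UTF-8 retry takes over. The fallback only helps when the locale decoder rejects the bytes: a single-byte locale such as Latin-1 would accept them and yield mis-decoded text instead.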