pandoc / lua-filters

A collection of lua filters for pandoc
MIT License

`local fh = io.open(line)` Unable to read non-latin path and filenames #268

Closed wenbopeng closed 8 months ago

wenbopeng commented 1 year ago

`local fh = io.open(line)` is unable to read non-Latin paths and file names.

https://github.com/pandoc/lua-filters/blame/2aa98bfda556c7d4dfb8e30c20b318b6fd1f5091/include-files/include-files.lua#L86

jgm commented 1 year ago

Please be more specific: give a short example that reproduces the problem with full instructions.

cpkio commented 1 year ago

I don't think this is entirely true, @wenbopeng. A Cyrillic path reads successfully with `lua51 test.lua` on Windows:

local fh = io.open('R:/Проверка/Тестовая директория/мой файл.txt')
print(
  fh:read()
)

It may still be true for Chinese or other more complex scripts.

wenbopeng commented 1 year ago

Yes, my statement may be wrong. To be precise, it does not support Chinese file names and paths: UTF-8 is not accepted, and io.open only supports ANSI. A method that might work is converting UTF-8 -> UTF-16, then UTF-16 -> ANSI.

see: how can i use io.open to open a unicode path in lua - Stack Overflow, 2023-07-12 09:49

cpkio commented 1 year ago

@wenbopeng This is clearly OS-dependent. I can successfully read a file with Chinese glyphs in its path:

local fh = io.open('R:/测试/测试.txt')
print(
  fh:read()
)

I copy-pasted the glyphs, made a directory, and created a file with the same name & .txt… Lua 5.1

wenbopeng commented 1 year ago

I use include-files.lua with the following code block in my Markdown file:

````markdown
content....

```{.include}
D:/测试.md
```

content....
````

pandoc reports an error: `Pandoc warnings:Cannot open file D:/测试.md | Skipping includes`

However, if I use the following block instead:

````markdown
content....

```{.include}
D:/test.md
```

content....
````

the result is exactly right.

Lua 5.4.4  Copyright (C) 1994-2022 Lua.org, PUC-Rio
Embedded in pandoc 3.1.4

wenbopeng commented 1 year ago

> @wenbopeng This is clearly OS-dependent. I can successfully read a file with Chinese glyphs in its path:
>
> local fh = io.open('R:/测试/测试.txt')
> print(
>   fh:read()
> )
>
> I copy-pasted the glyphs, made a directory, and created a file with the same name & .txt… Lua 5.1

[image]

cpkio commented 1 year ago

I dunno what to say. I use ConEmu on Windows x64 with UTF-8 enabled.

user@DESKTOP-ILKGP6O 16:39:01 R:\ $ pandoc lua
Lua 5.4.4  Copyright (C) 1994-2022 Lua.org, PUC-Rio
Embedded in pandoc 3.1.2
> fh = io.open('R:/测试.txt')
> print(fh)
file (00007ffded19fa90)
> print(fh:read())
dfsdfsdf
>

jgm commented 8 months ago

I'm going to close this, pending more useful information, because it looks like there is a way to do this; the issue is something in OP's setup.

bpj commented 8 months ago

Expecting non-ASCII file names to work in Lua is expecting a bit much. Lua basically brags about being bytes-only, Pandoc expects UTF-8 input, and the OS encoding may be anything. The only reliable fix is to rename the file to ce4shi4.txt or, since that means "test", to romanize/Anglicize the name of the actual file as appropriate. I use a more-than-ASCII language, but I don't expect non-ASCII file/directory names to work when dealing with command-line programs. It means some strictures, and sometimes you have to temporarily copy things to ASCII names.

alerque commented 8 months ago

You said you copied and pasted the file name, but were the copy and paste operations done in the same app? If you copied from Explorer and pasted into a terminal, or copied from a terminal and pasted into an editor, or used some other mismatched combination, it is quite likely that the encoding of the characters is different. Just because two apps visually show the same file name doesn't mean they are doing so using the same encoding, and potentially neither matches 1-for-1 how the file system has encoded the name.

You might try using Lua itself to list the files in the directory and opening them. That will almost certainly get you a byte string representation that you can turn around and re-use in your include.
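
A minimal sketch of that suggestion, assuming the script is run with `pandoc lua` so that `pandoc.system.list_directory` is available. The helpers `hex_bytes` and `dump_directory` are illustrative names, not part of include-files.lua. It prints each file name next to its raw bytes, so the name Lua sees can be compared with the one typed into the include block. Whether the listed bytes are the ones `io.open` accepts on a given Windows setup is something to check rather than assume.

```lua
-- Hypothetical helper: render a string as space-separated hex bytes.
local function hex_bytes(s)
  local out = {}
  for i = 1, #s do
    out[#out + 1] = string.format('%02X', s:byte(i))
  end
  return table.concat(out, ' ')
end

-- Hypothetical helper: print every entry of `dir` with its raw bytes.
-- Relies on pandoc's Lua interpreter; does nothing under plain Lua.
local function dump_directory(dir)
  if not (pandoc and pandoc.system) then return end
  for _, name in ipairs(pandoc.system.list_directory(dir)) do
    print(name, hex_bytes(name))
  end
end

-- List the current directory; replace '.' with the directory in question.
dump_directory('.')
```

If the bytes printed for a name differ from the bytes of the path written in the Markdown file, the two are not the same string as far as Lua is concerned, even when they look identical on screen.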

tarleb commented 4 months ago

Use io.open with pandoc.text.toencoding to make it work with non-UTF-8 filesystems.

https://pandoc.org/lua-filters#pandoc.text.toencoding
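
A minimal sketch of how this could look, assuming a pandoc version that provides `pandoc.text.toencoding` and that calling it without an encoding argument converts to the system's default encoding, as the linked documentation describes. The helper name `open_utf8_path` is hypothetical and is not part of include-files.lua.

```lua
-- Hypothetical helper: open a file whose path is given as a UTF-8 string.
-- When pandoc.text.toencoding is available, the path is first converted
-- to the system's default encoding (assumed behavior of the one-argument
-- form); otherwise the path is passed to io.open unchanged.
local function open_utf8_path(path, mode)
  local text = pandoc and pandoc.text
  if text and text.toencoding then
    -- pcall keeps the original path if the conversion is not possible.
    local ok, converted = pcall(text.toencoding, path)
    if ok and type(converted) == 'string' then
      path = converted
    end
  end
  return io.open(path, mode or 'r')
end
```

In a filter, a call such as `open_utf8_path(line)` would take the place of `io.open(line)`. Like `io.open`, it returns nil plus an error message when the file cannot be opened.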