strayge / pylnk

Python library for reading and writing Windows shortcut files (.lnk). Python 3 only.
GNU Lesser General Public License v3.0

Diacritic characters are truncated #20

Open andry81 opened 2 years ago

andry81 commented 2 years ago

As noted here: https://stackoverflow.com/questions/39365489/how-do-you-keep-diacritics-in-shortcut-paths

The WScript.Shell implementation does not support diacritic characters in a Unicode string for the TargetPath shortcut property. This module has the same issue:

c:\1.txt.lnk -> c:\ööö\1.txt

>pylnk p c:\1.txt.lnk _link_info._path
c:\ooo\1.txt

But the WorkingDirectory property is not affected:

>pylnk p c:\1.txt.lnk _work_dir
c:\ööö

I've compared it with the https://github.com/Matmaus/LnkParse3 implementation, which returns more reliable results:

>lnkparse c:\1.txt.lnk
...

   LINK INFO:
      Link info flags: 1
      Local base path: C:\ooo\1.txt
      Common path suffix:
      Local base unicode: C:\ööö\1.txt
      Common path suffix unicode: .\ööö\1.txt6C:\ööödz
...

   DATA
      Relative path: .\ööö\1.txt
      Working directory: C:\ööö

Whereas pylnk3 does not:

>pylnk3 p c:\1.txt.lnk _link_info.local_base_path
C:\ooo\1.txt

I've tried to change the code:

#DEFAULT_CHARSET = 'cp1251'
DEFAULT_CHARSET = 'utf-8'

But it still returns the truncated variant. It seems the app reads only one property field (Ansi) instead of two (Ansi+Unicode), as LnkParse3 does.
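
For reference, here is a minimal sketch of reading both variants. It assumes the LinkInfo layout described in the MS-SHLLINK specification, where the optional Unicode offsets are present only when LinkInfoHeaderSize is at least 0x24. The function name `read_local_base_path` and its `ansi_encoding` parameter are hypothetical and not part of pylnk3:

```python
import struct


def read_local_base_path(link_info: bytes, ansi_encoding: str = "cp1252") -> str:
    """Return LocalBasePath from a raw LinkInfo blob, preferring the Unicode copy."""
    # Fixed part of the header: seven little-endian uint32 fields.
    (_size, header_size, flags, _volume_id_off,
     ansi_off, _net_off, _suffix_off) = struct.unpack_from("<7I", link_info, 0)
    if not flags & 0x1:  # VolumeIDAndLocalBasePath not set: no local path stored
        return ""
    if header_size >= 0x24:
        # Optional LocalBasePathOffsetUnicode lives at offset 0x1C.
        (unicode_off,) = struct.unpack_from("<I", link_info, 0x1C)
        if unicode_off:
            end = unicode_off
            # Scan UTF-16LE code units up to the two-byte NUL terminator.
            while link_info[end:end + 2] not in (b"\x00\x00", b""):
                end += 2
            return link_info[unicode_off:end].decode("utf-16-le")
    # Fall back to the ANSI copy: a NUL-terminated string in some code page
    # (ansi_encoding is an assumption, the file does not record it).
    end = link_info.index(b"\x00", ansi_off)
    return link_info[ansi_off:end].decode(ansi_encoding)
```

When the Unicode copy exists, the code page guess is never needed. It only matters for files with the shorter 0x1C header.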

strayge commented 2 years ago

Hi, thanks for the interesting issue.

The LinkInfo structure contains only one field with the path.
It looks like it can be encoded as utf-8.

Can you check the diacritic_characters branch with a possible fix?

andry81 commented 2 years ago

Can you check the diacritic_characters branch with a possible fix?

c:\Work\OpenSource\pylnk\diacritic_characters>c:\python\x64\310\python
Python 3.10.1 (heads/3.10.1-win7:830a41fd9d, Dec 12 2021, 11:29:02) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pylnk3
>>> lnk = pylnk3.Lnk('d:\\1.txt.lnk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 1504, in __init__
    self._parse_lnk_file(f)
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 1555, in _parse_lnk_file
    self._link_info = LinkInfo(lnk, unicode=self.link_flags.IsUnicode)
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 994, in __init__
    self._parse_path_elements(lnk)
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 1026, in _parse_path_elements
    self.local_base_path = read_cstring(lnk, encoding=self.encoding)
  File "c:\Work\OpenSource\pylnk\diacritic_characters\pylnk3.py", line 186, in read_cstring
    return s.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 3: invalid start byte

I just created the d:\ööö directory and the d:\ööö\1.txt file, then used Ctrl+C and the Windows Explorer context menu to paste it as a shortcut into d:\1.txt.lnk.

strayge commented 2 years ago

Can you share the broken lnk file?

I can't reproduce it myself.
Win10 En writes a utf-8 path with diacritics into LinkInfo.
Win7 Ru writes an ascii path (converting diacritic symbols to the closest Latin ones) into LinkInfo.
Both are read without errors with the new branch.

andry81 commented 2 years ago

Can you share the broken lnk file?

1_buggy.txt.zip

I suspect it is some encoding other than utf-8.

By the way, 0xf6 is the code of the ö character.
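
A minimal sketch of one way to tolerate such files: try utf-8 first and fall back to a single-byte ANSI code page. The helper name `decode_path` and the `cp1252` default are assumptions made for illustration, not pylnk3 behavior:

```python
def decode_path(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode a path as UTF-8, falling back to a single-byte ANSI code page."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0xf6 byte is never valid UTF-8, so paths like the one in the
        # traceback above end up here. The fallback itself can still raise
        # if the byte is undefined in the chosen code page.
        return raw.decode(fallback)
```

With this, `decode_path(b"d:\\\xf6\xf6\xf6\\1.txt")` yields `d:\ööö\1.txt`, and a genuine utf-8 path is returned unchanged. The heuristic is not watertight: some ANSI byte sequences happen to be valid utf-8 and would be decoded as such.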

strayge commented 2 years ago

It's cp1252, but I don't know how to choose the correct encoding.

andry81 commented 2 years ago

it's cp1252

How did you find that? There are at least 3 other code pages that give the same result: 1250, 1257, 1258.

, but I don't know how to choose the correct encoding

You could add a --chcp <str> parameter or something similar for that, plus an --ignore-decode-errors option to call decode(..., errors='ignore') instead.
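
A rough sketch of how those two options could be wired up with argparse. Both flags are only the proposal from this comment and do not exist in pylnk3. The names `build_parser` and `decode_with_options` and the program name are hypothetical:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pylnk-sketch")
    # Hypothetical options mirroring the proposal, not pylnk3's actual CLI.
    parser.add_argument("--chcp", default="cp1252",
                        help="code page used to decode ANSI strings")
    parser.add_argument("--ignore-decode-errors", action="store_true",
                        help="drop undecodable bytes instead of raising")
    return parser


def decode_with_options(raw: bytes, args: argparse.Namespace) -> str:
    """Decode raw bytes using the code page and error mode from the CLI."""
    errors = "ignore" if args.ignore_decode_errors else "strict"
    return raw.decode(args.chcp, errors=errors)
```

Note that `errors='ignore'` silently drops the offending bytes: decoding `b"c:\\\xf6\xf6\xf6\\1.txt"` as utf-8 this way gives `c:\\1.txt`, so the directory name is lost rather than recovered.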

strayge commented 2 years ago

How did you find that? There are at least 3 other code pages that give the same result: 1250, 1257, 1258.

Just a guess. It's the default for English Windows, and it decodes the path correctly.
You can try the master branch with DEFAULT_CHARSET changed to cp1252.
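
To illustrate why the byte alone cannot identify the code page, this small check decodes 0xf6 with the code pages mentioned in this thread, using only Python's built-in codecs:

```python
raw = b"\xf6"
codecs_to_try = ("cp1250", "cp1251", "cp1252", "cp1257", "cp1258")

# Map each code page name to the character it assigns to byte 0xf6.
decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, char in decoded.items():
    # ascii() keeps the output printable on any console encoding.
    print(name, ascii(char))
```

cp1250, cp1252, cp1257 and cp1258 all map the byte to ö (U+00F6), while cp1251 maps it to the Cyrillic ц (U+0446). On Windows, `locale.getpreferredencoding(False)` typically reports the ANSI code page of the machine running the script, which need not match the machine that created the shortcut.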

andry81 commented 2 years ago

Are there instructions on how to build the executable in Scripts?

andry81 commented 1 year ago

Can you add the same fix as in another link parser repository?

https://github.com/vphpersson/lnk_parser/issues/3

https://github.com/vphpersson/lnk_parser/commit/86d1e05dcee3b011ff9874be2707380662c551ef#diff-6ab89fa0c7d834c397b470492899f182394aaa22564e8050f1983608ad24037aR9-R18

andry81 commented 1 year ago

Here is another possible solution. With the --json option it prints:

{
    "relative_path": ".\\\u00f6\u00f6\u00f6\\1.txt",
    "work_dir": "D:\\\u00f6\u00f6\u00f6",
    "link_info": {
        "local_base_path": "D:\\\u0446\u0446\u0446\\1.txt"
    },
}

It prints the correct characters for the relative_path property. Maybe add an option to decode the TargetPath property as a composition of work_dir + relative_path as an alternative?
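
A sketch of that idea with one adjustment: per the MS-SHLLINK specification the relative path is relative to the file that contains the link, so this example resolves it against the directory of the .lnk file rather than against work_dir (joining `D:\ööö` with `.\ööö\1.txt` would give `D:\ööö\ööö\1.txt`). The function name `resolve_target` is hypothetical:

```python
import ntpath  # Windows path semantics, usable on any platform


def resolve_target(lnk_path: str, relative_path: str) -> str:
    """Rebuild the target path from the .lnk location and its relative path."""
    base_dir = ntpath.dirname(lnk_path)
    return ntpath.normpath(ntpath.join(base_dir, relative_path))
```

For the file in this issue, `resolve_target("D:\\1.txt.lnk", ".\\ööö\\1.txt")` gives `D:\ööö\1.txt`. This only works when the shortcut stores a relative path and the location of the .lnk file is known.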