Fix open file error in certain code pages when file path contains non-ASCII chars

myd7349 commented 1 year ago

I encountered a bug recently in the pyedflib library that affects Windows operating systems with non-ASCII characters in the file paths, especially in environments using CJK character sets. This issue arises when the system's default encoding is not UTF-8, but instead a locale-specific code page. Under such circumstances, pyedflib might fail to open files, rendering even solutions like get_short_path_name ineffective.

Test script:

import pyedflib

file_name = r'C:\Users\myd\Desktop\带 Annotation 的 EDF 文件\S001R01.edf'

pyedflib.edfreader._debug_parse_header(file_name)

edf = pyedflib.EdfReader(file_name, pyedflib.READ_ALL_ANNOTATIONS,
                         pyedflib.CHECK_FILE_SIZE)
annotations = edf.readAnnotations()
for annotation in annotations:
    print(type(annotation), annotation)

print(edf.getStartdatetime(), edf.getStartdatetime().microsecond)
edf.close()

C:\Users\myd\Desktop\dump_annotations.py:34: UserWarning: the filename C:\Users\myd\Desktop\带 Annotation 的 EDF 文件\S001R01.edf contains Unicode, but Windows does not fully support this. Please consider changing your locale to support UTF8. Attempting to load file via workaround (https://github.com/holgern/pyedflib/pull/100)
  edf = pyedflib.EdfReader(file_name, pyedflib.READ_ALL_ANNOTATIONS,
Traceback (most recent call last):
  File "pyedflib\_extensions\_pyedflib.pyx", line 145, in pyedflib._extensions._pyedflib.CyEdfReader.__init__
  File "pyedflib\_extensions\_pyedflib.pyx", line 208, in pyedflib._extensions._pyedflib.CyEdfReader.open
  File "pyedflib\_extensions\_pyedflib.pyx", line 179, in pyedflib._extensions._pyedflib.CyEdfReader.check_open_ok
FileNotFoundError: C:\Users\myd\Desktop\带 Annotation 的 EDF 文件\S001R01.edf: can not open file, no such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\myd\Desktop\dump_annotations.py", line 34, in <module>
    edf = pyedflib.EdfReader(file_name, pyedflib.READ_ALL_ANNOTATIONS,
  File "pyedflib\_extensions\_pyedflib.pyx", line 159, in pyedflib._extensions._pyedflib.CyEdfReader.__init__
  File "pyedflib\_extensions\_pyedflib.pyx", line 208, in pyedflib._extensions._pyedflib.CyEdfReader.open
  File "pyedflib\_extensions\_pyedflib.pyx", line 179, in pyedflib._extensions._pyedflib.CyEdfReader.check_open_ok
FileNotFoundError: C:\Users\myd\Desktop\带ANNO~1\S001R01.edf: can not open file, no such file or directory

In my case, get_short_path_name workaround doesn't work anymore:

import ctypes

def get_short_path_name(long_name):
    """
    Gets the short path name of a given long path.
    https://stackoverflow.com/a/23598461/200291
    """
    import ctypes
    from ctypes import wintypes
    _GetShortPathNameW = ctypes.windll.kernel32.GetShortPathNameW
    _GetShortPathNameW.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    _GetShortPathNameW.restype = wintypes.DWORD
    output_buf_size = 0
    while True:
        output_buf = ctypes.create_unicode_buffer(output_buf_size)
        needed = _GetShortPathNameW(long_name, output_buf, output_buf_size)
        if output_buf_size >= needed:
            return output_buf.value
        else:
            output_buf_size = needed

file_name = r'C:\Users\myd\Desktop\带 Annotation 的 EDF 文件\S001R01.edf'
print(get_short_path_name(file_name))

The original file path: C:\Users\myd\Desktop\带 Annotation 的 EDF 文件\S001R01.edf. File path returned by get_short_path_name: C:\Users\myd\Desktop\带ANNO~1\S001R01.edf.

As we can see, there are still non-ASCII characters in file path returned by get_short_path_name.

To address this problem, I created this PR:

If the input file path is of type bytes, it's directly passed to the C interface. This approach is based on the assumption that users are likely to have a better understanding of their local encoding.
Initially attempt to encode the file path using UTF-8 encoding.
If the UTF-8 encoded path fails to open, an attempt is made to use the locale-specific code page associated with the user's environment. This can be obtained using locale.getpreferredencoding(). For example, on a Simplified Chinese Windows system, this function would return cp936. This approach successfully resolves the majority of file opening failures on Windows systems.

skjerns commented 1 year ago

that seems like a good fix - I'll review it once I'm back from vacation. Thanks a lot!

skjerns commented 1 year ago

One thing: could you include the test for new behavior in a unit test?

myd7349 commented 1 year ago

One thing: could you include the test for new behavior in a unit test?

Sure.

myd7349 commented 1 year ago

After some consideration, I have to admit that writing unit tests for the newly added code this time is a bit challenging. Let's take my PC as an example. My machine is running Windows 10, set to Simplified Chinese with the default local encoding being GBK/GB18030, and the console's default code page is CP936. In MSVC, when I call the C function fopen to open a file, MSVC will interpret the file name using GBK/CP936. If the file name is passed from Python code and contains non-ASCII characters (such as CJK characters), in order to ensure that fopen can open the file, it's necessary to encode the string from Python as CP936 and pass it to fopen in C.

However, when it comes to the CI environment, its default encoding might not be GBK/CP936 or UTF-8. If the local encoding isn't UTF-8, my newly added test test_read_bytes might fail. In fact, I should write test_read_bytes like this:

signals2, _, _ = highlevel.read_edf(self.test_unicode_at_start_2.encode(locale.getpreferredencoding()))

But this introduces another issue: in some regions, their default encoding cannot encode Chinese characters. For instance, in cases where the default encoding is latin-1, "中文".encode('latin-1') would result in an error.

myd7349 commented 1 year ago

In fact, regarding this issue, I have another solution in mind. It involves patching edflib.c. We could create a function called fopen_utf8(const char *file), and then in the setup.py file, when using MSVC, we can replace the fopen function in edflib.c with our fopen_utf8 function. This approach offers an alternative solution.

https://github.com/myd7349/pyedflib/commit/477694d0a6a92df359c6d92236181c1fede132e3

skjerns commented 1 year ago

In fact, regarding this issue, I have another solution in mind. It involves patching edflib.c. We could create a function called fopen_utf8(const char *file), and then in the setup.py file, when using MSVC, we can replace the fopen function in edflib.c with our fopen_utf8 function. This approach offers an alternative solution.

myd7349@477694d

that sounds like an interesting solution - feel free to open a PR with it

myd7349 commented 1 year ago

In fact, regarding this issue, I have another solution in mind. It involves patching edflib.c. We could create a function called fopen_utf8(const char *file), and then in the setup.py file, when using MSVC, we can replace the fopen function in edflib.c with our fopen_utf8 function. This approach offers an alternative solution. myd7349@477694d

that sounds like an interesting solution - feel free to open a PR with it

Hi! @skjerns I have just created a new PR.

holgern / pyedflib

Fix open file error in certain code pages when file path contains non-ASCII chars #235