python / cpython

The Python programming language
https://www.python.org
Other
63.35k stars 30.34k forks source link

Modules with decomposable characters in module name not found on macOS #84013

Open 5ff1524e-2a8c-470a-9067-7d301b1fac9c opened 4 years ago

5ff1524e-2a8c-470a-9067-7d301b1fac9c commented 4 years ago
BPO 39832
Nosy @brettcannon, @ronaldoussoren, @vstinner, @ned-deily
Files
  • Modules.zip
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['OS-mac', 'interpreter-core', 'type-bug', '3.8'] title = 'Modules with decomposable characters in module name not found on macOS' updated_at = user = 'https://bugs.python.org/Norbert' ``` bugs.python.org fields: ```python activity = actor = 'brett.cannon' assignee = 'none' closed = False closed_date = None closer = None components = ['Interpreter Core', 'macOS'] creation = creator = 'Norbert' dependencies = [] files = ['48944'] hgrepos = [] issue_num = 39832 keywords = [] message_count = 5.0 messages = ['363224', '363614', '363774', '363804', '363836'] nosy_count = 5.0 nosy_names = ['brett.cannon', 'ronaldoussoren', 'vstinner', 'ned.deily', 'Norbert'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue39832' versions = ['Python 3.8'] ```

    Linked PRs

    5ff1524e-2a8c-470a-9067-7d301b1fac9c commented 4 years ago

    Modules whose names contain characters that are in precomposed form but can be decomposed in Normalization Form D can’t be found on macOS.

    To reproduce:

    1. Download and unzip the attached file Modules.zip. This produces a directory Modules with four Python source files.

    2. In Terminal, go to the directory that contains Modules.

    3. Run "python3 -m Modules.Import".

    Expected behavior:

    The following lines should be generated: Maerchen Märchen

    Actual behavior:

    The first line, “Maerchen” is generated, but then an error occurs:
    Traceback (most recent call last):
     File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 193, in _run_module_as_main
       return _run_code(code, main_globals, None,
     File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
       exec(code, run_globals)
     File "/Users/business/tmp/pyimports/Modules/Import.py", line 5, in <module>
       from Modules.Märchen import hello2
    ModuleNotFoundError: No module named 'Modules.Märchen'

    Evaluation:

    In the source file Modules/Import.py, the name of the module “Märchen” is written with the precomposed character U+00E4. The file name Märchen.py uses the decomposed character sequence U+0061 U+0308 instead. Macintosh file names commonly use a variant of Normalization Form D in file names – the old file system HFS enforces this, and while APFS doesn’t, the Finder still generates file names in this form. U+00E4 and U+0061 U+0308 are canonically equivalent, so they should be treated as equal in module loading.

    Tested configuration:

    CPython 3.8.2 macOS 10.14.6

    ned-deily commented 4 years ago

    This seems like more an import issue than a uniquely macOS issue. Also, a quick search found bpo-10952 which appears to be similar.

    brettcannon commented 4 years ago

    The import system makes no attempt at normalizing Unicode strings for path comparisons. One would have to probably update FileFinder (https://github.com/python/cpython/blob/master/Lib/importlib/_bootstrap_external.py#L1392) somehow (assuming the appropriate codec support is available to importlib during start-up).

    5ff1524e-2a8c-470a-9067-7d301b1fac9c commented 4 years ago

    Yes, if the Python runtime caches file names and determines based on the cache whether a file exists, then it needs to normalize both the file names in the cache and the name of the file it’s looking for. As far as I know, both HFS and APFS do this themselves when asked for a file by name, but if you ask for a list of available files, they don’t know what you’re comparing against.

    I don’t think codecs would be involved here; I’d use unicodedata.normalize with either NFC or NFD – doesn’t matter which one.

    brettcannon commented 4 years ago

    Regardless of which module is proposed to solve this, there is still a bootstrapping issue to consider.

    ronaldoussoren commented 1 year ago

    On macOS Ventura only one of the tests below fails, the one with a decomposed characters, which is different from the original report. The difference might be due to changes in the APFS filesystem over time.

    import unittest
    import os
    import sys
    
    class TestImportAccented(unittest.TestCase):
        def assert_importing_possible(self, name):
            filename = f"{name}.py"
            with open(filename, "w") as stream:
                stream.write("SPAM = 'spam'\n")
            self.addCleanup(os.unlink, filename)
    
            values = {}
            exec(f"from {name} import SPAM", values, values)
            self.assertEqual(values["SPAM"], "spam")
            del sys.modules[name]
    
        def test_import_precomposed(self):
            name = 'M\u00E4dchen'
            self.assert_importing_possible(name)
    
        def test_import_normalized(self):
            name = 'M\u0061\u0308dchen'
            self.assert_importing_possible(name)
    
        def test_import_macos_input(self):
            name = 'Mädchen'
            self.assert_importing_possible(name)

    The last test contains the name as I get it when typing it using vim in a terminal, which should be indicative of how most users on macOS will type the name (even when using GUI editors). This should reduce the risk of running into this on newer macOS versions, although it would still be nice to be normalisation aware in the import system because identifiers in code are normalised using NFKC.

    ronaldoussoren commented 10 months ago

    The attached PR should do the trick, it passes a complete test run for me. The PR is marked as DO-NOT-MERGE because the tests are not yet complete.

    The PR uses unicodedata.normalize to normalise the caches in FileFinder, but only when the to be import name is not ASCII. This leads to slightly more completicated code, but should solve the bootstrap issue. This also avoids paying the cost of normalising all cached names when none of the imported names contain non-ASCII values.

    The code doesn't guard against import errors for unicodedata, primarily because the parser also errors out when that extension module is not present.