`cabal init` incorrectly writes non-ASCII file names in `library:exposed-modules`

Fyrbll commented 4 years ago

Describe the bug

Consider a folder hierarchy that looks like

proglet/
|- Typeclass/
   |- Pokémon.hs

Running cabal init --interactive when the working directory is proglet generates the files necessary for cabal v2-build.

Running cabal v2-build, however, gives the unexpected error

Typeclass/Pokémon.hs:1:8: error:
    File name does not match module name:
    Saw: ‘Typeclass.Pokémon’
    Expected: ‘Typeclass.Pokémon’
  |
1 | module Typeclass.Pokémon where
  |        ^^^^^^^^^^^^^^^^^

To Reproduce

Create folder proglet
Create folder proglet/Typeclass
Create file proglet/Typeclass/Pokémon.hs with content

module Typeclass.Pokémon where

x = 1

Change the working directory to proglet
Run cabal init --interactive and complete the prompts this way
- Don't generate a "simple project with sensible defaults"
- Build a library
- Use any version of the Cabal specification
- Any package name of your choosing
- Any package version
- Any license
- Any author name
- Any email
- Any project URL
- Any synopsis
- Any category
- No source directory
- No test suite
- Haskell2010
- No "informative comments"
Run cabal v2-build and observe the error

Expected behavior

I expected the build to succeed.

System information

Operating system: macOS 10.14
cabal version: 3.0.0.0
ghc version: 8.6.5

Additional context

It seems that the "e with acute accent" under "Saw" is the single precomposed code point U+00E9, while the "e with acute accent" under "Expected" is the plain letter "e" (U+0065) followed by a combining acute accent character.

Since the "e with acute accent" under "Expected" corresponds to the contents of the library:exposed-modules section of the .cabal file, I checked proglet.cabal (I have removed the comment lines below).

cabal-version:       2.4

name:                proglet
version:             0.1.0.0
license:             BSD-3-Clause
license-file:        LICENSE
author:              Fyrbll
maintainer:          unnecessary
extra-source-files:  CHANGELOG.md

library
  exposed-modules:     Typeclass.Pokémon
  build-depends:       base ^>=4.12.0.0
  default-language:    Haskell2010

I changed Typeclass.Pokémon in the file above to Typeclass.Pokémon, where the latter actually uses U+00E9, and then cabal build worked painlessly.

phadej commented 4 years ago

According to https://superuser.com/questions/999232/unicode-filenames-in-windows-vs-mac-os-x:

OS X uses UTF-8. Codepoints are encoded using between one and five bytes. OS X uses Unicode NFD (Normalization Form Canonical Decomposition).

This means that when a Unicode character such as "é" is used in a filename it will always be normalized by the system into a regular ASCII "e" followed by a Unicode combining acute accent, and will always take two codepoints.

NFD (note visible e, i.e. 65)

00000000  54 79 70 65 63 6c 61 73  73 2e 50 6f 6b 65 cc 81  |Typeclass.Poke..|
00000010  6d 6f 6e 0a                                       |mon.|
00000014

NFC (note no e)

00000000  54 79 70 65 63 6c 61 73  73 2e 50 6f 6b c3 a9 6d  |Typeclass.Pok..m|
00000010  6f 6e 0a                                          |on.|
00000013
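The two byte sequences above can be reproduced with a small sketch that needs only base. The UTF-8 encoder is hand-rolled for illustration, and the names utf8Char, utf8, nfcName and nfdName are made up for this example; none of this is cabal or GHC code.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a single character as UTF-8 (illustrative, base only).
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, shifted down by s.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode a whole string as UTF-8.
utf8 :: String -> [Word8]
utf8 = concatMap utf8Char

-- NFC spelling: precomposed U+00E9 (decimal 233).
nfcName :: String
nfcName = "Typeclass.Pok\233mon"

-- NFD spelling: plain 'e' followed by combining acute U+0301 (decimal 769).
nfdName :: String
nfdName = "Typeclass.Poke\769mon"
```

Loaded in GHCi, nfcName == nfdName evaluates to False even though both render the same, and utf8 applied to each name yields the c3 a9 and 65 cc 81 sequences seen in the dumps.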

I'm not 100% sure what happens, though. cabal init writes what it gets from the file system, but GHC doesn't find a file based on it. Conversion through String shouldn't destroy this. Finding where the normalization happens (and why) will help to solve the issue.

Fyrbll commented 4 years ago

Thanks so much for the insight! I'll use this information to make whatever progress I can on my end, and if I learn exactly what's going on I'll post my findings here.

Fyrbll commented 4 years ago

Unfortunately, the Super User answer only applies to the HFS and HFS+ file systems, which were replaced by the APFS file system in macOS High Sierra (10.13) and above. I can confirm that my machine uses APFS, whose normalization rules can be found here. According to these, the system doesn't enforce a single form of Unicode normalization.

APFS accepts only valid UTF-8 encoded filenames for creation, and preserves both case and normalization of the filename on disk in all variants. ... Being normalization-insensitive ensures that normalization variants of a filename cannot be created in the same directory, and that a filename can be found with any of its normalization variants.

I have reason to believe it's the program creating the file that controls how its name is normalized. On my system:

Emacs 26.1: Running M-x save-buffer, then entering the name éclair, results in an NFD normalized file name.
TextEdit: Opening a new file, then saving it using Command + S with the name éclair results in an NFD normalized file name.
Vim 8.0: Opening a new file, then saving it with :w éclair results in an NFC normalized file name.
Running > éclair or touch éclair with bash results in an NFC normalized file name.

(Note: I checked the above statements using xxd)

With this information in mind, when I create the following file with Emacs and save it with the name Pokémon.hs using the save-buffer function, the file name is NFD normalized, whereas the text within the file (most importantly the name of the module) is NFC normalized.

module Pokémon where

x = 1

Note that when cabal init is populating the exposed-modules field, it doesn't venture into the Haskell files themselves to pull out module names. It looks at file names and directory names, trusting that the relative path to a file will match the name of the module declared within it, except with periods in place of slashes.

If I understand correctly, when cabal build, cabal new-build, or cabal v2-build is run, the name of the module is expected to match the name of the file exactly - which won't happen if the module's name and the file's name are normalized differently.
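A minimal sketch of that mismatch, assuming the module name is derived purely from the relative path. moduleNameFromPath, pathOnDisk, declaredName and namesAgree are hypothetical names for illustration, not the actual cabal-install functions.

```haskell
-- Hypothetical helper for illustration; this is not cabal-install's code.
-- Turns a relative source path into the module name it implies.
moduleNameFromPath :: FilePath -> String
moduleNameFromPath = map slashToDot . dropHsExtension
  where
    slashToDot '/' = '.'
    slashToDot c   = c
    dropHsExtension p = case reverse p of
      's' : 'h' : '.' : rest -> reverse rest
      _                      -> p

-- File name as Emacs saved it: 'e' followed by U+0301 (NFD).
pathOnDisk :: FilePath
pathOnDisk = "Typeclass/Poke\769mon.hs"

-- Module header as typed inside the file: precomposed U+00E9 (NFC).
declaredName :: String
declaredName = "Typeclass.Pok\233mon"

-- Compares the path-derived name with the declared name, code point by
-- code point. The two render identically but are different sequences.
namesAgree :: Bool
namesAgree = moduleNameFromPath pathOnDisk == declaredName
```

Here namesAgree is False, which matches the "Saw" versus "Expected" discrepancy in the build error: the comparison is exact, so differently normalized spellings of the same name do not match.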

I managed to work around this problem for my local cabal by changing a definition in the where clause of the function scanForModulesIn, located in the module Distribution.Client.Init.Heuristics of the cabal-install project.

I changed

entries <- getDirectoryContents (projectRoot </> dir)

to

entries <- fmap (map (T.unpack . normalize NFKC . T.pack))
    (getDirectoryContents (projectRoot </> dir))

Above, unpack and pack are from Data.Text, while normalize is from Data.Text.Normalize in the unicode-transforms package.

Fyrbll commented 4 years ago

Since this has been marked as a bug now, if the fix above is acceptable (in its current form it adds unicode-transforms as a dependency) I can make a pull request.

phadej commented 4 years ago

I'm not sure that's a correct fix. I don't understand why getDirectoryContents gets differently normalized contents.

Without a proper understanding, when we fix macOS we might break Windows or Linux, so this should be fixed properly.

Also there is: https://github.com/haskell/tar/issues/6 so I suspect that may cause some problems too (or is the problem?)

jneira commented 3 years ago

@emilypi could this be resolved by #7344?

emilypi commented 3 years ago

@jneira this was explicitly left off that particular ticket, because we weren't sure if it was completely solved. However, if someone were to confirm that we did in fact fix this, I would be fine with saying it's closed. A regression test for this would be enough for me to make that call.

haskell / cabal

`cabal init` incorrectly writes non-ASCII file names in `library:exposed-modules` #6507