Open Fyrbll opened 4 years ago
According to https://superuser.com/questions/999232/unicode-filenames-in-windows-vs-mac-os-x:
OS X uses UTF-8. Codepoints are encoded using between one and four bytes. OS X uses Unicode NFD (Normalization Form Canonical Decomposition).
This means that when a Unicode character such as "é" is used in a filename it will always be normalized by the system into a regular ASCII "e" followed by a Unicode combining acute accent, and will always take two codepoints.
NFD (note visible e, i.e. 65)
00000000 54 79 70 65 63 6c 61 73 73 2e 50 6f 6b 65 cc 81 |Typeclass.Poke..|
00000010 6d 6f 6e 0a |mon.|
00000014
NFC (note no e)
00000000 54 79 70 65 63 6c 61 73 73 2e 50 6f 6b c3 a9 6d |Typeclass.Pok..m|
00000010 6f 6e 0a |on.|
00000013
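The byte sequences in the two dumps can be reproduced without touching the file system. The following is a minimal sketch, assuming the text and bytestring packages that ship with GHC; utf8Hex, nfdName and nfcName are names invented for this illustration and appear nowhere in cabal.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Numeric (showHex)

-- | Render the UTF-8 encoding of a String as space-separated hex bytes.
utf8Hex :: String -> String
utf8Hex = unwords . map hexByte . B.unpack . TE.encodeUtf8 . T.pack
  where
    -- Pad single-digit bytes so every byte prints as two hex digits.
    hexByte b =
      let h = showHex b ""
       in if length h < 2 then '0' : h else h

-- Decomposed (NFD) spelling: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
nfdName :: String
nfdName = "Poke\x0301mon"

-- Precomposed (NFC) spelling: the single code point U+00E9.
nfcName :: String
nfcName = "Pok\xe9mon"
```

In GHCi, utf8Hex nfdName contains the bytes 65 cc 81 from the first dump, while utf8Hex nfcName contains c3 a9 from the second.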
I'm not 100% sure what happens, though: cabal init writes what it gets from the file system, but GHC doesn't find a file based on it. Conversion through String shouldn't destroy this.
Finding where the normalization happens (and why) will help to solve the issue.
Thanks so much for the insight! I'll use this information to make whatever progress I can on my end, and if I learn exactly what's going on I'll post my findings here.
Unfortunately, the Super User answer only applies to the HFS and HFS+ file systems, which were replaced by the APFS file system in macOS High Sierra (10.13) and above. I can confirm that my machine uses APFS, whose normalization rules can be found here. According to these, the system doesn't enforce a single form of Unicode normalization.
APFS accepts only valid UTF-8 encoded filenames for creation, and preserves both case and normalization of the filename on disk in all variants. ... Being normalization-insensitive ensures that normalization variants of a filename cannot be created in the same directory, and that a filename can be found with any of its normalization variants.
I have reason to believe it's the program creating the file that controls how its name is normalized. On my system:
- M-x save-buffer, then entering the name éclair, results in an NFD normalized file name.
- éclair results in an NFD normalized file name.
- :w éclair results in an NFC normalized file name.
- > éclair or touch éclair with bash results in an NFC normalized file name.
(Note: I checked the above statements using xxd.)
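The same check can be made from Haskell by listing the code points of a name rather than its bytes. This is a small sketch using only base; codePoints is a hypothetical helper, and applying it to entries returned by getDirectoryContents assumes GHC decodes file names as UTF-8 on the machine in question.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- | Format one character as U+XXXX (at least four uppercase hex digits).
codePoint :: Char -> String
codePoint c = "U+" ++ replicate (4 - length hex) '0' ++ hex
  where
    hex = map toUpper (showHex (ord c) "")

-- | List the code points of a name, one entry per character.
codePoints :: String -> [String]
codePoints = map codePoint
```

An NFD name shows U+0065 followed by U+0301 where an NFC name shows a single U+00E9.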
With this information in mind, when I create the following file with Emacs and save it with the name Pokémon.hs using the save-buffer function, the file name is NFD normalized, whereas the text within the file (most importantly the name of the module) is NFC normalized.
module Pokémon where
x = 1
Note that when cabal init is populating the exposed-modules field, it doesn't venture into the Haskell files themselves to pull out module names. It looks at file names and directory names, trusting that the relative path to a file will match the name of the module declared within it, except with periods in place of slashes.
If I understand correctly, when cabal build, cabal new-build, or cabal v2-build is run, the name of the module is expected to match the name of the file exactly, which won't happen if the module's name and the file's name are normalized differently.
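To make the mismatch concrete, here is a minimal sketch of the path-to-module-name mapping described above and of the comparison that then fails. It assumes the filepath package that ships with GHC. moduleNameFromPath is a hypothetical stand-in and not cabal's actual code, and the two literals spell out the code points reported in this thread.

```haskell
import Data.List (intercalate)
import System.FilePath (dropExtension, splitDirectories)

-- | Hypothetical stand-in for deriving a module name from a relative source
-- path: drop the extension and put periods in place of path separators.
moduleNameFromPath :: FilePath -> String
moduleNameFromPath = intercalate "." . splitDirectories . dropExtension

-- File name as saved in the report: 'e' followed by U+0301 (NFD).
nfdPath :: FilePath
nfdPath = "Typeclass/Poke\x0301mon.hs"

-- Module name as typed inside the source file: precomposed U+00E9 (NFC).
nfcModule :: String
nfcModule = "Typeclass.Pok\xe9mon"

-- | String equality works code point by code point, so this is False even
-- though both names render identically.
namesAgree :: Bool
namesAgree = moduleNameFromPath nfdPath == nfcModule
```

The derived name keeps whatever code points the file system returned, so an NFD file name yields an NFD entry in exposed-modules.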
I managed to work around this problem for my local cabal by changing a definition in the where clause of the function scanForModulesIn, located in the module Distribution.Client.Init.Heuristics of the cabal-install project.
I changed
entries <- getDirectoryContents (projectRoot </> dir)
to
entries <- fmap (map (T.unpack . normalize NFKC . T.pack))
             (getDirectoryContents (projectRoot </> dir))
Above, unpack and pack are from Data.Text, while normalize is from Data.Text.Normalize in the unicode-transforms package.
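For readers who want to try the workaround outside cabal-install, here is a self-contained sketch of the same idea. It assumes the unicode-transforms, text and directory packages are available; normalizeName and getNormalizedDirectoryContents are hypothetical names chosen for this illustration.

```haskell
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFKC), normalize)
import System.Directory (getDirectoryContents)

-- | Normalize a single file name to NFKC, as in the change above.
normalizeName :: FilePath -> FilePath
normalizeName = T.unpack . normalize NFKC . T.pack

-- | Hypothetical wrapper: list a directory and normalize every entry.
getNormalizedDirectoryContents :: FilePath -> IO [FilePath]
getNormalizedDirectoryContents dir =
  map normalizeName <$> getDirectoryContents dir
```

One design point to weigh: NFKC also applies compatibility mappings (for example, the ligature U+FB01 becomes the two letters "fi"), so it can change names in ways that go beyond composing accents. NFC may be the more conservative choice if only the accent mismatch needs fixing.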
Since this has been marked as a bug now, if the fix above is acceptable (in its current form it adds unicode-transforms as a dependency) I can make a pull request.
I'm not sure that's a correct fix. I don't understand why getDirectoryContents returns differently normalized contents.
Without a proper understanding, fixing macOS might break Windows or Linux, so this should be fixed properly.
Also there is: https://github.com/haskell/tar/issues/6 so I suspect that may cause some problems too (or is the problem?)
@emilypi could this be resolved by #7344?
@jneira this was explicitly left off that particular ticket, because we weren't sure if it was completely solved. However, if someone were to confirm that we did in fact fix this, I would be fine with saying it's closed. A regression test for this would be enough for me to make that call.
Describe the bug
Consider a folder hierarchy that looks like proglet/Typeclass/Pokémon.hs. Running cabal init --interactive when the working directory is proglet generates the files necessary for cabal v2-build. Running cabal v2-build, however, gives an unexpected error.
To Reproduce
1. Create the directory proglet.
2. Create the directory proglet/Typeclass.
3. Create the file proglet/Typeclass/Pokémon.hs.
4. Change the working directory to proglet.
5. Run cabal init --interactive and complete the prompts.
6. Run cabal v2-build and observe the error.
Expected behavior
I expected the build to succeed.
System information
- cabal version: 3.0.0.0
- ghc version: 8.6.5
Additional context
It seems that the "e with acute accent" under "Saw" in the error is the Unicode character U+00E9, while the "e with acute accent" under "Expected" is a combination of the normal letter "e" (U+0065) and a combining acute accent character.
Since the "e with acute accent" under "Expected" corresponds to the contents of the library:exposed-modules section of the .cabal file, I checked proglet.cabal. I changed Typeclass.Pokémon (the decomposed spelling) in that file to Typeclass.Pokémon, where the latter actually uses U+00E9, and then cabal build worked painlessly.