jesicabaxter / ndmitchell

Automatically exported from code.google.com/p/ndmitchell

Shake does not find existing files with UTF8 characters #614

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Shake is not able to find files with non-ASCII (UTF8) characters in their names, 
even though the function doesFileExist from System.Directory returns True.

What steps will reproduce the problem?
1. touch "Test with Ü.txt"
2. in Shakefile: need ["Test with Ü.txt"]
3. runhaskell Shakefile

Example Shakefile:

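import Development.Shake

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["result.tar"]
    "*.tar" *> \out -> do
        -- The list of files to archive is read from a file with a non-ASCII name
        need ["Test with Ü.txt"]
        content <- readFileLines "Test with Ü.txt"
        need content
        cmd "tar -cf" [out] content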

What is the expected output? What do you see instead?

Expected: Shake should build “result.tar”.

Error when running Shake build system:
* result.tar
* Test with Ü.txt
Error, file does not exist and no rule available:
  Test with Ü.txt

What version of the product are you using? On what operating system?
shake-0.10.6 on Linux

Please provide any additional information below.
$ ghci <<EOF
import System.Directory
doesFileExist "Test with Ü.txt" 
EOF
True

Original issue reported on code.google.com by hali...@gmail.com on 16 Jul 2013 at 7:27

GoogleCodeExporter commented 8 years ago
Thanks for the comprehensive error report. Is this something you actually need, 
or were you just testing the feature to see what it would do? It's not quite 
clear how UTF8 strings for filepaths work in all situations, and the Shake 
pieces all use ByteString under the hood for faster lookup, but if this is 
actually something you are encountering in practice I'll take a closer look.

Original comment by ndmitch...@gmail.com on 16 Jul 2013 at 1:26

GoogleCodeExporter commented 8 years ago
I started using Shake for a small project (document generation with LaTeX). 
Because the document is written in German, I used German file names, too. The 
original make handles the situation quite well (because it doesn’t handle 
encoding at all). I would say that a Make replacement should work with 
non-ASCII file names. Also, most modern systems encode file names as UTF8.

In the meantime I tried to track the issue down. You are using 
Data.ByteString.Char8, so that should not be a problem. It’s just using 
the raw character codes of the string.

Prelude Main> :m + Data.ByteString.Char8 System.Directory Development.Shake.FileTime
Prelude Data.ByteString.Char8 System.Directory Development.Shake.FileTime Main> getModTimeMaybe $ pack "/tmp/ü.txt"
Nothing
Prelude Data.ByteString.Char8 System.Directory Development.Shake.FileTime Main> getModificationTime . unpack $ pack "/tmp/ü.txt"
2013-07-17 11:34:35 UTC
Prelude Data.ByteString.Char8 System.Directory Development.Shake.FileTime Main> doesFileExist . unpack $ pack "/tmp/ü.txt"
True

I guess there is something wrong with getModTimeMaybe. I’m a bit confused by 
the #ifdefs in the code and I don’t know which version is used.

When I #define PORTABLE, everything works as expected.

*Main Data.ByteString.Char8 System.Directory Development.Shake.FileTime> getModTimeMaybe $ pack "/tmp/ü.txt"
Just (FileTime 41675)

So most likely the unix library version of the function is used on my system by 
default and this function is to blame for this behavior.

Original comment by hali...@gmail.com on 17 Jul 2013 at 12:18

GoogleCodeExporter commented 8 years ago
Defining PORTABLE should be a perfectly fine workaround - if you have thousands 
of files it will go a fraction slower for the nothing-to-do rebuild, but 
otherwise you won't notice. You should even be able to pass --flags=portable to 
Cabal at configure/install time to do it without editing the source.

I think the problem is that you are using a high-8bit character. I translate it 
from FilePath to ByteString and it remains a single high-8bit byte, but the unix 
functions I pass it to expect UTF8. The solution is for Shake to store all 
filepaths as UTF8 internally and pass them directly to the unix functions, 
decode properly to FilePath, and convert them to UCS2 for Windows. I'll work on 
such a feature in the next few days. I agree that supporting Unicode is ideal - 
thanks for all the hard work you've done tracking down where the issue lies.
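
The mismatch described above can be illustrated with a minimal sketch. It is 
not Shake's actual code: the helper names char8Bytes and utf8Bytes are 
hypothetical, and the UTF8 encoding is done with the text package purely for 
illustration. It compares the bytes Data.ByteString.Char8.pack produces for a 
file name with the bytes of its UTF8 encoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: Char8.pack keeps only the low 8 bits of each Char,
-- so 'ü' (U+00FC) becomes the single byte 0xFC.
char8Bytes :: FilePath -> B.ByteString
char8Bytes = BC.pack

-- Hypothetical helper: the UTF8 encoding of the same name, which is what a
-- file system using UTF8 names would hold on disk.
utf8Bytes :: FilePath -> B.ByteString
utf8Bytes = TE.encodeUtf8 . T.pack

main :: IO ()
main = do
    let name = "\252.txt"  -- "ü.txt", written with a numeric escape
    print (B.unpack (char8Bytes name))  -- [252,46,116,120,116]
    print (B.unpack (utf8Bytes name))   -- [195,188,46,116,120,116]
```

The two byte sequences differ, so a lookup that hands the Char8 bytes straight 
to a byte-oriented system call would be asking for a different name than the 
one stored on disk.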

Original comment by ndmitch...@gmail.com on 18 Jul 2013 at 9:12

GoogleCodeExporter commented 8 years ago
I think with PORTABLE and the old code it will work for Unicode code points up 
to 255, but nothing higher.
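
A small sketch of why that limit would arise, assuming the old PORTABLE path 
converts the stored ByteString back to a FilePath with Char8.unpack (the 
helper names roundTrip and survives are hypothetical, not part of Shake):

```haskell
import qualified Data.ByteString.Char8 as BC

-- Hypothetical helper: store a name as a Char8 ByteString and read it back.
-- Char8.pack truncates every Char to 8 bits, so code points above 255 are lost.
roundTrip :: String -> String
roundTrip = BC.unpack . BC.pack

-- True when the name comes back unchanged from the round trip.
survives :: String -> Bool
survives s = roundTrip s == s

main :: IO ()
main = do
    print (survives "Test with \220.txt")  -- 'Ü' is U+00DC: True
    print (survives "\8364.txt")           -- '€' is U+20AC: False
```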

I've just pushed some patches that enable full unicode support on both Windows 
and Linux for all unicode characters, regardless of PORTABLE or not, and have 
included some tests which seem to work. Are you able to grab the latest version 
from git and give it a go? If not, I'll make a release for you to test.

Original comment by ndmitch...@gmail.com on 20 Jul 2013 at 7:40

GoogleCodeExporter commented 8 years ago
Yes, the current git version works as expected. Thank you very much for fixing 
this issue so fast.

Now I’ve found a proper replacement for make and its restricted capabilities. 
Thanks :-)

Original comment by hali...@gmail.com on 20 Jul 2013 at 8:24

GoogleCodeExporter commented 8 years ago
Great, I'll make a release in the next few days - thanks for testing and thanks 
for the suggestion, it is much more powerful to support full unicode.

Original comment by ndmitch...@gmail.com on 20 Jul 2013 at 9:22

GoogleCodeExporter commented 8 years ago
This was fixed a while ago in shake-0.10.7.

Original comment by ndmitch...@gmail.com on 18 Nov 2013 at 5:10