Open ndmitchell opened 8 years ago
@ndmitchell Any news on this front? Is the intention to have strongly typed paths eventually?
No progress on this front at all, not looked at it at all. There is no intention to change the API with this work - it's about cleaning up the core and ensuring consistency.
@ndmitchell I came up with the following. This is for windows only for now, doesn't use a real parser since it's rather small and passes the display (parse x) === x
property. Thoughts?
data Lexeme = NS NameSpace
| Disk Char
| Device String
| Share String
| Separators [Char]
| FileName String
deriving Show
data NameSpace = FileNameSpace
| DeviceNameSpace
| NTNameSpace
deriving Show
-- | Parse a filepath into lexemes.
--
-- > display (parse x) === x
parse :: String -> [Lexeme]
parse fp'
-- '\\?\UNC\share\path'
| Just (lx, rest) <- parseExtendedUNC fp' = lx ++ parsePath rest
-- '\\?\C:\path', '\\.\C:\path', '\??\C:\path'
| Just (lx1, r1) <- parseNameSpace fp'
, Just (lx2, r2) <- parseDisk r1 = lx1 ++ lx2 ++ parsePath r2
-- '\\.\COM1'
| Just (lx1@(NS DeviceNameSpace:_), r1) <- parseNameSpace fp'
, Just (lx2, r2) <- parseDevice r1 = lx1 ++ lx2 ++ parsePath r2
-- '\\?\some\other', '\\.\some\other', '\??\some\other'
-- detects no device/disk
| Just (lx, rest) <- parseNameSpace fp' = lx ++ parsePath rest
-- 'C:\path'
| Just (lx, rest) <- parseDisk fp' = lx ++ parsePath rest
-- '\\share\path'
| Just (lx, rest) <- parseDriveShare fp' = lx ++ parsePath rest
-- 'relative\path' and everything else
| otherwise = parsePath fp'
where
parsePath :: String -> [Lexeme]
parsePath fp =
case parseFileName fp <|> parseSeparators fp of
Nothing -> []
Just (lx, []) -> lx
Just (lx, rest) -> lx ++ parsePath rest
parseFileName :: String -> Maybe ([Lexeme], String)
parseFileName fp =
case break isPathSeparator fp of
([], _) -> Nothing
(a, r) -> Just ([FileName a], r)
parseSeparators :: String -> Maybe ([Lexeme], String)
parseSeparators fp
| null a = Nothing
| otherwise = Just ([Separators a], b)
where (a, b) = span isPathSeparator fp
-- prefix only
parseExtendedUNC :: String -> Maybe ([Lexeme], String)
parseExtendedUNC ('\\':'\\' :'?':'\\':'U':'N':'C':s4:r1) | isPathSeparator s4 = do
(lx, r2) <- parseSeparators (s4:r1)
(share, r3) <- parseDriveShareName r2
pure ([NS FileNameSpace] ++ lx ++ share, r3)
parseExtendedUNC _ = Nothing
-- prefix only
parseNameSpace :: String -> Maybe ([Lexeme], String)
parseNameSpace ('\\':'?' :'?':'\\':rest) = Just ([NS NTNameSpace ], rest)
parseNameSpace ('\\':'\\':'?':'\\':rest) = Just ([NS FileNameSpace ], rest)
parseNameSpace ('\\':'\\':'.':'\\':rest) = Just ([NS DeviceNameSpace], rest)
parseNameSpace _ = Nothing
-- prefix or after any NameSpace
parseDisk :: String -> Maybe ([Lexeme], String)
parseDisk (x:':':rest) | isLetter x = Just ([Disk x], rest)
parseDisk _ = Nothing
parseDevice :: String -> Maybe ([Lexeme], String)
parseDevice fp =
case break isPathSeparator fp of
([], _) -> Nothing
(a, r) -> Just ([Device a], r)
-- \\sharename\
parseDriveShare :: String -> Maybe ([Lexeme], String)
parseDriveShare (s1:s2:xs) | isPathSeparator s1 && isPathSeparator s2 = do
(lx, rest) <- parseDriveShareName xs
pure ([Separators [s1, s2]] ++ lx, rest)
parseDriveShare _ = Nothing
-- assume you have already seen \\
-- share\bob -> "share\", "bob"
parseDriveShareName :: String -> Maybe ([Lexeme], String)
parseDriveShareName name = do
case break isPathSeparator name of
([], _) -> Nothing
(a, b) -> let (lx, r) = optional b parseSeparators
in Just ([Share a] ++ lx, r)
optional :: String -> (String -> Maybe ([Lexeme], String)) -> ([Lexeme], String)
optional input parser = maybe ([], input) id (parser input)
display :: [Lexeme] -> String
display lxs = case lxs of
(ns@(NS FileNameSpace):sep@(Separators _):share@(Share _):rest) ->
d ns ++ "UNC" ++ d sep ++ d share ++ display rest
other -> concatMap d other
where
d :: Lexeme -> String
d (NS FileNameSpace) = "\\\\?\\"
d (NS DeviceNameSpace) = "\\\\.\\"
d (NS NTNameSpace) = "\\??\\"
d (Disk c) = c:":"
d (Device dev) = dev
d (Share share) = share
d (Separators sep) = sep
d (FileName fn) = fn
@hasufell that looks great and handles all the paths I'd expected. The only question I have is how does it handle /
as a separator. /
is a valid path separator in a on-namespaced path but isn't in a namespace. i.e \\?\C:\foo/bar
refers to a file called foo/bar
.
Also /foo\bar
is a valid relative path, leading /
are relative on Windows.
This next one is arcane so feel free not to support it, but on a non-hierarchical path the \
after the :
for the drive letter is optional. So c:foo.txt
and c:foo
are valid but c:foo\bar.txt
and c:foo\bar
aren't.
Lastly I'll leave it up to you whether to support file streams or not, they a they aren't very common https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams
@Mistuke
Lastly I'll leave it up to you whether to support file streams or not, they a they aren't very common https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams
That actually makes sense in the light of a security vulnerability, which broke the handling of file extensions.
E.g. current filepath behavior is this (which is incorrect):
> splitExtension "lol.txt::$DATA"
("lol",".txt::$DATA")
There'll be some questions on how to preserve the uncurry addExtension (splitExtension x) == x'
property.
This next one is arcane so feel free not to support it, but on a non-hierarchical path the \ after the : for the drive letter is optional. So c:foo.txt and c:foo are valid but c:foo\bar.txt and c:foo\bar aren't.
Right... but the lexer should probably parse it anyway. And then we reject it in e.g. isValid
?
EDIT: I tested it and c:foo\bar.txt
is the same as foo\bar.txt
on my windows 10.
/ is a valid path separator in a on-namespaced path but isn't in a namespace. i.e \?\C:\foo/bar refers to a file called foo/bar.
Right... but it seems both separators are valid in \\.\C:\foo/bar
, so it seems the only difference between \\?\
and \\.\
is that the latter does do expansion etc. I'm not sure about \??\
(see below).
Other things I'm not sure about/still testing:
\??\C:
throws invalid argument
(tried readFile/writeFile/listDirectory, ...), although the WinObj manager shows under GLOBAL??
that this symlink exists (also see this SU answer: https://superuser.com/a/1096784/854039)... I can access all these symlinks via \\?\
and \\.\
, though, which is rather confusingDevice
lexeme, which is supposed to parse COM1
in '\\.\COM1'
... seems to also work with file namespace \\?\COM1
. Generally, I can access all the symlinks from the WinObj manager with these two. So it seems we can expect either a disk char or a device string after a namespace prefix.'\\.\COM1\lol\bar'
still seem to correctly access the COM1
port.../
C:
)COM1
)\\localhost
)
- NT object namespace doesn't work at all for me, e.g.
\??\C:
throwsinvalid argument
(tried readFile/writeFile/listDirectory, ...), although the WinObj manager shows underGLOBAL??
that this symlink exists (also see this SU answer: https://superuser.com/a/1096784/854039)... I can access all these symlinks via\\?\
and\\.\
, though, which is rather confusing- The use of
Device
lexeme, which is supposed to parseCOM1
in'\\.\COM1'
... seems to also work with file namespace\\?\COM1
. Generally, I can access all the symlinks from the WinObj manager with these two. So it seems we can expect either a disk char or a device string after a namespace prefix.
I have digged deeper into this and it appears that:
\\?\
, \\.\
and \??\
all access the same NT object namespace! (here's a picture showing the store https://imgur.com/a/l5KlE38)\??\
is broken under the directory
package, but works with the Win32
API\\?\UNC
is in no way special syntax, it's just another symlink under the GLOBAL??
namespaceGLOBAL??
namespace and write something like \\?\GLOBALROOT\GLOBAL??\UNC\localhost\c$\foo\bar
... which is equivalent to \\localhost\c$\foo\bar
. Or even "\\\\?\\GLOBALROOT\\Device\\Harddisk0\\Partition2\\temp\\lol"
As such... I'm not sure there's a reasonable way to define what a Drive is in light of these possible expressions.
Also relevant: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getvolumepathnamew
Split-drive behavior comparison to python:
python:
"\\localhost\c$\foo\bar" -> ("\\localhost\c$", "\foo\bar")
"\\?\C:\foo\bar" -> ("\\?\C:", "\foo\bar")
"\\.\C:\foo\bar" -> ("\\.\C:", "\foo\bar")
"\\?\UNC\localhost\c$\foo\bar" -> ("\\?\UNC", "\localhost\c$\foo\bar")
"\\.\UNC\localhost\c$\foo\bar" -> ("\\.\UNC", "\localhost\c$\foo\bar")
Haskell:
"\\localhost\c$\foo\bar" -> ("\\localhost\", "c$\foo\bar")
"\\?\C:\foo\bar" -> ("\\?\C:\", "foo\bar")
"\\.\C:\foo\bar" -> ("\\.\C:\", "foo\bar")
"\\?\UNC\localhost\c$\foo\bar" -> ("\\?\\UNC\localhost\", "c$\foo\bar")
"\\.\UNC\localhost\c$\foo\bar" -> ("\\.\","UNC\localhost\c$\foo\bar")
There'll be some questions on how to preserve the uncurry addExtension (splitExtension x) == x' property.
Probably can't be maintained for non-default streams.
EDIT: I tested it and c:foo\bar.txt is the same as foo\bar.txt on my windows 10.
Indeed, I was wrong, forgot that it becomes a relative path. So they're all valid. Just not absolute paths.
Right... but it seems both separators are valid in \.\C:\foo/bar, so it seems the only difference between \?\ and \.\ is that the latter does do expansion etc.
Yes with valid I didn't mean you get an error, I meant it doesn't return what a user might expect. It also returns something the user can't remove using any normal OS tools like Explorer.
Hmm I don't think the \\?\
namespace has no obligation to respect reserved names, while \\.\
does. So devices should really only be used with \\.\
https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#namespaces
\?\, \.\ and \??\ all access the same NT object namespace! (here's a picture showing the store https://imgur.com/a/l5KlE38)
Yes, eventually they all access the same kernel objects, but the way there differs.
\?\UNC is in no way special syntax, it's just another symlink under the GLOBAL?? namespace
It's only special in that it's to replace the definition of the original \\
syntax for network access. While the implementation is just a symlink, there's no other way to access a UNC path without it. So it is still in affect special. You probably want to treat it special such that you'd be able to distinguish between a local path or a network one.
you can escape the GLOBAL?? namespace and write something like \?\GLOBALROOT\GLOBAL??\UNC\localhost\c$\foo\bar... which is equivalent to \localhost\c$\foo\bar. Or even "\\?\GLOBALROOT\Device\Harddisk0\Partition2\temp\lol"
Or Volume{..}
, etc, but this doesn't change the behavior of the lexer does it?
As such... I'm not sure there's a reasonable way to define what a Drive is in light of these possible expressions.
I think a drive should stay what 99% of users expect, namely just the user level object mappings for local paths. e.g. C:\
for C:\fooo
and perhaps for \\?\C:\
.
I would expect most users to use the well known PATH syntax and these differences are only important for GHC or specialized program where I assume the programmer knows what they're doing and so wouldn't call
Split-drive behavior comparison to python:
python: "\\localhost\c$\foo\bar" -> ("\\localhost\c$", "\foo\bar") "\\?\C:\foo\bar" -> ("\\?\C:", "\foo\bar") "\\.\C:\foo\bar" -> ("\\.\C:", "\foo\bar")
The one makes more sense to me than the Haskell one,
"\?\UNC\localhost\c$\foo\bar" -> ("\?\UNC", "\localhost\c$\foo\bar") "\.\UNC\localhost\c$\foo\bar" -> ("\.\UNC", "\localhost\c$\foo\bar")
These don't make sense to me, I would have expected \\?\\UNC\localhost\c$
as that's the actual "root" of the object you need to access.
Haskell: "\localhost\c$\foo\bar" -> ("\localhost\", "c$\foo\bar") "\?\C:\foo\bar" -> ("\?\C:\", "foo\bar") "\.\C:\foo\bar" -> ("\.\C:\", "foo\bar") "\?\UNC\localhost\c$\foo\bar" -> ("\?\UNC\localhost\", "c$\foo\bar") "\.\UNC\localhost\c$\foo\bar" -> ("\.\","UNC\localhost\c$\foo\bar")
hmm somewhat inconsistent
Probably can't be maintained for non-default streams.
What if we do this:
> splitExtension "lol.txt::$DATA"
("lol::$DATA",".txt")
> addExtension "lol::$DATA" ".txt"
"lol.txt::$DATA"
I think that's somewhat consistent. So everything after the first :
is considered a Data stream identifier.
Yes with valid I didn't mean you get an error, I meant it doesn't return what a user might expect. It also returns something the user can't remove using any normal OS tools like Explorer.
Hmm I don't think the \?\ namespace has no obligation to respect reserved names, while \.\ does. So devices should really only be used with \.\ https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#namespaces
To clarify: createDirectory "\\\\.\\C:\\foo/bar"
creates the directory C:\foo\bar
here... that's what I meant. So \\.\
seems to allow path interpretation by the windows API. As a result, I'd only ignore unix /
path separators for file namespace \\?\
.
The rest of the intricacies can be left to the user, who chose to go down this path.
Or Volume{..}, etc, but this doesn't change the behavior of the lexer does it?
Well, yes potentially... because the question is what we consider a Drive.
I think there are a couple of choices:
\\.\C:\foo\bar
-> \\.\C:\
\\.\UNC\localhost\c$\foo\bar
-> \\.\UNC\
\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar
-> \\.\GLOBALROOT\
UNC
specially (probably the most backwards compatible way):
\\.\C:\foo\bar
-> \\.\C:\
\\.\UNC\localhost\c$\foo\bar
-> \\.\UNC\localhost\
\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar
-> \\.\GLOBALROOT\
\\.\C:\foo\bar
-> \\.\C:\
\\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\
\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar
-> \\.\GLOBALROOT\Device\Harddisk0\Partition2\
The one makes more sense to me than the Haskell one,
Yes, I think that our current "\\localhost\c$\foo\bar" -> ("\\localhost\", "c$\foo\bar")
is somewhat confusing, but fixing that would be a breaking change.
What if we do this:
> splitExtension "lol.txt::$DATA" ("lol::$DATA",".txt") > addExtension "lol::$DATA" ".txt" "lol.txt::$DATA"
I think that's somewhat consistent. So everything after the first
:
is considered a Data stream identifier.
That makes some sense, but needs to be documented, it may not be what one intuitively expects.
I think there are a couple of choices:
1. only ever consider the first path component as the drive (the simplest, will be a breaking change): * `\\.\C:\foo\bar` -> `\\.\C:\` * `\\.\UNC\localhost\c$\foo\bar` -> `\\.\UNC\` * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\` 2. same as 1., except that we treat `UNC` specially (probably the most backwards compatible way): * `\\.\C:\foo\bar` -> `\\.\C:\` * `\\.\UNC\localhost\c$\foo\bar` -> `\\.\UNC\localhost\` * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\` 3. try to interpret as much as possible (we'll probably never be exhaustive): * `\\.\C:\foo\bar` -> `\\.\C:\` * `\\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\` * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\Device\Harddisk0\Partition2\`
I find 3. the most logical. Handling the common cases should get the most bang for bucks. I would expect very little, if any programs to use \\.\GLOBALROOT\
etc.
But I personally think that it should treat the first $
the same as the first :
. That seems conceptually more correct, even if a
non-backwards compatible change.
> * `\\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\`
To me says Store in the foo\bar
folder on the C
drive on localhost
. So logically I expect the "drive" to contain c$
since
drive + newpath should still make sense and more consistent with \\.\C:\foo\bar
. I really consider the old behavior a bug here.
With this inconsistency a user would need to know whether the patch is a local or UNC one when they call something that returns the drive, in order to get the c$
as well.
The one makes more sense to me than the Haskell one,
Yes, I think that our current
"\\localhost\c$\foo\bar" -> ("\\localhost\", "c$\foo\bar")
is somewhat confusing, but fixing that would be a breaking change.
Wouldn't it make more sense to release it as a major version and take the breaking change? or a new name for the new behavior? Could always ask what people think about it.
Alright... I tried to come up with a lax grammar around windows filepaths, based on the UNC grammar from MS docs: https://gist.github.com/hasufell/ce7c18ef3ad44d1d59b548206cdcb59d
For info, @hasufell, I never expected this to use a real parser (and for dependency reasons it probably won't) - your notes seem plausible along the lines of what I was thinking, but if the ADT parser works better that seems reasonable too. This was more a direction of travel I thought would be good than precise thoughts.
For info, @hasufell, I never expected this to use a real parser (and for dependency reasons it probably won't) - your notes seem plausible along the lines of what I was thinking, but if the ADT parser works better that seems reasonable too. This was more a direction of travel I thought would be good than precise thoughts.
I just handrolled a parser that will work for both [Char]
and [Word8]
/[Word16]
(the latter will be relevant for the AFPP), see https://github.com/haskell/filepath/pull/99/files
Pulling out from https://github.com/haskell/filepath/issues/12#issuecomment-64000192, I think most path/drive based functions (but not extension functions) should first parse their path, then modify, then render it. That gives a much smaller "trusted core" of FilePath.
We would rely on the property
display . parse == id
, so likely UNC would need extending to say which type of separators it was, whether it had?
etc.