haskell / filepath

Haskell FilePath core library
BSD 3-Clause "New" or "Revised" License
66 stars 32 forks source link

Reimplement path functions around ADT #53

Open ndmitchell opened 8 years ago

ndmitchell commented 8 years ago

Pulling out from https://github.com/haskell/filepath/issues/12#issuecomment-64000192, I think most path/drive based functions (but not extension functions) should first parse their path, then modify, then render it. That gives a much smaller "trusted core" of FilePath.

data Lexeme = Drive Char | UNC String | Separator Char | Path String

parse :: String -> [Lexeme]

display :: [Lexeme] -> String
display = concatMap $ \x -> case x of
    Drive x -> [x,':'] -- x will be an ASCII character
    UNC x -> '\\':'\\':x -- x will be a UNC name, not containing any pathSeparators
    Separator x -> x -- x will be a member of pathSeparators
    Path x -> x -- x will not contain any pathSeparators

We would rely on the property display . parse == id, so likely UNC would need extending to say which type of separators it was, whether it had ? etc.

SwiftsNamesake commented 7 years ago

@ndmitchell Any news on this front? Is the intention to have strongly typed paths eventually?

ndmitchell commented 7 years ago

No progress on this front at all, not looked at it at all. There is no intention to change the API with this work - it's about cleaning up the core and ensuring consistency.

hasufell commented 2 years ago

@ndmitchell I came up with the following. This is for windows only for now, doesn't use a real parser since it's rather small and passes the display (parse x) === x property. Thoughts?

data Lexeme = NS NameSpace
            | Disk Char
            | Device String
            | Share String
            | Separators [Char]
            | FileName String
  deriving Show

data NameSpace = FileNameSpace
               | DeviceNameSpace
               | NTNameSpace
  deriving Show

-- | Parse a filepath into lexemes.
--
-- > display (parse x) === x
parse :: String -> [Lexeme]
parse fp'
  -- '\\?\UNC\share\path'
  | Just (lx, rest) <- parseExtendedUNC fp' = lx         ++ parsePath rest
  -- '\\?\C:\path', '\\.\C:\path', '\??\C:\path'
  | Just (lx1, r1)  <- parseNameSpace fp'
  , Just (lx2, r2)  <- parseDisk r1         = lx1 ++ lx2 ++ parsePath r2
  -- '\\.\COM1'
  | Just (lx1@(NS DeviceNameSpace:_), r1) <- parseNameSpace fp'
  , Just (lx2, r2)  <- parseDevice r1       = lx1 ++ lx2 ++ parsePath r2
  -- '\\?\some\other', '\\.\some\other', '\??\some\other'
  -- detects no device/disk
  | Just (lx, rest) <- parseNameSpace fp'   = lx         ++ parsePath rest
  -- 'C:\path'
  | Just (lx, rest) <- parseDisk fp'        = lx         ++ parsePath rest
  -- '\\share\path'
  | Just (lx, rest) <- parseDriveShare fp'  = lx         ++ parsePath rest
  -- 'relative\path' and everything else
  | otherwise = parsePath fp'
 where
  parsePath :: String -> [Lexeme]
  parsePath fp = 
    case parseFileName fp <|> parseSeparators fp of
      Nothing         -> []
      Just (lx, [])   -> lx
      Just (lx, rest) -> lx ++ parsePath rest

  parseFileName :: String -> Maybe ([Lexeme], String)
  parseFileName fp =
    case break isPathSeparator fp of
      ([], _) -> Nothing
      (a, r)  -> Just ([FileName a], r)

  parseSeparators :: String -> Maybe ([Lexeme], String)
  parseSeparators fp
   | null a = Nothing
   | otherwise = Just ([Separators a], b)
      where (a, b) = span isPathSeparator fp

  -- prefix only
  parseExtendedUNC :: String -> Maybe ([Lexeme], String)
  parseExtendedUNC ('\\':'\\' :'?':'\\':'U':'N':'C':s4:r1) | isPathSeparator s4 = do
    (lx, r2) <- parseSeparators (s4:r1)
    (share, r3) <- parseDriveShareName r2
    pure ([NS FileNameSpace] ++ lx ++ share, r3)
  parseExtendedUNC _ = Nothing

  -- prefix only
  parseNameSpace :: String -> Maybe ([Lexeme], String)
  parseNameSpace ('\\':'?' :'?':'\\':rest) = Just ([NS NTNameSpace    ], rest)
  parseNameSpace ('\\':'\\':'?':'\\':rest) = Just ([NS FileNameSpace  ], rest)
  parseNameSpace ('\\':'\\':'.':'\\':rest) = Just ([NS DeviceNameSpace], rest)
  parseNameSpace _                         = Nothing

  -- prefix or after any NameSpace
  parseDisk :: String -> Maybe ([Lexeme], String)
  parseDisk (x:':':rest) | isLetter x = Just ([Disk x], rest)
  parseDisk _ = Nothing

  parseDevice :: String -> Maybe ([Lexeme], String)
  parseDevice fp =
    case break isPathSeparator fp of
      ([], _) -> Nothing
      (a, r)  -> Just ([Device a], r)

  -- \\sharename\
  parseDriveShare :: String -> Maybe ([Lexeme], String)
  parseDriveShare (s1:s2:xs) | isPathSeparator s1 && isPathSeparator s2 = do
    (lx, rest) <- parseDriveShareName xs
    pure ([Separators [s1, s2]] ++ lx, rest)
  parseDriveShare _ = Nothing

  -- assume you have already seen \\
  -- share\bob -> "share\", "bob"
  parseDriveShareName :: String -> Maybe ([Lexeme], String)
  parseDriveShareName name = do
    case break isPathSeparator name of
      ([], _) -> Nothing
      (a, b)  -> let (lx, r) = optional b parseSeparators
                 in Just ([Share a] ++ lx, r)

  optional :: String -> (String -> Maybe ([Lexeme], String)) -> ([Lexeme], String)
  optional input parser = maybe ([], input) id (parser input)

display :: [Lexeme] -> String
display lxs = case lxs of
                (ns@(NS FileNameSpace):sep@(Separators _):share@(Share _):rest) -> 
                  d ns ++ "UNC" ++ d sep ++ d share ++ display rest
                other -> concatMap d other
 where
  d :: Lexeme -> String
  d (NS FileNameSpace)   = "\\\\?\\"
  d (NS DeviceNameSpace) = "\\\\.\\"
  d (NS NTNameSpace)     = "\\??\\"
  d (Disk c)             = c:":"
  d (Device dev)         = dev
  d (Share share)        = share
  d (Separators sep)     = sep
  d (FileName fn)        = fn
Mistuke commented 2 years ago

@hasufell that looks great and handles all the paths I'd expected. The only question I have is how does it handle / as a separator. / is a valid path separator in a on-namespaced path but isn't in a namespace. i.e \\?\C:\foo/bar refers to a file called foo/bar.

Also /foo\bar is a valid relative path, leading / are relative on Windows.

This next one is arcane so feel free not to support it, but on a non-hierarchical path the \ after the : for the drive letter is optional. So c:foo.txt and c:foo are valid but c:foo\bar.txt and c:foo\bar aren't.

Lastly I'll leave it up to you whether to support file streams or not, they a they aren't very common https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams

hasufell commented 2 years ago

@Mistuke

Lastly I'll leave it up to you whether to support file streams or not, they a they aren't very common https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams

That actually makes sense in the light of a security vulnerability, which broke the handling of file extensions.

E.g. current filepath behavior is this (which is incorrect):

> splitExtension "lol.txt::$DATA"
("lol",".txt::$DATA")

There'll be some questions on how to preserve the uncurry addExtension (splitExtension x) == x' property.


This next one is arcane so feel free not to support it, but on a non-hierarchical path the \ after the : for the drive letter is optional. So c:foo.txt and c:foo are valid but c:foo\bar.txt and c:foo\bar aren't.

Right... but the lexer should probably parse it anyway. And then we reject it in e.g. isValid?

EDIT: I tested it and c:foo\bar.txt is the same as foo\bar.txt on my windows 10.


/ is a valid path separator in a on-namespaced path but isn't in a namespace. i.e \?\C:\foo/bar refers to a file called foo/bar.

Right... but it seems both separators are valid in \\.\C:\foo/bar, so it seems the only difference between \\?\ and \\.\ is that the latter does do expansion etc. I'm not sure about \??\ (see below).


Other things I'm not sure about/still testing:

  1. NT object namespace doesn't work at all for me, e.g. \??\C: throws invalid argument (tried readFile/writeFile/listDirectory, ...), although the WinObj manager shows under GLOBAL?? that this symlink exists (also see this SU answer: https://superuser.com/a/1096784/854039)... I can access all these symlinks via \\?\ and \\.\, though, which is rather confusing
  2. The use of Device lexeme, which is supposed to parse COM1 in '\\.\COM1'... seems to also work with file namespace \\?\COM1. Generally, I can access all the symlinks from the WinObj manager with these two. So it seems we can expect either a disk char or a device string after a namespace prefix.
  3. Silly paths like '\\.\COM1\lol\bar' still seem to correctly access the COM1 port...
  4. Generally our notion of drive is this (correct?):
    • on linux, there's only one drive, which is the root folder /
    • on windows, a drive may be either:
      • a Disk char (e.g. C:)
      • a device string (e.g. COM1)
      • a share name (e.g. \\localhost)
  5. Do any absolute paths exist that don't have a drive as described in 4.?
hasufell commented 2 years ago
  1. NT object namespace doesn't work at all for me, e.g. \??\C: throws invalid argument (tried readFile/writeFile/listDirectory, ...), although the WinObj manager shows under GLOBAL?? that this symlink exists (also see this SU answer: https://superuser.com/a/1096784/854039)... I can access all these symlinks via \\?\ and \\.\, though, which is rather confusing
  2. The use of Device lexeme, which is supposed to parse COM1 in '\\.\COM1'... seems to also work with file namespace \\?\COM1. Generally, I can access all the symlinks from the WinObj manager with these two. So it seems we can expect either a disk char or a device string after a namespace prefix.

I have digged deeper into this and it appears that:

  1. \\?\, \\.\ and \??\ all access the same NT object namespace! (here's a picture showing the store https://imgur.com/a/l5KlE38)
  2. \??\ is broken under the directory package, but works with the Win32 API
  3. There are only subtle differences between the three forms, e.g. described in https://stackoverflow.com/a/25099634
  4. \\?\UNC is in no way special syntax, it's just another symlink under the GLOBAL?? namespace
  5. you can escape the GLOBAL?? namespace and write something like \\?\GLOBALROOT\GLOBAL??\UNC\localhost\c$\foo\bar... which is equivalent to \\localhost\c$\foo\bar. Or even "\\\\?\\GLOBALROOT\\Device\\Harddisk0\\Partition2\\temp\\lol"

As such... I'm not sure there's a reasonable way to define what a Drive is in light of these possible expressions.

Also relevant: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getvolumepathnamew

hasufell commented 2 years ago

Split-drive behavior comparison to python:

python:
  "\\localhost\c$\foo\bar"       -> ("\\localhost\c$", "\foo\bar")
  "\\?\C:\foo\bar"               -> ("\\?\C:", "\foo\bar")
  "\\.\C:\foo\bar"               -> ("\\.\C:", "\foo\bar")
  "\\?\UNC\localhost\c$\foo\bar" -> ("\\?\UNC", "\localhost\c$\foo\bar")
  "\\.\UNC\localhost\c$\foo\bar" -> ("\\.\UNC", "\localhost\c$\foo\bar")

Haskell:
  "\\localhost\c$\foo\bar"       -> ("\\localhost\", "c$\foo\bar")
  "\\?\C:\foo\bar"               -> ("\\?\C:\", "foo\bar")
  "\\.\C:\foo\bar"               -> ("\\.\C:\", "foo\bar")
  "\\?\UNC\localhost\c$\foo\bar" -> ("\\?\\UNC\localhost\", "c$\foo\bar")
  "\\.\UNC\localhost\c$\foo\bar" -> ("\\.\","UNC\localhost\c$\foo\bar")
Mistuke commented 2 years ago

There'll be some questions on how to preserve the uncurry addExtension (splitExtension x) == x' property.

Probably can't be maintained for non-default streams.

EDIT: I tested it and c:foo\bar.txt is the same as foo\bar.txt on my windows 10.

Indeed, I was wrong, forgot that it becomes a relative path. So they're all valid. Just not absolute paths.

Right... but it seems both separators are valid in \.\C:\foo/bar, so it seems the only difference between \?\ and \.\ is that the latter does do expansion etc.

Yes with valid I didn't mean you get an error, I meant it doesn't return what a user might expect. It also returns something the user can't remove using any normal OS tools like Explorer.

Hmm I don't think the \\?\ namespace has no obligation to respect reserved names, while \\.\ does. So devices should really only be used with \\.\ https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#namespaces

\?\, \.\ and \??\ all access the same NT object namespace! (here's a picture showing the store https://imgur.com/a/l5KlE38)

Yes, eventually they all access the same kernel objects, but the way there differs.

\?\UNC is in no way special syntax, it's just another symlink under the GLOBAL?? namespace

It's only special in that it's to replace the definition of the original \\ syntax for network access. While the implementation is just a symlink, there's no other way to access a UNC path without it. So it is still in affect special. You probably want to treat it special such that you'd be able to distinguish between a local path or a network one.

you can escape the GLOBAL?? namespace and write something like \?\GLOBALROOT\GLOBAL??\UNC\localhost\c$\foo\bar... which is equivalent to \localhost\c$\foo\bar. Or even "\\?\GLOBALROOT\Device\Harddisk0\Partition2\temp\lol"

Or Volume{..}, etc, but this doesn't change the behavior of the lexer does it?

As such... I'm not sure there's a reasonable way to define what a Drive is in light of these possible expressions.

I think a drive should stay what 99% of users expect, namely just the user level object mappings for local paths. e.g. C:\ for C:\fooo and perhaps for \\?\C:\.

I would expect most users to use the well known PATH syntax and these differences are only important for GHC or specialized program where I assume the programmer knows what they're doing and so wouldn't call

Split-drive behavior comparison to python:


python:
  "\\localhost\c$\foo\bar"       -> ("\\localhost\c$", "\foo\bar")
  "\\?\C:\foo\bar"               -> ("\\?\C:", "\foo\bar")
  "\\.\C:\foo\bar"               -> ("\\.\C:", "\foo\bar")

The one makes more sense to me than the Haskell one,

"\?\UNC\localhost\c$\foo\bar" -> ("\?\UNC", "\localhost\c$\foo\bar") "\.\UNC\localhost\c$\foo\bar" -> ("\.\UNC", "\localhost\c$\foo\bar")

These don't make sense to me, I would have expected \\?\\UNC\localhost\c$ as that's the actual "root" of the object you need to access.

Haskell: "\localhost\c$\foo\bar" -> ("\localhost\", "c$\foo\bar") "\?\C:\foo\bar" -> ("\?\C:\", "foo\bar") "\.\C:\foo\bar" -> ("\.\C:\", "foo\bar") "\?\UNC\localhost\c$\foo\bar" -> ("\?\UNC\localhost\", "c$\foo\bar") "\.\UNC\localhost\c$\foo\bar" -> ("\.\","UNC\localhost\c$\foo\bar")

hmm somewhat inconsistent

hasufell commented 2 years ago

Probably can't be maintained for non-default streams.

What if we do this:

> splitExtension "lol.txt::$DATA"
("lol::$DATA",".txt")
> addExtension "lol::$DATA" ".txt"
"lol.txt::$DATA"

I think that's somewhat consistent. So everything after the first : is considered a Data stream identifier.

Yes with valid I didn't mean you get an error, I meant it doesn't return what a user might expect. It also returns something the user can't remove using any normal OS tools like Explorer.

Hmm I don't think the \?\ namespace has no obligation to respect reserved names, while \.\ does. So devices should really only be used with \.\ https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#namespaces

To clarify: createDirectory "\\\\.\\C:\\foo/bar" creates the directory C:\foo\bar here... that's what I meant. So \\.\ seems to allow path interpretation by the windows API. As a result, I'd only ignore unix / path separators for file namespace \\?\.

The rest of the intricacies can be left to the user, who chose to go down this path.

Or Volume{..}, etc, but this doesn't change the behavior of the lexer does it?

Well, yes potentially... because the question is what we consider a Drive.

I think there are a couple of choices:

  1. only ever consider the first path component as the drive (the simplest, will be a breaking change):
    • \\.\C:\foo\bar -> \\.\C:\
    • \\.\UNC\localhost\c$\foo\bar -> \\.\UNC\
    • \\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar -> \\.\GLOBALROOT\
  2. same as 1., except that we treat UNC specially (probably the most backwards compatible way):
    • \\.\C:\foo\bar -> \\.\C:\
    • \\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\
    • \\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar -> \\.\GLOBALROOT\
  3. try to interpret as much as possible (we'll probably never be exhaustive):
    • \\.\C:\foo\bar -> \\.\C:\
    • \\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\
    • \\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar -> \\.\GLOBALROOT\Device\Harddisk0\Partition2\

The one makes more sense to me than the Haskell one,

Yes, I think that our current "\\localhost\c$\foo\bar" -> ("\\localhost\", "c$\foo\bar") is somewhat confusing, but fixing that would be a breaking change.

Mistuke commented 2 years ago

What if we do this:

> splitExtension "lol.txt::$DATA"
("lol::$DATA",".txt")
> addExtension "lol::$DATA" ".txt"
"lol.txt::$DATA"

I think that's somewhat consistent. So everything after the first : is considered a Data stream identifier.

That makes some sense, but needs to be documented, it may not be what one intuitively expects.

I think there are a couple of choices:

1. only ever consider the first path component as the drive (the simplest, will be a breaking change):

   * `\\.\C:\foo\bar` -> `\\.\C:\`
   * `\\.\UNC\localhost\c$\foo\bar` -> `\\.\UNC\`
   * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\`

2. same as 1., except that we treat `UNC` specially (probably the most backwards compatible way):

   * `\\.\C:\foo\bar` -> `\\.\C:\`
   * `\\.\UNC\localhost\c$\foo\bar` -> `\\.\UNC\localhost\`
   * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\`

3. try to interpret as much as possible (we'll probably never be exhaustive):

   * `\\.\C:\foo\bar` -> `\\.\C:\`
   * `\\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\`
   * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\Device\Harddisk0\Partition2\`

I find 3. the most logical. Handling the common cases should get the most bang for bucks. I would expect very little, if any programs to use \\.\GLOBALROOT\ etc.

But I personally think that it should treat the first $ the same as the first :. That seems conceptually more correct, even if a non-backwards compatible change.

>        * `\\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\`

To me says Store in the foo\bar folder on the C drive on localhost. So logically I expect the "drive" to contain c$ since drive + newpath should still make sense and more consistent with \\.\C:\foo\bar. I really consider the old behavior a bug here.

With this inconsistency a user would need to know whether the patch is a local or UNC one when they call something that returns the drive, in order to get the c$ as well.

The one makes more sense to me than the Haskell one,

Yes, I think that our current "\\localhost\c$\foo\bar" -> ("\\localhost\", "c$\foo\bar") is somewhat confusing, but fixing that would be a breaking change.

Wouldn't it make more sense to release it as a major version and take the breaking change? or a new name for the new behavior? Could always ask what people think about it.

hasufell commented 2 years ago

Alright... I tried to come up with a lax grammar around windows filepaths, based on the UNC grammar from MS docs: https://gist.github.com/hasufell/ce7c18ef3ad44d1d59b548206cdcb59d

ndmitchell commented 2 years ago

For info, @hasufell, I never expected this to use a real parser (and for dependency reasons it probably won't) - your notes seem plausible along the lines of what I was thinking, but if the ADT parser works better that seems reasonable too. This was more a direction of travel I thought would be good than precise thoughts.

hasufell commented 2 years ago

For info, @hasufell, I never expected this to use a real parser (and for dependency reasons it probably won't) - your notes seem plausible along the lines of what I was thinking, but if the ADT parser works better that seems reasonable too. This was more a direction of travel I thought would be good than precise thoughts.

I just handrolled a parser that will work for both [Char] and [Word8]/[Word16] (the latter will be relevant for the AFPP), see https://github.com/haskell/filepath/pull/99/files