Open hasufell opened 3 years ago
pinging some folks @snoyberg @emilypi @kleidukos @sclv @gbaz @phadej @bgamari
Ccing @Mistuke
Thanks for putting the time into pushing this forward, @hasufell! Some initial thoughts:
AbstractFilePath
rather be a newtype
? Afterall, not all strings are paths. If it were a newtype
we would need to also provide oss :: QuasiQuoter
and associated combinatorsAbstractFilePath
module expected to be imported qualified
? If so, I think that:
(un)?packAFP
should rather be called (un)?pack
(for consistency with bytestring
and text
)joinPath
should just be called join
splitPath
should just be called split
bsToAFP
shouldn't be abbreviated; rather it should be spelled byteStringToAFP
(for consistency with, e.g., the Data.ByteString.Builder
combinatorsmakeValid
are very vague; if we want this operation it should be well-documented and stableOsWord
is a bit confusing since it actually refers to a codepoint (although under POSIX it is a codepoint in the trivial ASCII encoding)fromChar
return a Maybe
? (e.g. what is fromChar 'ᝆ'
under POSIX?)/
in pathSeparators
on Windows? It seems the answer is "yes" but I don't believe that CreateFile
would accept such a path.s/windows/Windows
@bgamari
Should AbstractFilePath rather be a newtype? Afterall, not all strings are paths. If it were a newtype we would need to also provide oss :: QuasiQuoter and associated combinators
I had given this some thought, the options were as follows:
type AbstractFilePath = OsString
(or vice versa). This is similar to type FilePath = String
. The disadvantage is that we lose the type level distinction and it's more like a hint.AbstractFilePath
does not ensure filepath validity anyway (since there is no definite list of what is allowed, only hints. Even the filesystem may have an impact on what is valid. As such we only maintain the encoding). This approach may be more interesting if AbstractFilePath
had more invariants, such as relative vs absolute on type level (compare with Path and hpath libraries).So my argument is: AbstractFilePath
doesn't maintain further restrictions over OsString
and it probably shouldn't (since these are on e.g. windows fuzzy and not strict, depend on filesystem used, possibly windows version etc etc).
This library/proposal is only about encoding correctness and memory efficiency. Having more guarantees about filepaths on type level is out of scope, IMO. A user-space library can provide these things (there are hpath, path and others and they make different trade-offs, the design space isn't small).
Is the AbstractFilePath module expected to be imported qualified?
I have no strong opinion, what's your verdict?
the semantics of makeValid are very vague; if we want this operation it should be well-documented and stable
True. However, most (not all) functions are ported from filepath
: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Posix.html#v:makeValid
the name OsWord is a bit confusing since it actually refers to a codepoint (although under POSIX it is a codepoint in the trivial ASCII encoding)
What name do you suggest?
Is / in pathSeparators on Windows? It seems the answer is "yes" but I don't believe that CreateFile would accept such a path.
Yes and I believe it does. The filepath
package handles it the same way: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Windows.html#v:pathSeparators
Also see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:
"The name of the file or device to be created or opened. You may use either forward slashes (/) or backslashes (\) in this name."
What representation, if any, are named streams given on Windows?
I'm not too familiar with those? Why are they relevant for filenames? Does the filepath
package handle those somehow?
So my argument is: AbstractFilePath doesn't maintain further restrictions over OsString and it probably shouldn't (since these are on e.g. windows fuzzy and not strict, depend on filesystem used, possibly windows version etc etc).
Fair.
Is the AbstractFilePath module expected to be imported qualified?
I have no strong opinion, what's your verdict?
I think it should be treated as a qualified import.
True. However, most (not all) functions are ported from filepath: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Posix.html#v:makeValid
Fair, but I think we should take this opportunity to clarify the semantics (or at least clearly document that the function will merely make a fuzzy "best-effort" attempt at fixing up the path.
What name do you suggest?
OsChar
perhaps? This slightly conflates characters and codepoints, but this is a fairly common conflation and I would argue that it's clearer than OsWord
.
Also see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:
Ahh, good catch.
What representation, if any, are named streams given on Windows?
I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?
See https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams. In short, files in Windows are in fact more like directories, containing multiple named "streams" of content (vaguely reminiscent of extended attributes in the POSIX world). The "default" stream is named $DATA
. As far as I know this is a quite rarely used feature so we would probably be fine with ignoring it (as, I believe, filepath
does).
Is / in pathSeparators on Windows? It seems the answer is "yes" but I don't believe that CreateFile would accept such a path.
Yes and I believe it does. The
filepath
package handles it the same way: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Windows.html#v:pathSeparatorsAlso see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:
"The name of the file or device to be created or opened. You may use either forward slashes (/) or backslashes () in this name."
This is only true when using the APIs in "legacy" mode. In "legacy" mode the Win32 API calls will translate paths from their legacy form e.g. C:\Windows\foo
into the form the driver actually understands \\?\C:\Windows\foo
. One of the main differences is that the path is no longer pre-processed by the user-mode part of the Window API[0].
This results in accesses being faster, but also in an application not being restricted to legacy issues (e.g. MAX_PATH
). But this also means various things become invalid, primarily that /
in paths are no longer seen as path separators and that relative paths are invalid, or even just using .
and ..
in paths as the driver is not aware of the program's PWD so it cannot possibly resolve them.
Haskell programs use these filepaths internally, but at the moment we constantly have to translate between the "legacy" ones and the ones we use internally. I would expect a new API to hopefully remove this requirement.
Note that Windows has a number of these namespaces \\?\, \\.\, \\??\, \Device\
etc. Also drive letters are a user-mode concept, so a path such as \\?\Volume{4321eefsf-57fa-44sc-a5b4-d3bf28a54dft}\Windows\foo
is perfectly valid.
A modern version of GHC will accept these and do the right thing, but filepath
can't be used to maintain them so we do so in very low levels of the runtime.
[0] https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file
What representation, if any, are named streams given on Windows?
I'm not too familiar with those? Why are they relevant for filenames? Does the
filepath
package handle those somehow?
They're represented by using :<stream-name>
after the file name. So foo.c:h1
and foo.c:h2
refer to the same file but different streams. Currently filepath
doesn't handle them which causes failures when doing things such as
Prelude System.FilePath> takeExtension "hello.c:foo"
".c:foo"
which is invalid.
What representation, if any, are named streams given on Windows?
I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?
See https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams. In short, files in Windows are in fact more like directories, containing multiple named "streams" of content (vaguely reminiscent of extended attributes in the POSIX world). The "default" stream is named
$DATA
. As far as I know this is a quite rarely used feature so we would probably be fine with ignoring it (as, I believe,filepath
does).
Yes, filepath largely ignores it, but does corrupt things when they're used. e.g takeExtension
for instance is broken. They are actually used more often then one would think. For instance have you ever wondered how Windows knows when a file was downloaded from the internet with a Microsoft browser? It's data stored in a stream. Though I would agree that it's a pretty obscure feature :) I'd argue the package doesn't have to be able to manipulate it. just do things correct.
The blame for this feature though lies with Mac. Data streams were added to NTFS to support interop with Macs. Macs also support this concept in HFS. The neutral term for this is called File Forks
. See https://en.wikipedia.org/wiki/Fork_(file_system)
I don't like the design where depending on the build target the API changes. I'd need modules for Windows and POSIX paths (like filepath
has). Having a module which is either depending on the target system is fine and e.g. directory
primitives would use it. It's common enough that a *nix program has to process windows file paths and vice versa, but doesn't do any IO with them.
In perfect world e.g. tarball/zipfile internal paths could be fitted into the same story, etc. though POSIX paths are close enough, maybe?
@phadej
I'd need modules for Windows and POSIX paths
There is. Half of the library is dealing with exactly this concern:
@Mistuke @bgamari
I would expect a new API to hopefully remove this requirement.
Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs
I don't like the design where depending on the build target the API changes. I'd need modules for Windows and POSIX paths (like
filepath
has). Having a module which is either depending on the target system is fine and e.g.directory
primitives would use it. It's common enough that a *nix program has to process windows file paths and vice versa, but doesn't do any IO with them.In perfect world e.g. tarball/zipfile internal paths could be fitted into the same story, etc. though POSIX paths are close enough, maybe?
But the reality is that this doesn't match the real world. At some point you have to have platform abstractions. Not doing so is how Windows builds ended up being so much slower than the rest.
Assuming a POSIX view of the world has caused so much trouble on non-POSIX systems that really I wish we wouldn't. The world is just not POSIX.
To me doing this honestly provides no extra benefits at all over using paths as strings. I have to heavily process them anyway for any use, so I might as well use a representation that's easily read from C then.
@Mistuke @bgamari
I would expect a new API to hopefully remove this requirement.
Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs
@hasufell I don't believe anything has to necessarily change from the API point of view (obviously ideally a path doesn't have to keep repeatedly testing what kind of path it is).
But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.
It also makes an assumption that badElements
is invalid for all paths on Windows. Which is also not true. e.g. \\.\COM1
is perfectly valid and points to the first COM port.
It makes an attempt to distinguish between the different namespaces. As in, it recognizes \\?\
but then makeValid
doesn't seem to.
So it's the inconsistencies that is problematic. What I'm asking for is that any path out of the functions are semantically and syntactically valid..
@hasufell so having read the proposal, (I believe that I've discussed this with @hvr before) I really like that the filepaths on Windows are represented as Utf16. This will save an immense amount of work..
The only thing I am wondering about is whether OsString
should be a datatype so you can have multiple constructors or a different way to represent what kind of Path something is.
The rational is that today whenever a native Windows path is used I can't assume that each manipulation by filepaths keeps a valid path. Which means at the usage point I have to ensure valid.
But if I can assume that usages of afp keeps them valid that'll save a lot of processing. Of course I don't want that processing to move into afp (so that one has to check what kind of path something is on every command).
If not possible then fine, the Utf16 representation is already a major improvement!
@Mistuke
@hasufell so having read the proposal, (I believe that I've discussed this with @hvr before) I really like that the filepaths on Windows are represented as Utf16. This will save an immense amount of work..
The only thing I am wondering about is whether
OsString
should be a datatype so you can have multiple constructors or a different way to represent what kind of Path something is.
I don't think so. OsString
is an abstraction around the current platforms syscall encoding.
If you want different codepaths, you can use WindowsString
and PosixString
respectively.
If you write a low-level library based on AFPP, you simply ifdef around the platform and both codepaths use different inner cosntructors WS
and PS
, see here https://github.com/hasufell/abstract-filepath/blob/2f54c68cdff99d171178e2a15421f5c92b173a7a/lib/AFP/OsString/Common.hs#L116-L121
You can't mix them up if you're wrapping them inside OsString, it will be a compile error. That's the point.
The rational is that today whenever a native Windows path is used I can't assume that each manipulation by filepaths keeps a valid path. Which means at the usage point I have to ensure valid.
But if I can assume that usages of afp keeps them valid that'll save a lot of processing. Of course I don't want that processing to move into afp (so that one has to check what kind of path something is on every command).
If not possible then fine, the Utf16 representation is already a major improvement!
On unix, only NUL bytes are invalid. Everything else goes.
From my research, windows doesn't have a definite answer about what is a valid filepath and it may depend on filesystem, windows API or the phase of the moon. As such I don't think this library should try to enforce such properties. You should be free to pass junk to syscalls if you like so.
https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions looks like just conventions, but some invalid paths may still succeed, but then fail in other ways. This is up to user space library to figure out.
@Mistuke
@Mistuke @bgamari
I would expect a new API to hopefully remove this requirement.
Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs
@hasufell I don't believe anything has to necessarily change from the API point of view (obviously ideally a path doesn't have to keep repeatedly testing what kind of path it is).
But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.
It also makes an assumption that
badElements
is invalid for all paths on Windows. Which is also not true. e.g.\\.\COM1
is perfectly valid and points to the first COM port.It makes an attempt to distinguish between the different namespaces. As in, it recognizes
\\?\
but thenmakeValid
doesn't seem to.So it's the inconsistencies that is problematic. What I'm asking for is that any path out of the functions are semantically and syntactically valid..
You clearly know more about these things than me, so I'd appreciate a PR.
I took filepath
simply as a base. I'm fine to deviate from it.
I split two packages out:
I've finished the Win32 migration (only the subset of the functions that had their API changed are different, the rest are re-exports): https://github.com/hasufell/abstract-filepath/commit/52567731c718de37275324fe306971e888245446
CCing @bodigrim
I'll re-iterate and update the state of the efforts briefly.
So far, the CLC has "ack'ed" these efforts, but I have no explicit consent from Win32 or unix package maintainers. These will be the first core packages that need to support AFPP (as additional API). The types need to be put somewhere as well, @snoyberg suggested primitive
as a "not-quite-base".
I put temporary haddocks here: http://files.hasufell.de/AFPP/
I believe the core implementation is done.
We now have these utility packages that I already put on hackage:
bytestring
has a very small API for ShortByteString
, this added a lot of new stuff. Additionally, the quirky Data.ByteString.Short.Word16
module was added, which is used by the windows callstack. This needs special attention.AFPP packages are pretty much done:
filepath
API (may not be 100% the same, but I don't see why it should... @Mistuke also made some comments about changes that could be made for windows, but I'm still waiting for a PR)unsafeCoerce
in a few placesunsafeCoerce
in a few placesabstract-filepath-unix
and abstract-filepath-Win32
are complementory packages, they don't replace unix
or Win32
.
I've started some example migrations of my own packages to AFPP:
directory
and path
packages and provides filesystem operations as well as strict path types... this will be my testbed for benchmarks. The Win32 portion of the API is still WIP, but should be finished soon)Ultimately, at least abstract-filepath-unix
and abstract-filepath-Win32
need to be moved into the respective packages, because maintaining these as subset forks will be time consuming. As said, here I don't have explicit consent, so I'm not sure how or if this can move forward.
After I finish migrating hpath
, I will start conducting benchmarks. I'm not too familiar with that. I've put up a ticket https://github.com/hasufell/abstract-filepath/issues/5. This will likely require looking at https://well-typed.com/blog/2021/01/fragmentation-deeper-look/ and ghc-debug to make useful assessments about improvement in memory fragmentation.
@Mistuke ...I'm currently taking a deeper look into windows paths
But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.
Can you be more specific? Afais there are these types:
C:\
\\?\C:\
\\.\C:\
\\share
\\?\UNC\share
\\.\UNC\share
> fmap isAbsolute ["C:\\","\\\\?\\C:\\","\\\\.\\C:\\","\\\\share","\\\\?\\UNC\\share","\\\\.\\UNC\\share"]
[True,True,True,True,True,True]
> fmap isValid ["C:\\","\\\\?\\C:\\","\\\\.\\C:\\","\\\\share","\\\\?\\UNC\\share","\\\\.\\UNC\\share"]
[True,True,False,True,False,True]
So isValid
chokes only on the 5th (and by accident returns True
for the 6th and 3rd I think). Is that correct?
It also makes an assumption that badElements is invalid for all paths on Windows. Which is also not true. e.g.
\\.\COM1
is perfectly valid and points to the first COM port. It makes an attempt to distinguish between the different namespaces. As in, it recognizes\\?\
but then makeValid doesn't seem to.
Ok, so I think isValid
and makeValid
should just be no-ops when the long-path prefix is discovered? Because windows API disables parsing for these and just passes the string unchanged to the syscall. Of course, that one may choke, but I think we can assume that \\?\
denotes "I know what I'm doing".
This also aligns with the discussion at https://github.com/haskell/filepath/issues/56
I wonder what's the best way forward with shortbytestring
. Integrate it into bytestring
codebase? Make it an upstream dependency of bytestring
? Leave it as a separate "utility" package in between of bytestring
and filepath
?
I'm exceedingly busy with text-2.0
at the moment, but hopefully will get time to ponder about it in January.
I wonder what's the best way forward with shortbytestring. Integrate it into bytestring codebase? Make it an upstream dependency of bytestring? Leave it as a separate "utility" package in between of bytestring and filepath?
I think there's some code sharing in the test suite, where I had to cut out all things that aren't shortbytestring relevant. Not sure that's a good argument for merging them. It might make sense to have a split. I think small scope is good for PVP, so messing with shortbytestring (which is maybe more likely when it's started to be heavily used as a new dependency of filepath) doesn't impact the version of bytestring.
Code can be reviewed here: https://github.com/hasufell/abstract-filepath/pull/4
Rendered docs are here:
Several issues for discussion have also been opened here: https://github.com/hasufell/abstract-filepath/issues
ML thread: https://mail.haskell.org/pipermail/libraries/2021-August/031427.html
Original proposal: https://gitlab.haskell.org/ghc/ghc/-/wikis/proposal/abstract-file-path
Old discourse thread: https://discourse.haskell.org/t/reviving-the-abstract-filepath-proposal-afpp-in-user-space/2344