hasufell / abstract-filepath

13 stars 0 forks source link

Gather code and design reviews #10

Open hasufell opened 3 years ago

hasufell commented 3 years ago

Code can be reviewed here: https://github.com/hasufell/abstract-filepath/pull/4

Rendered docs are here:

Several issues for discussion have also been opened here: https://github.com/hasufell/abstract-filepath/issues

ML thread: https://mail.haskell.org/pipermail/libraries/2021-August/031427.html

Original proposal: https://gitlab.haskell.org/ghc/ghc/-/wikis/proposal/abstract-file-path

Old discourse thread: https://discourse.haskell.org/t/reviving-the-abstract-filepath-proposal-afpp-in-user-space/2344

hasufell commented 3 years ago

pinging some folks @snoyberg @emilypi @kleidukos @sclv @gbaz @phadej @bgamari

bgamari commented 3 years ago

Ccing @Mistuke

Thanks for putting the time into pushing this forward, @hasufell! Some initial thoughts:

hasufell commented 3 years ago

@bgamari

Should AbstractFilePath rather be a newtype? Afterall, not all strings are paths. If it were a newtype we would need to also provide oss :: QuasiQuoter and associated combinators

I had given this some thought, the options were as follows:

  1. A type synonym: type AbstractFilePath = OsString (or vice versa). This is similar to type FilePath = String. The disadvantage is that we lose the type level distinction and it's more like a hint.
  2. Use another newtype. This may be unergonomic, given that they actually maintain the same invariants. AbstractFilePath does not ensure filepath validity anyway (since there is no definite list of what is allowed, only hints. Even the filesystem may have an impact on what is valid. As such we only maintain the encoding). This approach may be more interesting if AbstractFilePath had more invariants, such as relative vs absolute on type level (compare with Path and hpath libraries).
  3. Provide only one type for both use cases and rename to e.g. OsString.
  4. Convert non-filepath things to String or Text. This kinda goes against the original proposal of not losing the encoding information and not roundtripping through types.

So my argument is: AbstractFilePath doesn't maintain further restrictions over OsString and it probably shouldn't (since these are on e.g. windows fuzzy and not strict, depend on filesystem used, possibly windows version etc etc). This library/proposal is only about encoding correctness and memory efficiency. Having more guarantees about filepaths on type level is out of scope, IMO. A user-space library can provide these things (there are hpath, path and others and they make different trade-offs, the design space isn't small).

Is the AbstractFilePath module expected to be imported qualified?

I have no strong opinion, what's your verdict?

the semantics of makeValid are very vague; if we want this operation it should be well-documented and stable

True. However, most (not all) functions are ported from filepath: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Posix.html#v:makeValid

the name OsWord is a bit confusing since it actually refers to a codepoint (although under POSIX it is a codepoint in the trivial ASCII encoding)

What name do you suggest?

Is / in pathSeparators on Windows? It seems the answer is "yes" but I don't believe that CreateFile would accept such a path.

Yes and I believe it does. The filepath package handles it the same way: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Windows.html#v:pathSeparators

Also see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:

"The name of the file or device to be created or opened. You may use either forward slashes (/) or backslashes (\) in this name."

What representation, if any, are named streams given on Windows?

I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?

bgamari commented 3 years ago

So my argument is: AbstractFilePath doesn't maintain further restrictions over OsString and it probably shouldn't (since these are on e.g. windows fuzzy and not strict, depend on filesystem used, possibly windows version etc etc).

Fair.

Is the AbstractFilePath module expected to be imported qualified?

I have no strong opinion, what's your verdict?

I think it should be treated as a qualified import.

True. However, most (not all) functions are ported from filepath: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Posix.html#v:makeValid

Fair, but I think we should take this opportunity to clarify the semantics (or at least clearly document that the function will merely make a fuzzy "best-effort" attempt at fixing up the path.

What name do you suggest?

OsChar perhaps? This slightly conflates characters and codepoints, but this is a fairly common conflation and I would argue that it's clearer than OsWord.

Also see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:

Ahh, good catch.

What representation, if any, are named streams given on Windows?

I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?

See https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams. In short, files in Windows are in fact more like directories, containing multiple named "streams" of content (vaguely reminiscent of extended attributes in the POSIX world). The "default" stream is named $DATA. As far as I know this is a quite rarely used feature so we would probably be fine with ignoring it (as, I believe, filepath does).

Mistuke commented 3 years ago

Is / in pathSeparators on Windows? It seems the answer is "yes" but I don't believe that CreateFile would accept such a path.

Yes and I believe it does. The filepath package handles it the same way: https://hackage.haskell.org/package/filepath-1.4.2.1/docs/System-FilePath-Windows.html#v:pathSeparators

Also see https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew:

"The name of the file or device to be created or opened. You may use either forward slashes (/) or backslashes () in this name."

This is only true when using the APIs in "legacy" mode. In "legacy" mode the Win32 API calls will translate paths from their legacy form e.g. C:\Windows\foo into the form the driver actually understands \\?\C:\Windows\foo. One of the main differences is that the path is no longer pre-processed by the user-mode part of the Window API[0].

This results in accesses being faster, but also in an application not being restricted to legacy issues (e.g. MAX_PATH). But this also means various things become invalid, primarily that / in paths are no longer seen as path separators and that relative paths are invalid, or even just using . and .. in paths as the driver is not aware of the program's PWD so it cannot possibly resolve them.

Haskell programs use these filepaths internally, but at the moment we constantly have to translate between the "legacy" ones and the ones we use internally. I would expect a new API to hopefully remove this requirement.

Note that Windows has a number of these namespaces \\?\, \\.\, \\??\, \Device\ etc. Also drive letters are a user-mode concept, so a path such as \\?\Volume{4321eefsf-57fa-44sc-a5b4-d3bf28a54dft}\Windows\foo is perfectly valid.

A modern version of GHC will accept these and do the right thing, but filepath can't be used to maintain them so we do so in very low levels of the runtime.

[0] https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file

What representation, if any, are named streams given on Windows?

I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?

They're represented by using :<stream-name> after the file name. So foo.c:h1 and foo.c:h2 refer to the same file but different streams. Currently filepath doesn't handle them which causes failures when doing things such as

Prelude System.FilePath> takeExtension "hello.c:foo"
".c:foo"

which is invalid.

Mistuke commented 3 years ago

What representation, if any, are named streams given on Windows?

I'm not too familiar with those? Why are they relevant for filenames? Does the filepath package handle those somehow?

See https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams. In short, files in Windows are in fact more like directories, containing multiple named "streams" of content (vaguely reminiscent of extended attributes in the POSIX world). The "default" stream is named $DATA. As far as I know this is a quite rarely used feature so we would probably be fine with ignoring it (as, I believe, filepath does).

Yes, filepath largely ignores it, but does corrupt things when they're used. e.g takeExtension for instance is broken. They are actually used more often then one would think. For instance have you ever wondered how Windows knows when a file was downloaded from the internet with a Microsoft browser? It's data stored in a stream. Though I would agree that it's a pretty obscure feature :) I'd argue the package doesn't have to be able to manipulate it. just do things correct.

The blame for this feature though lies with Mac. Data streams were added to NTFS to support interop with Macs. Macs also support this concept in HFS. The neutral term for this is called File Forks. See https://en.wikipedia.org/wiki/Fork_(file_system)

phadej commented 3 years ago

I don't like the design where depending on the build target the API changes. I'd need modules for Windows and POSIX paths (like filepath has). Having a module which is either depending on the target system is fine and e.g. directory primitives would use it. It's common enough that a *nix program has to process windows file paths and vice versa, but doesn't do any IO with them.

In perfect world e.g. tarball/zipfile internal paths could be fitted into the same story, etc. though POSIX paths are close enough, maybe?

hasufell commented 3 years ago

@phadej

I'd need modules for Windows and POSIX paths

There is. Half of the library is dealing with exactly this concern:

hasufell commented 3 years ago

@Mistuke @bgamari

I would expect a new API to hopefully remove this requirement.

Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs

Mistuke commented 3 years ago

I don't like the design where depending on the build target the API changes. I'd need modules for Windows and POSIX paths (like filepath has). Having a module which is either depending on the target system is fine and e.g. directory primitives would use it. It's common enough that a *nix program has to process windows file paths and vice versa, but doesn't do any IO with them.

In perfect world e.g. tarball/zipfile internal paths could be fitted into the same story, etc. though POSIX paths are close enough, maybe?

But the reality is that this doesn't match the real world. At some point you have to have platform abstractions. Not doing so is how Windows builds ended up being so much slower than the rest.

Assuming a POSIX view of the world has caused so much trouble on non-POSIX systems that really I wish we wouldn't. The world is just not POSIX.

To me doing this honestly provides no extra benefits at all over using paths as strings. I have to heavily process them anyway for any use, so I might as well use a representation that's easily read from C then.

Mistuke commented 3 years ago

@Mistuke @bgamari

I would expect a new API to hopefully remove this requirement.

Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs

@hasufell I don't believe anything has to necessarily change from the API point of view (obviously ideally a path doesn't have to keep repeatedly testing what kind of path it is).

But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.

It also makes an assumption that badElements is invalid for all paths on Windows. Which is also not true. e.g. \\.\COM1 is perfectly valid and points to the first COM port.

It makes an attempt to distinguish between the different namespaces. As in, it recognizes \\?\ but then makeValid doesn't seem to.

So it's the inconsistencies that is problematic. What I'm asking for is that any path out of the functions are semantically and syntactically valid..

Mistuke commented 3 years ago

@hasufell so having read the proposal, (I believe that I've discussed this with @hvr before) I really like that the filepaths on Windows are represented as Utf16. This will save an immense amount of work..

The only thing I am wondering about is whether OsString should be a datatype so you can have multiple constructors or a different way to represent what kind of Path something is.

The rational is that today whenever a native Windows path is used I can't assume that each manipulation by filepaths keeps a valid path. Which means at the usage point I have to ensure valid.

But if I can assume that usages of afp keeps them valid that'll save a lot of processing. Of course I don't want that processing to move into afp (so that one has to check what kind of path something is on every command).

If not possible then fine, the Utf16 representation is already a major improvement!

hasufell commented 3 years ago

@Mistuke

@hasufell so having read the proposal, (I believe that I've discussed this with @hvr before) I really like that the filepaths on Windows are represented as Utf16. This will save an immense amount of work..

The only thing I am wondering about is whether OsString should be a datatype so you can have multiple constructors or a different way to represent what kind of Path something is.

I don't think so. OsString is an abstraction around the current platforms syscall encoding.

If you want different codepaths, you can use WindowsString and PosixString respectively.

If you write a low-level library based on AFPP, you simply ifdef around the platform and both codepaths use different inner cosntructors WS and PS, see here https://github.com/hasufell/abstract-filepath/blob/2f54c68cdff99d171178e2a15421f5c92b173a7a/lib/AFP/OsString/Common.hs#L116-L121

You can't mix them up if you're wrapping them inside OsString, it will be a compile error. That's the point.

The rational is that today whenever a native Windows path is used I can't assume that each manipulation by filepaths keeps a valid path. Which means at the usage point I have to ensure valid.

But if I can assume that usages of afp keeps them valid that'll save a lot of processing. Of course I don't want that processing to move into afp (so that one has to check what kind of path something is on every command).

If not possible then fine, the Utf16 representation is already a major improvement!

On unix, only NUL bytes are invalid. Everything else goes.

From my research, windows doesn't have a definite answer about what is a valid filepath and it may depend on filesystem, windows API or the phase of the moon. As such I don't think this library should try to enforce such properties. You should be free to pass junk to syscalls if you like so.

https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions looks like just conventions, but some invalid paths may still succeed, but then fail in other ways. This is up to user space library to figure out.

hasufell commented 3 years ago

@Mistuke

@Mistuke @bgamari

I would expect a new API to hopefully remove this requirement.

Well, I don't have a clear picture what needs to change. So I'd rely on pull requests. The filepath ported functions are all in https://github.com/hasufell/abstract-filepath/blob/master/lib/AFP/AbstractFilePath/Internal/Common.hs

@hasufell I don't believe anything has to necessarily change from the API point of view (obviously ideally a path doesn't have to keep repeatedly testing what kind of path it is).

But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.

It also makes an assumption that badElements is invalid for all paths on Windows. Which is also not true. e.g. \\.\COM1 is perfectly valid and points to the first COM port.

It makes an attempt to distinguish between the different namespaces. As in, it recognizes \\?\ but then makeValid doesn't seem to.

So it's the inconsistencies that is problematic. What I'm asking for is that any path out of the functions are semantically and syntactically valid..

You clearly know more about these things than me, so I'd appreciate a PR.

I took filepath simply as a base. I'm fine to deviate from it.

hasufell commented 3 years ago

I split two packages out:

hasufell commented 3 years ago

I've finished the Win32 migration (only the subset of the functions that had their API changed are different, the rest are re-exports): https://github.com/hasufell/abstract-filepath/commit/52567731c718de37275324fe306971e888245446

hasufell commented 3 years ago

CCing @bodigrim

I'll re-iterate and update the state of the efforts briefly.


Context

So far, the CLC has "ack'ed" these efforts, but I have no explicit consent from Win32 or unix package maintainers. These will be the first core packages that need to support AFPP (as additional API). The types need to be put somewhere as well, @snoyberg suggested primitive as a "not-quite-base".

I put temporary haddocks here: http://files.hasufell.de/AFPP/

Status

I believe the core implementation is done.

We now have these utility packages that I already put on hackage:

AFPP packages are pretty much done:

abstract-filepath-unix and abstract-filepath-Win32 are complementory packages, they don't replace unix or Win32.

I've started some example migrations of my own packages to AFPP:

How to move forward

Ultimately, at least abstract-filepath-unix and abstract-filepath-Win32 need to be moved into the respective packages, because maintaining these as subset forks will be time consuming. As said, here I don't have explicit consent, so I'm not sure how or if this can move forward.

After I finish migrating hpath, I will start conducting benchmarks. I'm not too familiar with that. I've put up a ticket https://github.com/hasufell/abstract-filepath/issues/5. This will likely require looking at https://well-typed.com/blog/2021/01/fragmentation-deeper-look/ and ghc-debug to make useful assessments about improvement in memory fragmentation.

hasufell commented 3 years ago

@Mistuke ...I'm currently taking a deeper look into windows paths

But as an example the file you linked makes an assumption that the only valid absolute filepath on Windows is one that starts with a drive letter. Which is not true.

Can you be more specific? Afais there are these types:

  1. C:\
  2. \\?\C:\
  3. \\.\C:\
  4. \\share
  5. \\?\UNC\share
  6. \\.\UNC\share
> fmap isAbsolute ["C:\\","\\\\?\\C:\\","\\\\.\\C:\\","\\\\share","\\\\?\\UNC\\share","\\\\.\\UNC\\share"]
[True,True,True,True,True,True]
> fmap isValid ["C:\\","\\\\?\\C:\\","\\\\.\\C:\\","\\\\share","\\\\?\\UNC\\share","\\\\.\\UNC\\share"]
[True,True,False,True,False,True]

So isValid chokes only on the 5th (and by accident returns True for the 6th and 3rd I think). Is that correct?

It also makes an assumption that badElements is invalid for all paths on Windows. Which is also not true. e.g. \\.\COM1 is perfectly valid and points to the first COM port. It makes an attempt to distinguish between the different namespaces. As in, it recognizes \\?\ but then makeValid doesn't seem to.

Ok, so I think isValid and makeValid should just be no-ops when the long-path prefix is discovered? Because windows API disables parsing for these and just passes the string unchanged to the syscall. Of course, that one may choke, but I think we can assume that \\?\ denotes "I know what I'm doing".

This also aligns with the discussion at https://github.com/haskell/filepath/issues/56

Bodigrim commented 2 years ago

I wonder what's the best way forward with shortbytestring. Integrate it into bytestring codebase? Make it an upstream dependency of bytestring? Leave it as a separate "utility" package in between of bytestring and filepath?

I'm exceedingly busy with text-2.0 at the moment, but hopefully will get time to ponder about it in January.

hasufell commented 2 years ago

I wonder what's the best way forward with shortbytestring. Integrate it into bytestring codebase? Make it an upstream dependency of bytestring? Leave it as a separate "utility" package in between of bytestring and filepath?

I think there's some code sharing in the test suite, where I had to cut out all things that aren't shortbytestring relevant. Not sure that's a good argument for merging them. It might make sense to have a split. I think small scope is good for PVP, so messing with shortbytestring (which is maybe more likely when it's started to be heavily used as a new dependency of filepath) doesn't impact the version of bytestring.