Closed jmillikin closed 4 months ago
Proposed implementation: https://github.com/jmillikin/upstream__rust/commit/c0ea2a4c4fb5c60908d580bafcc61466fff1b99f
The same for &[u8] would be useful too.
https://github.com/rust-lang/rust/pull/95290 should be relevant. Once the underlying bytes are exposed this could be implemented by user code.
PR rust-lang/rust#95290 (exposing raw WTF-8 bytes on Windows) isn't appealing to me, and even if such an API would be created I would want to have to_str_split() so that I didn't have to implement WTF-8 decoding in my own binaries.
Gentle ping -- I'm still interested in this, is there any interest from the libs-api team regarding OsStr unicode prefix splitting?
(exposing raw WTF-8 bytes on Windows) isn't appealing to me, and even if such an API would be created I would want to have to_str_split() so that I didn't have to implement WTF-8 decoding in my own binaries.
Once you get the raw bytes you can use regular string APIs such as str::from_utf8, which gives you either the full &str or an error that carries the necessary information to construct the rest. We have discussed adding some convenience methods to the error to make this kind of stuff simpler.
That would be more general because it isn't limited to OsStr specifically and can be used for processing other [u8] sources too.
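For illustration, here is a minimal sketch of that approach on a plain byte slice. The helper name `split_valid_utf8` is hypothetical (it is not an existing or proposed std API), and the sketch assumes the raw bytes are already available, which is exactly the part that is not portable for OsStr today.

```rust
use std::str;

/// Hypothetical helper: splits `bytes` into its longest valid UTF-8 prefix
/// and the remaining bytes, starting at the first invalid sequence.
fn split_valid_utf8(bytes: &[u8]) -> (&str, &[u8]) {
    match str::from_utf8(bytes) {
        // The whole input is valid UTF-8, so the remainder is empty.
        Ok(s) => (s, &bytes[bytes.len()..]),
        Err(err) => {
            // `valid_up_to()` is the length of the longest valid prefix.
            let (valid, rest) = bytes.split_at(err.valid_up_to());
            // Re-validating the prefix cannot fail; it just avoids `unsafe`.
            let prefix = str::from_utf8(valid).expect("prefix validated by valid_up_to");
            (prefix, rest)
        }
    }
}
```

The second `from_utf8` call re-scans the prefix; `str::from_utf8_unchecked` would avoid that at the cost of an `unsafe` block.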
I might be misunderstanding, but I don't think that would solve the problem because (to the best of my knowledge) there is no way to convert a &[u8] to &OsStr on Windows. It would also require me to have #[cfg(os)] lines in my code to handle the different internal representations (UTF-8 vs WTF-8), which is a portability hazard.
In other words, how would you implement fn to_str_split(s: &OsStr) -> (&str, &OsStr) on top of the API you propose? Again, I might not be understanding you properly, but I can't think of any way to implement such an API on Windows.
Not only that. Even if I could convert to and from &OsStr easily, if a user passes --file=/path/to/file<bad-utf>name the method will give me ("--file=/path/to/file", "<bad-utf>name"). I can check that the argument starts with --file, but that still leaves me with ("/path/to/file", "<bad-utf>name"), which I cannot concatenate without unsafe shenanigans or memory allocation.
@mina86 For trimming prefixes and suffixes from an &OsStr, I think you'd want something more like the Pattern methods being tracked in https://github.com/rust-lang/rust/issues/49802. I'm not aware of anyone currently working on that functionality.
This ACP is tracking a different request, which is the ability to obtain the valid Unicode portion of an &OsStr.
This ACP is tracking a different request, which is the ability to obtain the valid Unicode portion of an &OsStr.
The motivation for this ACP is argument parsing. I don’t see how to_str_split is a good solution for that. For example, assuming to_str_split existed in the form described here, could you demonstrate implementation of opt_starts_with?
opt_starts_with is just an example of an API that's not possible to implement without being able to obtain the Unicode prefix. The actual code I'm writing is a bit more complex (it's a port of https://github.com/jmillikin/haskell-options/blob/master/lib/Options/Tokenize.hs).
The output type of the tokenizer step function looks approximately like this:
    pub struct Arg<'a>(
        #[cfg(all(target_family = "unix", not(feature = "std")))]
        &'a [u8],
        #[cfg(all(target_family = "windows", not(feature = "std")))]
        &'a str,
        #[cfg(feature = "std")]
        Cow<'a, OsStr>,
    );

    pub enum Token<'a, FlagId> {
        Arg(Arg<'a>),
        Flag(FlagName<'a>, FlagId, Arg<'a>),
        FlagUnary(FlagName<'a>, FlagId),
        FlagHelp(Option<&'a str>),
    }
To separate a single OsStr into unary and non-unary tokens requires consuming it as a sequence of Unicode characters until encountering either a = (long flag case) or a non-unary short flag. The set of possible flag names is maintained in a separate data structure.
When parsing an OsStr arg, if it contains only valid Unicode then it's returned in a Cow::Borrowed. If there's a non-Unicode suffix then I allocate an OsString, push the two fragments into it, and return it in a Cow::Owned.
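A rough sketch of that last step, assuming the Unicode part of the value and the non-Unicode remainder have already been obtained from something like to_str_split(). The function name `join_value` and its parameters are made up for illustration and are not taken from the actual tokenizer.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: rebuilds a flag value from its Unicode part and the
/// non-Unicode remainder, allocating only when a remainder exists.
fn join_value<'a>(unicode_part: &'a str, non_unicode_suffix: &OsStr) -> Cow<'a, OsStr> {
    if non_unicode_suffix.is_empty() {
        // Fully Unicode value: borrow it, no allocation needed.
        Cow::Borrowed(OsStr::new(unicode_part))
    } else {
        // The two fragments must be concatenated, which requires an
        // owned buffer.
        let mut owned =
            OsString::with_capacity(unicode_part.len() + non_unicode_suffix.len());
        owned.push(unicode_part);
        owned.push(non_unicode_suffix);
        Cow::Owned(owned)
    }
}
```

The allocation only happens for arguments that actually contain non-Unicode content, so the common case stays zero-copy.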
FYI: I created a PR for the proposed implementation branch: https://github.com/rust-lang/rust/pull/111059
I'm not sure what the exact ordering is of ACP <-> unstable PR, but maybe having the implementation code available for review will help when reading the ACP.
Proposal
Add a method to std::ffi::OsStr that splits it into (&str /* prefix */, &OsStr /* suffix */), where the prefix contains valid Unicode and the suffix is the portion of the original input that is not valid Unicode.
Problem statement
The OsStr type is designed to represent a platform-specific string, which might contain non-Unicode content. It has a to_str(&self) -> Option<&str> method to check the string is valid Unicode, but this method operates only on the entire string. It's not currently possible to check that a portion of the string is valid Unicode in an OS-independent way.
This proposal would add a method that lets an OsStr be split into a prefix of valid Unicode, and a suffix of remaining non-Unicode content in the platform encoding.
Motivation, use-cases
Command-line options (long flags)
One of the common use cases for OsStr is parsing command-line options, a possible format of which has a prefix (the option name) and a suffix (the option value):
Unix syntax:
Windows syntax:
When parsing CLI options, the user wants to match option names against values provided by the program, but preserve option values as they are for use with OS APIs.
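As a rough, Unix-only sketch of such a check: the helper below matches a long flag name against the raw bytes of the argument and returns the value untouched. The name `long_flag_value` and the `--name=value` shape are assumptions made for this example; the helper is not part of the proposal.

```rust
use std::ffi::OsStr;
#[cfg(unix)]
use std::os::unix::ffi::OsStrExt;

/// Hypothetical helper: if `arg` has the form `--<name>=<value>`, returns
/// `<value>` as an `OsStr` without requiring it to be valid Unicode.
#[cfg(unix)]
fn long_flag_value<'a>(arg: &'a OsStr, name: &str) -> Option<&'a OsStr> {
    let rest = arg
        .as_bytes()
        .strip_prefix(b"--")?
        .strip_prefix(name.as_bytes())?
        .strip_prefix(b"=")?;
    // On Unix an OsStr is arbitrary bytes, so this conversion is free and
    // preserves any non-UTF-8 content in the value.
    Some(OsStr::from_bytes(rest))
}
```

Both the byte view and the conversion back to OsStr come from std::os::unix::ffi::OsStrExt, so nothing here carries over to Windows.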
This function is easy to implement on Unix because std::os::unix::ffi::OsStrExt provides free conversion to and from &[u8], which can be compared with the UTF-8 bytes of the flag name.
However, on Windows, it's basically impossible to implement in safe Rust -- the Windows variant of OsStrExt provides Iterator<Item = u16>, and has no mechanism for constructing a non-Unicode OsStr at all.
Command-line options (short flags)
A less ubiquitous but still commonly used format for "short options" on Unix systems allows multiple flags to be put into one option:
Supporting this requires being able to obtain a str with prefix "-xyz". The args library then uses flag definitions to tokenize it into one of ["-x", "-y", "-z"], ["-x", "-y", "z"], or ["-x", "yz"].
Solution sketches
Add a method OsStr::to_str_split() (or whatever name folks prefer) that returns the valid Unicode prefix and the non-Unicode remainder.
Rules for the new function:
Note that the calling code would be responsible for handling inputs where the flag value itself is partial Unicode, for example on Unix all absolute paths start with the ASCII character '/'.
Links and related work
The general topic of examining OsStr for prefixes comes up often. A selection of related issues/PRs:
There is an open issue for supporting the Pattern API on OsStr, but (1) it's a significantly larger amount of implementation work, and (2) doesn't appear to allow extracting the prefix without unwrap.
What happens now?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.