Open peter-b opened 3 years ago
Some relevant discussion can be found in the SG16 mailing list archives at https://lists.isocpp.org/sg16/2021/01/2005.php. Unfortunately, the list archives have predictably mojibaked the message, but the experimental results presented are still understandable.
Peter Brett and I recently discussed the following design to address command line options:
char or wchar_t based) argument as well as to transcoded char, wchar_t, char8_t, char16_t, and char32_t representations; similar to the std::filesystem::path string format observers.
std::filesystem::path would enable implementations to provide appropriately preserved paths portably; on Windows, the implementation could construct from the wchar_t-based command line while implementations for other OSs construct from the char-based command line.
Code using this might look like the following (which would itself benefit from being wrapped in a nicer command line argument handling facility).
#include <filesystem>
#include <program_arguments>
#include <ranges>
#include <string>
int main() {
    auto &&args = std::program_arguments(1); // Skip argument 0, keep 1...
    for (auto arg = std::ranges::begin(args);
         arg != std::ranges::end(args);
         ++arg)
    {
        // Arguments implicitly convert to std::string_view or a similar char-based range.
        // When only basic source characters are expected, comparison with execution character set is fine.
        if (*arg == "--file") {
            if (++arg != std::ranges::end(args)) {
                // Retrieve the filename operand converted to a path.
                std::filesystem::path filename = arg->path();
            }
        }
        else if (*arg == "--name") {
            if (++arg != std::ranges::end(args)) {
                // Retrieve the provided username converted from the command line encoding to UTF-8.
                std::u8string username = arg->u8string();
            }
        }
        else {
            ...
        }
    }
}
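As a rough illustration of the design above, the sketch below emulates a single argument object on top of what the standard library already offers. The names program_argument and make_program_arguments are hypothetical and are not part of any proposal text; the sketch assumes the arguments are supplied in the platform's native character type (char on POSIX, wchar_t on Windows), and it borrows std::filesystem::path for the UTF-8 conversion, whose behavior for ill-formed input is implementation-specific.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one element of the proposed argument range.
class program_argument {
public:
    using native_string = std::filesystem::path::string_type;

    explicit program_argument(native_string native) : native_(std::move(native)) {}

    // The argument exactly as it was handed to this object.
    const native_string& native() const { return native_; }

    // Path constructed directly from the native form.
    std::filesystem::path path() const { return std::filesystem::path(native_); }

    // UTF-8 form, obtained by reusing the path's own conversion.
    // The result is std::u8string in C++20 and std::string in C++17.
    auto u8string() const { return path().u8string(); }

private:
    native_string native_;
};

// Assumption: argv holds the native-character arguments of the process.
// Entries before index `skip` are dropped, mirroring std::program_arguments(1).
std::vector<program_argument> make_program_arguments(
    int argc, const std::filesystem::path::value_type* const* argv, int skip = 0) {
    std::vector<program_argument> args;
    for (int i = skip; i < argc; ++i) {
        args.emplace_back(program_argument::native_string(argv[i]));
    }
    return args;
}
```

On a POSIX system this can be fed the argc and argv of main() directly; on Windows it would need the wchar_t-based arguments instead.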
Is this also the suggested model for accessing other external input, such as environment variables or (maybe) registry-style settings? I understand the fundamental idea here is "ability to convert to path or u8string or UTF-16 or UTF-32; the implementation knows best what the lossless source encoding is and will transcode as necessary".
Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux, so I'd prefer a solution that uses argc and argv as-is. (Or is this lossy on Windows?)
> Is this also the suggested model for accessing other external input, such as environment variables or (maybe) registry-style settings?
I think it should work for environment variables as well, but I haven't given it as much thought yet. The situation on Windows is complicated. For a program using the Microsoft C run-time, there are (at least) three sets of environment variables:
1. The Win32 environment block, accessed via the GetEnvironmentStrings(), FreeEnvironmentStrings(), GetEnvironmentVariable(), and SetEnvironmentVariable() functions. Each of these functions comes in ANSI and Unicode variants. However, I think the ANSI versions of the first two use the OEM character set, while the ANSI versions of the latter two use the MBCS. The actual environment block may be in either ANSI or Unicode form; which one depends on whether the CreateProcess() invocation that created the process passed the CREATE_UNICODE_ENVIRONMENT flag.
2. The ANSI C run-time environment, accessed via the getenv() and _putenv() functions. It is initially constructed from the Win32 environment block for an ANSI (main()) program, or lazily copied from the Unicode C run-time environment when needed (at which point the run-time synchronizes updates between the two copies).
3. The Unicode C run-time environment, accessed via the _wgetenv() and _wputenv() functions. It is initially constructed from the Win32 environment block for a Unicode (wmain()) program, or lazily copied from the ANSI C run-time environment when needed (at which point the run-time synchronizes updates between the two copies).
It seems clear that the implementation knows what encoding to use for each of these blocks, so it can be at the implementor's discretion which one is used. Not losing data (due to encoding) would presumably require use of the Win32 environment block.
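A minimal sketch of reading one variable in its native form might look like the following. The names native_string and native_getenv are hypothetical; the Windows branch assumes the Unicode variant of GetEnvironmentVariable() is the right way to reach the Win32 environment block, and it does not handle a variable that changes between the two calls beyond giving up.

```cpp
#include <optional>
#include <string>

#ifdef _WIN32
#include <windows.h>
using native_string = std::wstring;  // wchar_t-based on Windows
#else
#include <cstdlib>
using native_string = std::string;   // char-based elsewhere
#endif

// Returns the value of `name` without transcoding, or std::nullopt if it is not set.
std::optional<native_string> native_getenv(const native_string& name) {
#ifdef _WIN32
    // First call asks for the required buffer size, including the terminator.
    const DWORD needed = GetEnvironmentVariableW(name.c_str(), nullptr, 0);
    if (needed == 0) return std::nullopt;
    native_string value(needed, L'\0');
    // Second call fills the buffer and reports the length without the terminator.
    const DWORD written = GetEnvironmentVariableW(name.c_str(), value.data(), needed);
    if (written >= needed) return std::nullopt;  // value grew between the two calls
    value.resize(written);
    return value;
#else
    const char* value = std::getenv(name.c_str());
    if (value == nullptr) return std::nullopt;
    return native_string(value);
#endif
}
```

A caller could then hand the result to std::filesystem::path, or transcode it, without first passing through the ANSI C run-time copy of the environment.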
I believe most, if not all, implementations (including Windows) permit data in the environment block that is not well-formed for any particular encoding.
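Because such data can reach the program, code that wants UTF-8 has to check rather than assume. The following is a small sketch of a well-formedness check; the function name is_well_formed_utf8 is hypothetical, and the rules applied are the usual ones (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `bytes` is a well-formed UTF-8 sequence.
bool is_well_formed_utf8(std::string_view bytes) {
    // Smallest code point that legitimately needs a sequence of the given length.
    constexpr unsigned long min_for_length[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80) { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte
        if (n - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (next & 0x3F);
        }
        if (cp < min_for_length[len]) return false;      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        i += len;
    }
    return true;
}
```

An interface like the one discussed in this thread would still need a policy for what to do when the check fails, for example exposing the raw bytes alongside any transcoded form.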
> Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux, so I'd prefer a solution that uses argc and argv as-is. (Or is this lossy on Windows?)
Not losing data on Windows would require using the wmain() entry point, so it is non-portable.
> Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux
I do recall that being cumbersome the last time I needed to access them. I believe it required reading /proc/<pid>/cmdline (which, of course, may not be mounted). On other systems, there was generally a global __argc and __argv or similarly named variable pair available.
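For reference, on Linux the arguments of the current process are exposed as NUL-terminated byte strings in /proc/self/cmdline. A sketch of reading them that way, with hypothetical helper names and assuming procfs is mounted, could be:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Split a buffer of NUL-terminated strings into separate arguments.
std::vector<std::string> split_nul_terminated(const std::string& buffer) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start < buffer.size()) {
        std::string::size_type end = buffer.find('\0', start);
        if (end == std::string::npos) end = buffer.size();
        parts.emplace_back(buffer, start, end - start);
        start = end + 1;  // step over the terminator
    }
    return parts;
}

// Returns the raw arguments of the current process, or an empty vector
// if /proc/self/cmdline cannot be opened (for example, procfs not mounted).
std::vector<std::string> read_proc_cmdline() {
    std::ifstream in("/proc/self/cmdline", std::ios::binary);
    if (!in) return {};
    const std::string buffer((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
    return split_nul_terminated(buffer);
}
```

The bytes are returned untouched, so nothing here says anything about their encoding; that question is the same as for argv.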
An idea I have had for a while: add
int main(int argc, char8_t* argv[]);
When this overload is selected:
- Arguments are UTF-8
- The locale is C.UTF-8
- The functions u8getenv and u8putenv are provided to handle the environment
- getenv/putenv are also UTF-8 but deal in char
When another overload is selected, u8getenv and u8putenv are UB.
"putenv" does not exist in either C or C++.
On Wed, Mar 24, 2021, 23:40 Jens Maurer @.***> wrote:
"putenv" does not exist in either C or C++.
Right, that's POSIX, sorry
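Currently, C++ requires the following forms of the main() function to be supported: int main() and int main(int argc, char* argv[]).
Implementations can also define other additional entry points. For example, some implementations permit main() to accept environment variables, e.g. int main(int argc, char* argv[], char* envp[]).
Some permit arguments to be accepted as wide characters, e.g. int wmain(int argc, wchar_t* argv[]) on Windows.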
Many applications make assumptions about the encoding of the contents of argv (and environ if available). These assumptions are very rarely portable between different deployments of a single C++ implementation, let alone across multiple implementations. On some implementations individual components of argv and environ variables may have different encodings; some may not even be text. On some implementations, using the contents of argv may be guaranteed to lose data, and implementation-specific library functions must be used to safely access arguments.
We should standardize portable ways to access data from outside the program via command-line arguments and environment variables, for example by:
main()