sg16-unicode / sg16

SG16 overview and general information

Improve portable ingestion of command-line arguments #66

Open peter-b opened 3 years ago

peter-b commented 3 years ago

Currently, C++ requires the following forms of the main() function to be supported:

int main() { /* body */ }

int main(int argc, char* argv[]) { /* body */ }

Implementations can also define additional entry points. For example, some implementations permit main() to accept environment variables:

int main(int argc, char* argv[], char* environ[]);

Some permit arguments to be accepted as wide characters; examples:

int main(int argc, wchar_t* argv[]);
int wmain(int argc, wchar_t* argv[]);

Many applications make assumptions about the encoding of the contents of argv (and environ if available). These assumptions are very rarely portable between different deployments of a single C++ implementation, let alone across multiple implementations. On some implementations, individual components of argv and individual environment variables may have different encodings; some may not even be text. On some implementations, using the contents of argv may be guaranteed to lose data, and implementation-specific library functions must be used to safely access arguments.
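To make the encoding problem concrete: on POSIX-like systems each argv entry is just a NUL-terminated byte string, so a program that assumes UTF-8 can at least detect when that assumption fails. The following is a minimal sketch of such a check, not part of any proposal in this thread; the function name is_well_formed_utf8 is made up for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
// Rejects overlong forms, surrogates, and code points above U+10FFFF.
bool is_well_formed_utf8(std::string_view bytes) {
  auto byte_at = [&](std::size_t pos) {
    return static_cast<unsigned char>(bytes[pos]);
  };
  std::size_t i = 0;
  while (i < bytes.size()) {
    const unsigned char b0 = byte_at(i);
    if (b0 <= 0x7F) { ++i; continue; }       // ASCII
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;      // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
    else if (b0 == 0xE0) { len = 3; lo = 0xA0; }                  // no overlongs
    else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; if (b0 == 0xED) hi = 0x9F; }  // no surrogates
    else if (b0 == 0xF0) { len = 4; lo = 0x90; }                  // no overlongs
    else if (b0 >= 0xF1 && b0 <= 0xF4) { len = 4; if (b0 == 0xF4) hi = 0x8F; }  // <= U+10FFFF
    else return false;                       // stray continuation or invalid lead byte
    if (bytes.size() - i < len) return false;  // truncated sequence
    if (byte_at(i + 1) < lo || byte_at(i + 1) > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if (byte_at(i + k) < 0x80 || byte_at(i + k) > 0xBF) return false;
    }
    i += len;
  }
  return true;
}
```

A program could call this on each argv[i] before treating it as UTF-8. For instance, "café" typed in a Latin-1 terminal arrives as the bytes 63 61 66 E9, which this check rejects. Note that passing the check does not prove the argument was intended as UTF-8; it only rules out the byte sequences that cannot be.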

We should standardize portable ways to access data from outside the program via command-line arguments and environment variables, for example by:

tahonermann commented 3 years ago

Some relevant discussion can be found in the SG16 mailing list archives at https://lists.isocpp.org/sg16/2021/01/2005.php. Unfortunately, the list archives have predictably mojibaked the message, but the experimental results presented are still understandable.

In recent discussion, Peter Brett and I discussed the following design to address command line options:

Code using this might look like the following (which would itself benefit from being wrapped in a nicer command line argument handling facility).

#include <filesystem>
#include <program_arguments>
#include <ranges>
#include <string>

int main() {
  auto &&args = std::program_arguments(1);   // Skip argument 0, keep 1...
  for (auto arg = std::ranges::begin(args);
       arg != std::ranges::end(args);
       ++arg)
  {
    // arguments implicitly convert to std::string_view or a similar char-based range.
    // When only basic source characters are expected, comparison with execution character set is fine.
    if (*arg == "--file") {
      if (++arg != std::ranges::end(args)) {
        // Retrieve the filename operand converted to a path.
        std::filesystem::path filename = arg->path();
      }
    }
    else if (*arg == "--name") {
      if (++arg != std::ranges::end(args)) {
        // Retrieve the provided username converted from the command line encoding to UTF-8.
        std::u8string username = arg->u8string();
      }
    }
    else {
      ...
    }
  }
}

jensmaurer commented 3 years ago

Is this also the suggested model for accessing other external input, such as environment variables or (maybe) registry-style settings? I understand the fundamental idea here is "ability to convert to path or u8string or UTF-16 or UTF-32; the implementation knows best what the lossless source encoding is and will transcode as necessary".

Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux, so I'd prefer a solution that uses argc and argv as-is. (Or is this lossy on Windows?)

tahonermann commented 3 years ago

> Is this also the suggested model for accessing other external input, such as environment variables or (maybe) registry-style settings?

I think it should work for environment variables as well, but I haven't given it as much thought yet. The situation on Windows is complicated. For a program using the Microsoft C run-time, there are (at least) three sets of environment variables:

It seems clear that the implementation knows what encoding to use for each of these blocks, so it can be implementor's discretion which is used. Not losing data (due to encoding) would presumably require use of the Win32 environment block.

I believe most, if not all implementations (including Windows) permit data in the environment block that is not well-formed for any particular encoding.

> Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux, so I'd prefer a solution that uses argc and argv as-is. (Or is this lossy on Windows?)

Not losing data on Windows would require using the wmain() entry point, which is non-portable.

tahonermann commented 3 years ago

> Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux

I do recall that being cumbersome the last time I needed to access them. I believe it required reading /proc/<pid>/cmdline (which, of course, may not be mounted). On other systems, there was generally a global __argc and __argv or similarly named variable pair available.
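For reference, the Linux approach can be sketched as follows. The kernel exposes a process's arguments as NUL-separated bytes in /proc/self/cmdline (the same file as /proc/<pid>/cmdline for the current process). The function names below are made up for illustration, and the sketch simply returns an empty list when procfs is not mounted or not readable.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits a buffer in which every argument is
// terminated by a NUL byte. Empty arguments are preserved.
std::vector<std::string> split_nul_separated(std::string_view raw) {
  std::vector<std::string> args;
  std::size_t start = 0;
  while (start < raw.size()) {
    std::size_t end = raw.find('\0', start);
    if (end == std::string_view::npos) end = raw.size();  // tolerate a missing final NUL
    args.emplace_back(raw.substr(start, end - start));
    start = end + 1;
  }
  return args;
}

// Hypothetical helper: reads this process's arguments without argc/argv.
// Linux-specific; returns an empty vector if the file cannot be opened.
std::vector<std::string> read_proc_self_cmdline() {
  std::ifstream in("/proc/self/cmdline", std::ios::binary);
  if (!in) return {};
  const std::string raw((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
  return split_nul_separated(raw);
}
```

This illustrates why the global-state route is awkward: it depends on a mounted procfs, does file I/O before the program has done anything else, and still yields raw bytes with no encoding information attached.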

cor3ntin commented 3 years ago

An idea I have had for a while: add

int main(int argc, char8_t* argv[]);

When this overload is selected

When another overload is selected, calling u8getenv or u8putenv is UB.

jensmaurer commented 3 years ago

"putenv" does not exist in either C or C++.

cor3ntin commented 3 years ago

On Wed, Mar 24, 2021, 23:40 Jens Maurer @.***> wrote:

> "putenv" does not exist in either C or C++.

Right, that's POSIX, sorry
