Open magol opened 7 years ago
There's F.25: Use a zstring or a not_null&lt;zstring&gt; to designate a C-style string
I'd say the guidance is there; perhaps it should be made more explicit, such as by filling out the rule placeholder SL.str: String
As for guidelines for the use of wchar_t/char32_t/etc, that's a whole other discussion
I'm not sure if I understood it quite right. By owned, do you mean that the called function is responsible for the parameter?
By C-style code, do you mean that the function is calling C API, or that the data in the argument is coming from C API?
What is the best way to handle interoperability with code that uses CString?
Do the same rules apply to returned strings? Do the same rules apply to in and out parameters?
| | Owned | Not owned (C API) | Not owned (not C API) |
|---|---|---|---|
| In | `std::string` | `gsl::zstring` / `gsl::czstring` | `std::string_view` / `gsl::string_span` |
| Out | | | |
| In/Out | | | |
That is a lot of information missing from SL.str: String :-)
In the case of `wchar_t*`: `std::wstring` is a collection of `wchar_t`, as far as I know, so if you want a `wchar_t*` you can always call `.c_str()` on a `std::wstring`. That is as far as I am aware. There is also the difference between a `char*` (a pointer to a character) and a `wchar_t*` (a pointer to a wide character). The difference between `char` and `wchar_t` is that `wchar_t` is larger than `char`, and wide characters are required when writing in languages like Chinese, where all of the letters (or symbols) are wider than English letters. `wchar_t` can also be used for things in Unicode. There is also a way to convert the narrow characters to wide ones for `wstring`. Here is an example:
```cpp
#include <string>
#include <iostream>

int main() {
    // Error: a narrow (char) string literal cannot initialize std::wstring.
    std::wstring data = "This is a wide string.\n";
    // Note: with wide strings you need std::wcout instead of std::cout
    std::wcout << data;
    return 0;
}
```
The example above has an issue on top of not being able to compile: `std::wstring` does not like `char`s, nor does it accept them. If it were able to compile, you would notice that the letters are all so narrow that they would not output as English characters, or, if you cast them to `wchar_t`, the wide string `data` would be entirely empty. To get around that, prefix the literal with an `L` instead of casting the `char`s to `wchar_t` (note the uppercase L):
```cpp
#include <string>
#include <iostream>

int main() {
    std::wstring data = L"This is a wide string.\n";
    // Note: with wide strings you need std::wcout instead of std::cout
    std::wcout << data;
    return 0;
}
```
And now it should work. I hope this explains not only the difference between `wchar_t` and `char` but also the difference between `std::string` and `std::wstring`. I know many who fall for this, thinking they are the same, when one is actually larger than the other. And yes, there is a time and a place to use wide strings, but as you can see, you can make normal characters that are not wide into wide ones with an uppercase L in front of the string. I am not entirely sure whether, if someone were to input English characters that are not wide into `std::wcin`, it would make those English characters wide in a way that they don't look Chinese or Japanese.
If I understand correctly, you're asking for compilers to accept invalid code and magically transform an array of `char` to an array of `wchar_t`, which is not possible in general (because it would require assumptions about character sets and encodings). In any case, this is not a "guideline" that can be recommended as something for C++ programmers to follow.
Oops, sorry, I closed this but meant to just add a reply to the previous comment. Reopening.
Ok, updated the comment.
> And Wide characters are required when writing in languages like Chinese where all of their letters (or symbols) are wider than english letters.

Not really; if you're on an OS that supports Unicode, such as Linux, this works just as well:
```cpp
#include <string>
#include <iostream>

int main() {
    std::string data = "This is not a wide string, but it says 很高兴认识你.\n";
    std::cout << data;
}
```
live demo http://melpon.org/wandbox/permlink/P0LUKyLzTs1xPyKu
.. but getting into that discussion would derail this thread thoroughly.
@cubbimrw: Actually, this is not a question of Unicode support in general, but UTF-8 encoding in particular.
@cubbimew Not all systems support Unicode (like you said), and Windows by default is set to a code page on which not all characters are supported; Japanese and Chinese characters are among those not covered by the default code page Windows sets up (unless you reset it). Some systems, however, have the locale set to UTF-8. The point being, not all programs are coded to automatically translate all their text based on what the code page or locale is set to, and sometimes that is just not practical for things like small console applications. Windows does support Unicode, but only if you set the code page to be able to support UTF-8 (if they even put in a way to explicitly set it to UTF-8).
Please stop mixing up Unicode and UTF-8. Unicode is a standard that uniquely maps all (?) known characters (actually code points) to a number. UTF-8 (usually using `char` as the data type for individual code units) is one possible encoding of that; UTF-16 (using `char16_t`, or `wchar_t` on Windows) is a different one, but both examples most likely use Unicode.
The problem is that, AFAIK, Linux by default assumes that a `char*` points to a UTF-8 encoded string, whereas Windows assumes by default that it is some single-byte encoding like Latin-1, which can only encode a small subset. `wchar_t` is AFAIK always assumed to be a Unicode code unit on both platforms, but it has different sizes and encodings (2-byte, UTF-16 encoding on Windows; 4-byte, UTF-32 encoding on Linux).
> Please stop mixing up unicode and UTF-8.
I did not use a `u8` string literal on purpose. That example would work as expected on any system that supports Unicode, regardless of what transformation format it chose for the narrow multibyte encoding: UTF-8, GB18030, SCSU, whatever. That said, UTF-8 has been part of Unicode for over 20 years.
> 2 byte, utf-16 encoding on windows
It's UCS-2 on Windows, obsolete as of 1996: `L'\U0001F4A9'` is a `wchar_t` on Windows with a meaningless value, and you can't read that from a UTF-8 file with `std::codecvt_utf8`. Yes, some (all?) WinAPIs treat `wchar_t`s as UTF-16 code units, but the language and the standard library do not (although you can trick stdout/cout into treating it that way with a non-standard API call). I'm not even mentioning the lack of any Unicode locales in the CRT.
`char32_t` could have saved the day, but LEWG voted against its use in iostreams, regexes, etc. in 2006, in anticipation of a real Unicode library. Eleven years later.. here's hoping for C++20.
Bjarne, per our meeting, please write this up during a minute slice of your infinite spare time.
I have made a start on the ASCII-string part of this
@BjarneStroustrup Great! If I could have one wish, I would like a list of all the common string types in the wild and a recommendation for what to do with each of them. In the program I work on, there is a lot of MFC and Windows API, and I want the integration with that code to work even as I write modern code.
I see that the ASCII part is there now, but what about the Unicode part? What should I do with code that uses a lot of CString? How should I approach the modernization?
I can not understand from the guide when I should use `char*` and when I should use `std::string` in parameters and return values. In most of the examples you use `char*`, but is that not a little too imprecise? What about the difference between `char*` and `wchar_t*`?