Open magol opened 7 years ago
There's F.25: Use a zstring or a not_null&lt;zstring&gt; to designate a C-style string
I'd say the guidance is there; perhaps it should be made more explicit, such as by filling out the rule placeholder SL.str: String
As for guidelines for the use of wchar_t/char32_t/etc, that's a whole other discussion
I'm not sure if I understood it quite right. By owned, do you mean that the called function is responsible for the parameter?
By C-style code, do you mean that the function is calling C API, or that the data in the argument is coming from C API?
What is the best way to handle interoperability with code that uses CString?
Do the same rules apply to returned strings? Do the same rules apply to in and out parameters?
| | Owned | Not owned (C API) | Not owned (not C API) |
|---|---|---|---|
| In | `std::string` | `gsl::zstring` / `gsl::czstring` | `std::string_view` / `gsl::string_span` |
| Out | | | |
| In/Out | | | |
That is a lot of information missing from SL.str: String :-)
In the case of `wchar_t*`: `std::wstring` is a collection of `wchar_t`, as far as I know, so if you want a `wchar_t*` you can always call `.c_str()` on a `std::wstring`. That is as far as I am aware. There is also the difference between a `char*` (a pointer to a character) and a `wchar_t*` (a pointer to a wide character). The difference between `char` and `wchar_t` is that `wchar_t` is larger than `char`, and wide characters are required when writing in languages like Chinese, where all of the letters (or symbols) are wider than English letters. `wchar_t` can also be used for things in Unicode. There is also a way to convert the narrow characters to wide ones for `wstring`. Here is an example:
```cpp
#include <string>
#include <iostream>

int main() {
    // Error: a narrow (char) string literal cannot initialize std::wstring.
    std::wstring data = "This is a wide string.\n";
    // Note: with wide strings you need std::wcout instead of std::cout
    std::wcout << data;
    return 0;
}
```
The example above has an issue on top of not being able to compile: `std::wstring` does not like `char`s, nor does it accept them. If it were able to compile, you would notice that the letters are all so narrow that they would not output as English characters, or, if you cast them to `wchar_t`, the wide string `data` would be entirely empty. To get around that, prefix the literal with an `L` instead of casting the `char`s to `wchar_t` (note the uppercase L):
```cpp
#include <string>
#include <iostream>

int main() {
    std::wstring data = L"This is a wide string.\n";
    // Note: with wide strings you need std::wcout instead of std::cout
    std::wcout << data;
    return 0;
}
```
And now it should work. I hope this explains not only the difference between `wchar_t` and `char` but also the difference between `std::string` and `std::wstring`. I know many who fall for this, thinking they are the same, when one is actually larger than the other. And yes, there is a time and a place to use wide strings, but as you can see, you can make normal characters that are not wide into wide ones with an uppercase L in front of the string. I am not entirely sure whether, if someone were to input English characters that are not wide into `std::wcin`, it would make those English characters wide in a way that they don't look Chinese or Japanese.
If I understand correctly, you're asking for compilers to accept invalid code and magically transform an array of `char` to an array of `wchar_t`, which is not possible in general (because it would require assumptions about character sets and encodings). In any case, this is not a "guideline" that can be recommended as something for C++ programmers to follow.
Oops, sorry, I closed this but meant to just add a reply to the previous comment. Reopening.
Ok, updated the comment.
> And Wide characters are required when writing in languages like Chinese where all of their letters (or symbols) are wider than english letters.

Not really; if you're on an OS that supports Unicode, such as Linux, this works just as well:
```cpp
#include <string>
#include <iostream>

int main() {
    std::string data = "This is not a wide string, but it says 很高兴认识你.\n";
    std::cout << data;
}
```
live demo http://melpon.org/wandbox/permlink/P0LUKyLzTs1xPyKu
.. but getting into that discussion would derail this thread thoroughly.
@cubbimrw: Actually, this is not a question of Unicode support in general, but UTF-8 encoding in particular.
@cubbimew Not all systems support Unicode (like you said), and Windows by default is set to a code page on which not all characters are supported; Japanese and Chinese characters are among those not covered by the default code page Windows sets up (unless you reset it). Some systems, however, have the locale set to UTF-8. The point being, not all programs are coded to automatically translate all their text based on what the code page or locale is set to, and sometimes that is just not practical for things like small console applications. Windows does support Unicode, but only if you set the code page to be able to support UTF-8 (if they even put in a way to explicitly set it to UTF-8).
Please stop mixing up Unicode and UTF-8. Unicode is a standard that uniquely maps all (?) known characters (actually code points) to a number. UTF-8 (usually using `char` as the data type for individual code units) is one possible encoding of that; UTF-16 (using `char16_t`, or `wchar_t` on Windows) is a different one, but both examples most likely use Unicode.
The problem is that, AFAIK, Linux by default assumes that a `char*` points to a UTF-8 encoded string, whereas Windows assumes by default that it is some single-byte encoding like Latin-1, which can only encode a small subset. `wchar_t` is AFAIK always assumed to be a Unicode code unit on both platforms, but it has different sizes and encodings (2-byte, UTF-16 encoding on Windows; 4-byte, UTF-32 encoding on Linux).
> Please stop mixing up unicode and UTF-8.
I did not use a `u8` string literal on purpose. That example would work as expected on any system that supports Unicode, regardless of what transformation format it chose for the narrow multibyte encoding: UTF-8, GB18030, SCSU, whatever. That said, UTF-8 has been part of Unicode for over 20 years.
> 2 byte, utf-16 encoding on windows
It's UCS-2 on Windows, obsolete as of 1996: `L'\U0001F4A9'` is a `wchar_t` on Windows with a meaningless value, and you can't read that from a UTF-8 file with `std::codecvt_utf8`. Yes, some (all?) WinAPIs treat `wchar_t`s as UTF-16 code units, but the language and the standard library do not (although you can trick stdout/cout into treating it that way with a non-standard API call). I'm not even mentioning the lack of any Unicode locales in the CRT.
`char32_t` could have saved the day, but LEWG voted against its use in iostreams, regexes, etc. in 2006, in anticipation of a real Unicode library. Eleven years later.. here's hoping for C++20.
Bjarne, per our meeting, please write this up during a minute slice of your infinite spare time.
I have made a start on the ASCII-string part of this
@BjarneStroustrup Great! If I could have one wish, I would like a list of all the common string types in the wild and a recommendation for what to do with each of them. In the program I work on, there is a lot of MFC and Windows API, and I want the integration with that code to work even as I write modern code.
I see that the ASCII part is there now, but what about the Unicode part? What should I do with code that uses a lot of CString? How should I approach the modernization?
I can not understand from the guide when I should use `char*` and when I should use `std::string` in parameters and return values. In most of the examples you use `char*`, but is that not a little too imprecise? What about the difference between `char*` and `wchar_t*`?