simdutf / simdutf

Unicode routines (UTF8, UTF16, UTF32) and Base64: billions of characters per second using SSE2, AVX2, NEON, AVX-512, RISC-V Vector Extension. Part of Node.js and Bun.
https://simdutf.github.io/simdutf/
Apache License 2.0

Add high-level C++17/C++20 conversion functions #144

Open lemire opened 1 year ago

lemire commented 1 year ago

Since C++11, the standard library has offered specialized string classes such as std::u16string and std::u32string alongside std::string. C++17 added the matching views (std::string_view, std::u16string_view, std::u32string_view), and C++20 added std::u8string and std::u8string_view.

We could also accept std::string and assume that it holds UTF-8, even though a std::string may hold any encoding. That is fine as long as we are explicit about the UTF-8 assumption.

What we could do is to provide conversion functions. That might be helpful to some...?

The objective would be to improve quality of life for users who prefer not to mess with pointers.

#include <cstdio>
#include <string>
#include <string_view>

#ifndef SIMDUTF_CPLUSPLUS
#if defined(_MSVC_LANG) && !defined(__clang__)
#define SIMDUTF_CPLUSPLUS (_MSC_VER == 1900 ? 201103L : _MSVC_LANG)
#else
#define SIMDUTF_CPLUSPLUS __cplusplus
#endif
#endif

#if (SIMDUTF_CPLUSPLUS >= 202002L)
#define SIMDUTF_CPLUSPLUS20 1
#endif

#if (SIMDUTF_CPLUSPLUS >= 201703L)
#define SIMDUTF_CPLUSPLUS17 1
#endif

#if SIMDUTF_CPLUSPLUS17

// All bodies below are placeholders; only the signatures matter here.

// A plain std::string_view is assumed to hold UTF-8.
inline std::u32string to_u32string(const std::string_view in) {
  return U"bogus code";
}

inline std::u32string to_u32string(const std::u16string_view in) {
  return U"bogus code";
}

#if SIMDUTF_CPLUSPLUS20
// std::u8string_view only exists from C++20 on.
inline std::u32string to_u32string(const std::u8string_view in) {
  return U"bogus code";
}
#endif

inline std::u16string to_u16string(const std::u16string_view in) {
  return u"bogus code";
}

inline std::u16string to_u16string(const std::u32string_view in) {
  return u"bogus code";
}

int main() {
  printf("Support for C++17.\n");
  std::string mystring("hello"); // could be any encoding?
#if SIMDUTF_CPLUSPLUS20
  std::u8string mystringu8(u8"hello");
#endif
  std::u16string mystringu16(u"hello");
  std::u32string mystringu32(U"hello");
#if SIMDUTF_CPLUSPLUS20
  std::u32string mystringu8_as32 = to_u32string(mystringu8);
#endif
  // Resolves to the std::string_view overload (UTF-8 assumed).
  std::u32string mystring_as32 = to_u32string(mystring);
}

#else
int main() { printf("No support for C++17.\n"); }
#endif
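
As a rough illustration of what could replace one of the placeholder bodies above, here is a minimal scalar sketch of the UTF-16 to UTF-32 case. It is only a stand-in: a real implementation would presumably delegate to the library's existing pointer-based conversion routines rather than loop by hand. The name `to_u32string_checked` and the use of `std::optional` to report invalid input are assumptions made for this sketch, not something proposed in the issue. It requires C++17 and assumes native-endian UTF-16.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper, not simdutf API: converts native-endian UTF-16 to
// UTF-32 and returns std::nullopt if an unpaired surrogate is found.
inline std::optional<std::u32string>
to_u32string_checked(std::u16string_view in) {
  std::u32string out;
  out.reserve(in.size()); // UTF-32 never needs more units than UTF-16
  for (std::size_t i = 0; i < in.size(); ++i) {
    const char32_t unit = in[i];
    if (unit < 0xD800 || unit > 0xDFFF) {
      out.push_back(unit); // not a surrogate: the unit is the code point
    } else if (unit <= 0xDBFF && i + 1 < in.size() &&
               in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      // High surrogate followed by a low surrogate: combine the pair.
      const char32_t low = in[++i];
      out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
    } else {
      return std::nullopt; // lone or misordered surrogate
    }
  }
  return out;
}
```

Returning an optional is one way to surface invalid input without pointers or exceptions; returning an empty string or throwing would be alternatives, and the issue does not settle that choice.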

References:

https://en.cppreference.com/w/cpp/string/basic_string_view
https://en.cppreference.com/w/cpp/string/basic_string

lemire commented 1 year ago

cc @NicolasJiaxin

amosnier commented 1 month ago

I'm guessing we also want to provide a std::ranges-based API with lazy evaluation. For instance, assuming a compiler that encodes string literals as UTF-8, we want the following to work:

static_assert(std::ranges::equal("$£Иह€한𐍈" | utf8::views::decode, std::array{
    0x00000024, 0x000000a3, 0x00000418, 0x00000939, 0x000020ac, 0x0000d55c, 0x00010348, 0x00000000}));

The previous static_assert also assumes that the whole implementation is constexpr, which would be nice too, I guess.