sheredom / utf8.h

📚 single header utf8 string functions for C and C++
The Unlicense

provide a function to get the previous codepoint #61

Closed mokafolio closed 3 years ago

mokafolio commented 5 years ago

As far as I can tell, the library currently only provides a way to iterate over a byte sequence in one direction, using utf8codepoint. It would be super useful to have a function that goes the other way too, and possibly to rename the pair to utf8next and utf8prev or something similar. I'd be down to add this if it's something you'd merge!

Thanks for the library, it's very clean, lightweight and useful!

mokafolio commented 5 years ago

Something along these lines would be useful so you can use the library to implement iterators and similar things:

// returns 1 if the byte is not the start of a utf8 codepoint
int utf8trail(unsigned char byte)
{
  return (byte & 0xc0) == 0x80;
}

// returns the size of a utf8 codepoint in bytes based on the starting byte of it.
int utf8parsesize(unsigned char _startByte)
{
  if((_startByte & 0x80) == 0) //ascii
    return 1;
  else if((_startByte & 0xe0) == 0xc0)
    return 2;
  else if((_startByte & 0xf0) == 0xe0)
    return 3;
  else if((_startByte & 0xf8) == 0xf0)
    return 4;
  //error
  return 0;
}

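// returns the address of the next codepoint
void *utf8next(const void *utf8_restrict str)
{
  const char *s = (const char *)str;
  int sz = utf8parsesize((unsigned char)s[0]);
  // advance by at least one byte so an invalid lead byte cannot stall iteration
  return (void*)(s + (sz ? sz : 1));
}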

// returns the address of the previous codepoint
// the caller must ensure str does not point at the start of the buffer
void *utf8prev(const void *utf8_restrict str)
{
  const char *s = (const char *)str;
  do {
    --s;
  } while(utf8trail((unsigned char)s[0]));
  return (void*)s;
}

//decodes the utf8 codepoint at str, saves the number of bytes used to out_size
utf8_int32_t utf8decode(const void *utf8_restrict str, int * out_size)
{
  const char *s = (const char *)str;
  int sz = utf8parsesize(s[0]);
  if(sz == 4)
  {
    *out_size = 4;
    return ((0x07 & s[0]) << 18) | ((0x3f & s[1]) << 12) |
                     ((0x3f & s[2]) << 6) | (0x3f & s[3]);
  }
  else if(sz == 3)
  {
    *out_size = 3;
    return ((0x0f & s[0]) << 12) | ((0x3f & s[1]) << 6) | (0x3f & s[2]);
  }
  else if(sz == 2)
  {
    *out_size = 2;
    return ((0x1f & s[0]) << 6) | (0x3f & s[1]);
  }

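  // ascii, or an invalid lead byte (sz == 0): consume one byte
  *out_size = 1;
  // cast so bytes >= 0x80 are not returned as negative values where char is signed
  return (unsigned char)s[0];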
}

// new implementation of utf8codepoint using some of the stuff added to avoid duplicate implementation code
void *utf8codepoint(const void *utf8_restrict str,
                    utf8_int32_t *utf8_restrict out_codepoint) {
  int byte_count;
  *out_codepoint = utf8decode(str, &byte_count);
  return (void*)((const char*)str + byte_count);
}

I know that is a lot of new stuff, but it would make the library a lot more useful for me and, I am sure, for others too. I guess utf8trail is not really needed as an extra function and could be removed.

Let me know if you think it's useful. I can make a pull request with the newly added code and add tests for it, too.
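
For anyone wanting to try backward iteration before anything lands in the library, here is a minimal, self-contained sketch of the idea. None of these names exist in utf8.h: is_trail, prev_bounded, decode_at and codepoints_reversed are hypothetical helpers made up for illustration. Unlike the utf8prev proposed above, prev_bounded takes the start of the buffer so it cannot walk past it, and decode_at assumes the input is already well-formed UTF-8 and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if b is a UTF-8 continuation byte (10xxxxxx). */
static int is_trail(unsigned char b) { return (b & 0xc0) == 0x80; }

/* Hypothetical bounded variant of the proposed utf8prev: steps back from p to
   the start of the previous codepoint without ever moving before begin.
   Returns begin when p is already at the start. */
static const char *prev_bounded(const char *begin, const char *p)
{
  assert(begin <= p);
  if (p == begin)
    return begin;
  do {
    --p;
  } while (p > begin && is_trail((unsigned char)*p));
  return p;
}

/* Decodes the codepoint starting at p.
   Assumption: the bytes are well-formed UTF-8; nothing is validated here. */
static int32_t decode_at(const char *p)
{
  const unsigned char *s = (const unsigned char *)p;
  if (s[0] < 0x80)
    return s[0];
  if ((s[0] & 0xe0) == 0xc0)
    return ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
  if ((s[0] & 0xf0) == 0xe0)
    return ((s[0] & 0x0f) << 12) | ((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
  return ((s[0] & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
         ((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
}

/* Walks the len-byte string str from the end to the start and writes its
   codepoints into out in reverse order. Stores at most cap entries and
   returns the number written. */
static size_t codepoints_reversed(const char *str, size_t len,
                                  int32_t *out, size_t cap)
{
  const char *p = str + len;
  size_t n = 0;
  while (p > str && n < cap) {
    p = prev_bounded(str, p);
    out[n++] = decode_at(p);
  }
  return n;
}
```

Calling codepoints_reversed on the 6-byte string "a\xc3\xa9\xe2\x82\xac" (a, é, €) with room for at least three entries should yield U+20AC, U+00E9 and U+0061, in that order. Passing the buffer start explicitly costs one extra argument, but it is what makes the backward walk safe on a pointer that may already be at the beginning.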

sheredom commented 4 years ago

I'm so sorry - for some unknown reason all my repositories were unwatched by my account (I've no idea how) and it meant that every issue on every repo I have was missed by me :(

I'd be 100% fine with adding all these helper functions - assuming you are willing to add the functions + documentation + tests I'll hit merge on that PR!

mokafolio commented 4 years ago

All good, my TODO list is way too long right now, I'll put it down and hope to get to it in the near future :)