pure-c / purec

C backend for PureScript

Implement C FFI for purescript-strings #34

Closed felixSchl closed 3 years ago

felixSchl commented 5 years ago

Required by #12

felixSchl commented 5 years ago

@Lupino let's explore what we have to do to get this going. Also note that I am pretty sure that the library is not actually required by #12 after all, so we could potentially avoid implementing it altogether.

felixSchl commented 5 years ago

Also, JS engines apparently encode strings in UTF-16 or UCS-2. I see no appeal in matching JS just for the sake of being similar, but I wonder what the rationale behind that choice is and whether we should do the same or not.

felixSchl commented 5 years ago

Also interesting: https://blog.golang.org/strings

derekdreery commented 5 years ago

I think rust-lang's string handling is awesome, and a good way to learn about Unicode and the different encodings. At the risk of saying stuff that's already known (note that below I use JavaScript and ECMAScript interchangeably):

In the beginning there was ASCII, where a character is a 7-bit code stored in an 8-bit integer (since there's no maths involved, signed vs. unsigned doesn't matter). Bit patterns 0x00 to 0x7F were the standard ASCII character set, and bit patterns 0x80 to 0xFF were a second, implementation-defined page that could be used for, say, É in French, or the box-drawing characters in DOS.

This became a bit of a mess, with lots of encodings using different second pages. There were extensions that I don't fully understand, but in the end the great and the good got together and developed the Unicode standard, where each atom (code point) is a number up to U+10FFFF, conventionally stored in a 4-byte unit. This address space should be big enough to contain all possible characters in all languages.

However, if most of the text you write is ASCII, you are wasting 3/4 of your space on zero bytes if you store every character as a full 4-byte code point, so in general folks use some kind of encoding scheme that compresses this data. The most popular is UTF-8, where the first 128 characters (0x00 to 0x7F) are just their ASCII representation, and any wider characters are written as a lead byte followed by continuation bytes (UTF-8 is defined byte by byte, so endianness doesn't come into it). This means that, for example, the string "test" takes 4 bytes, the string "tÉst" takes 5, and so on.

Well, that last sentence was a bit of a lie. The character É can be made of 1 or 2 Unicode code points: it exists as a single code point (U+00C9), but it can also be produced with the sequence 'E' followed by U+0301, the Unicode combining acute accent. Just check out the acute accent Wikipedia page to get a feel for how many esoteric Unicode code points there are. When there is more than one way to represent the same physical glyph(s), there is sometimes a canonical representation, and changing all representations to that one is called normalization.
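
To make those byte counts concrete, here is a small standalone C check (just an illustration; the hex escapes are the UTF-8 bytes for É and for the combining acute accent, so the source file can stay ASCII):

```c
#include <stdio.h>
#include <string.h>

/* Print a string's byte length and each UTF-8 byte in hex. */
static void dump(const char *label, const char *s) {
    printf("%-22s %zu bytes:", label, strlen(s));
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        printf(" %02X", *p);
    printf("\n");
}

int main(void) {
    dump("test",                "test");          /* 4 bytes: 74 65 73 74            */
    dump("precomposed E-acute", "t\xC3\x89st");   /* 5 bytes: U+00C9 -> C3 89        */
    dump("E + combining acute", "tE\xCC\x81st");  /* 6 bytes: 'E' + U+0301 -> CC 81  */
    return 0;
}
```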

Groups of code points that represent a single character or other unit are sometimes called grapheme clusters, but there are other names too I believe.

I don't think C knows anything about any of this. In Rust, strings work the same as a vector, in that their structure on the stack is [ pointer, length, capacity ], where pointer points to a single unstructured block of data on the heap, length is the current length of the string, and capacity is the length of the allocated memory. In C, a string is just a pointer, and it is up to the user to know how much allocated memory there is; the length is determined by the first occurrence of the NUL (\0, 0x00) character. In Rust, when working with C FFI it's generally easier to convert to a [pointer, length] or [pointer, length, capacity] format for ease of use, unless that causes unacceptable performance problems.
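
For comparison, a length-carrying representation along the Rust lines might look like this in C (a sketch for illustration only; these are not purec's actual types, and error handling is omitted):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-carrying string, mirroring Rust's
 * (pointer, length, capacity) layout. Not purec's actual type. */
typedef struct {
    char  *data; /* heap buffer, not required to be NUL-terminated */
    size_t len;  /* bytes currently in use */
    size_t cap;  /* bytes allocated */
} str_buf;

/* Copy a plain C string into the length-carrying form. */
static str_buf str_buf_from_cstr(const char *s) {
    str_buf b;
    b.len  = strlen(s);       /* here the length is still found by scanning for '\0' */
    b.cap  = b.len + 1;
    b.data = malloc(b.cap);
    memcpy(b.data, s, b.cap); /* copies the trailing '\0' as well */
    return b;
}
```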

Hope this is useful in any case.

PS: UTF-16 is less compact and less standard than UTF-8, but it is the native encoding on Windows, and ECMAScript strings (though not all of them, I think) use it.

PPS: it appears (see USVString) that strings in JavaScript are always UTF-16, but there are some opaque types that may encode the string differently; they will be converted to UTF-16 if they are accessed as Strings in ECMAScript.

felixSchl commented 5 years ago

@derekdreery That was extremely useful, thank you for taking the time to write this up. I originally took up utf8.h as it was easy to get started with; however, it seems my understanding was incomplete. The current type representing strings is effectively a void* assumed to hold UTF-8 encoded contents with a terminating '\0' byte. There are a few things to note about the current representation:

  1. We require the managed_t wrapper and a GC_register_finalizer because we assume a non-GC allocated string (due to vasprintf). This is largely unrelated to the discussion, but if we can clean this up we might as well do it now.
  2. We do not track the capacity as we assume that length always equals capacity-1. I think this should be OK, though, because strings are immutable? In other words, I don't think we'll be shrinking or growing strings under the same pointer.

If we put (1) aside for a moment, and assuming that immutability means (2) is not an issue, it would seem that we can safely move ahead with the current representation? How does UTF-16 relate to all of this, and will it have an impact on which representation we should choose?
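
Regarding (1): if the only reason for the finalizer is that vasprintf hands back a malloc'd buffer, one option would be to measure with vsnprintf and format straight into GC-owned memory instead. A rough sketch, assuming the runtime links against the Boehm GC (gc.h); the helper name is made up and error handling is minimal:

```c
#include <gc.h>      /* Boehm GC: GC_MALLOC_ATOMIC */
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical replacement for the vasprintf path: format directly into
 * GC-managed, pointer-free memory, so no finalizer is needed to free it. */
static char *gc_vasprintf(const char *fmt, va_list ap) {
    va_list ap2;
    va_copy(ap2, ap);
    int n = vsnprintf(NULL, 0, fmt, ap);          /* measure the output */
    if (n < 0) { va_end(ap2); return NULL; }
    char *buf = GC_MALLOC_ATOMIC((size_t)n + 1);  /* collected automatically */
    vsnprintf(buf, (size_t)n + 1, fmt, ap2);      /* actually format */
    va_end(ap2);
    return buf;
}
```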

derekdreery commented 5 years ago

Sounds good re. 1 and 2

I'm not an expert on UTF-16, although it may be that some code points that are valid in UTF-8 are not valid in UTF-16, or the other way round. I'm not sure; it's complicated :(

derekdreery commented 5 years ago

Since strings are immutable, you could probably use some more complex data structure to save space and cut down on copying, but that's probably not part of the MVP.

felixSchl commented 5 years ago

I guess from here on out it's a matter of forking purescript-strings and seeing how far we can get. Are you interested at all in giving this a go? I've got my eyes set on getting purescript-httpure going in PureC, using the Aff port and the libuv bindings in purec-uv. I think purescript-strings is an obvious missing piece, and it would be a great use case/demonstration of PureC.

anttih commented 5 years ago

I had a go at this using utf8.h. The issue is that, as you already mentioned, we are not able to represent all the characters that PureScript parses, since it uses UTF-16 😢.

Unfortunately, I don't see any other way than to switch to UTF-16 strings in purec as well. What do you think we should do, @derekdreery @felixSchl? Do you know of any lightweight UTF-16 libraries for C?

anttih commented 5 years ago

@andyarvanitis purescript-native uses UTF-8 for strings, but I also saw that purescript-strings uses UTF-16 and UTF-32 for chars. Not sure I understand this. Is there a conversion somewhere between UTF-8 and UTF-16, or is it all just UTF-8?

andyarvanitis commented 5 years ago

@anttih For purescript-native, the plan is just to use UTF-8. I've yet to implement its FFI functions for the current runtime, though (the older pure11 implementation was also UTF-8). BTW, I hope to get to this quite soon. Interestingly enough, I had also started looking at utf8.h, since the C++ std codecvt stuff is deprecated in C++17. But there shouldn't be any valid unicode code points not representable in UTF-8, unless you're talking about the (UTF-16-invalid) lone surrogates that javascript supports. I just didn't support them in pure11 and planned to do the same for purescript-native. Were there any other unicode characters/entities you encountered that were a problem?

anttih commented 5 years ago

Yeah, I read some more and it seems the only problem is with the lone surrogates. I think we can go with UTF-8 then :)

anttih commented 5 years ago

There's another big question here, I think, if we use UTF-8: what to do with the APIs of the CodePoints and CodeUnits modules? In PureScript's JS backend, a Char is a UTF-16 code unit, whereas right now in purec a Char is a 32-bit code point. I think what purec does is better, but then the CodePoints API doesn't make that much sense, since we can already represent all code points as proper Chars.

For example, the CodePoints module has:

codePointAt :: Int -> String -> Maybe CodePoint

while the CodeUnits module has:

charAt :: Int -> String -> Maybe Char

For purec the CodePoint version behaves the same way as the Char one.

To be compatible we should implement both APIs, but that makes things a bit confusing. Do we have any other choice than to implement both?
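
For what it's worth, over a UTF-8 buffer the CodePoints side is the natural one to implement directly; a codePointAt could look roughly like this (a sketch only, assuming well-formed UTF-8 and leaving out the actual FFI plumbing):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s into *out.
 * Returns the number of bytes consumed (1-4), or 0 at the terminator.
 * Assumes well-formed UTF-8; real code would have to validate. */
static size_t utf8_decode(const char *s, uint32_t *out) {
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] == 0x00) return 0;
    if (p[0] < 0x80) { *out = p[0]; return 1; }
    if ((p[0] & 0xE0) == 0xC0) {
        *out = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *out = ((uint32_t)(p[0] & 0x0F) << 12)
             | ((uint32_t)(p[1] & 0x3F) << 6)
             |  (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    *out = ((uint32_t)(p[0] & 0x07) << 18)
         | ((uint32_t)(p[1] & 0x3F) << 12)
         | ((uint32_t)(p[2] & 0x3F) << 6)
         |  (uint32_t)(p[3] & 0x3F);
    return 4;
}

/* codePointAt-style lookup: the nth code point of a NUL-terminated UTF-8
 * string. Returns 1 and writes *out on success, or 0 if n is out of range
 * (the Maybe in the PureScript signature). */
static int utf8_code_point_at(size_t n, const char *s, uint32_t *out) {
    uint32_t cp = 0;
    size_t step;
    while ((step = utf8_decode(s, &cp)) != 0) {
        if (n == 0) { *out = cp; return 1; }
        n--;
        s += step;
    }
    return 0;
}
```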

Edit: here's a nice explanation of the strings API by @hdgarrood

hdgarrood commented 5 years ago

Right now, a PureScript String is defined to be any sequence of UTF-16 code units (including those which don't form valid Unicode strings). You're free to use whatever encoding / layout in memory, but if you can't implement a bijection String -> Array Word16, your backend is technically incorrect. Having the internal representation of strings in this backend be UTF-8 doesn't necessarily mean you can't conform, though; see UTF-16 (Wikipedia):

U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.

However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors.

It is possible to unambiguously encode an unpaired surrogate (a high surrogate code point not followed by a low one, or a low one not preceded by a high one) in UTF-16 by using a code unit equal to the code point. The majority of UTF-16 encoder and decoder implementations do this when translating between encodings.[citation needed] Windows allows unpaired surrogates in filenames and other places, which generally means they have to be supported by software in spite of their exclusion from the Unicode standard.

Implementing the Data.String.CodeUnits API will probably be a bit awkward, as you'll probably end up having to wrap every function implementation with an encodeUtf16 on the way in and/or a decodeUtf16 on the way out, but I think it's doable, provided that the string libraries you're using don't mind encoding the code points U+D800 to U+DFFF in UTF-8 (which they probably don't; see above).
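
For reference, the per-code-point half of that encodeUtf16 step is small; a sketch (it assumes the code points have already been decoded out of the UTF-8 buffer, and it passes U+D800 to U+DFFF through as single units, per the convention quoted above):

```c
#include <stddef.h>
#include <stdint.h>

/* Expand one decoded code point into UTF-16 code units.
 * Code points in U+D800..U+DFFF pass through as single (lone surrogate)
 * units; everything above U+FFFF becomes a surrogate pair.
 * Returns the number of units written (1 or 2). */
static size_t to_utf16_units(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {          /* BMP, including lone surrogates */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;               /* split into a surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```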

I think there is perhaps a case to be made for having the builtin String type represent a sequence of Unicode code points, where we don't define the encoding, but this would require careful planning and a fairly big breaking change for the purescript-strings library (we'd probably want to rip out Data.String.CodeUnits and put it in a separate library, which would be marked only for use with JS, and only where you specifically need to handle lone surrogates).

anttih commented 5 years ago

You're free to use whatever encoding / layout in memory, but if you can't implement a bijection String -> Array Word16, your backend is technically incorrect. Having the internal representation of strings in this backend be UTF-8 doesn't necessarily mean you can't conform, though; see UTF-16 (Wikipedia):

It does mean though that Char cannot represent a code point, but has to be a Word16, right?

hdgarrood commented 5 years ago

Yes, although I would be in favour of changing the language so that Char is actually a code point rather than a UTF-16 code unit. I think having the UTF-16 representation leak through there is a bit of a wart; it's sort of inconsistent with the approach we've taken in the strings library, which is that we provide functions you can use if speed is really necessary / you know you need to operate with UTF-16 code units, but the functions you get if you just import Data.String are the ones which operate on code points and which don't expose details of how the strings are actually encoded in memory.

natefaubion commented 5 years ago

@hdgarrood Should we create a compiler ticket? I agree, and I would like it if the language was representation agnostic.

hdgarrood commented 5 years ago

Sounds good to me :+1:

anttih commented 5 years ago

So either we

a) change Char in purec to be a UTF-16 code unit, and find a way to encode/decode between the UTF-8 encoded String and Char for use in CodeUnits (I didn't find anything that implements this from a quick googling)

b) have Char be a full (32-bit) code point and implement CodeUnits by actually using code points underneath 😅

a) doesn't sound appealing to me. It might be a lot of work without making anything better, other than being compatible with the current language. I think there has to be a middle ground here. Maybe we could solve some of the issues not by changing the language but by not using Char all that much, as suggested here.

anttih commented 5 years ago

Another challenge here is that we need to support the NUL byte (U+0000) to be Unicode compliant, so just using NUL-terminated strings won't work. I think we have to switch to a string "slice" where we track the length.

Edit: it looks like we could encode \0 as the two bytes 0xC0 0x80: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
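
If we go the 0xC0 0x80 route, the write side is a tiny transformation; a sketch (the function name is made up, and a decoder would map C0 80 back to U+0000):

```c
#include <stddef.h>

/* Modified-UTF-8 style workaround for embedded NULs: when building the
 * buffer, write U+0000 as the overlong pair C0 80, so strlen/'\0' framing
 * keeps working; decoders then map C0 80 back to U+0000.
 * dst must have room for up to 2 * n_bytes + 1 bytes. Sketch only. */
static size_t encode_modified_utf8(char *dst, const char *src, size_t n_bytes) {
    size_t j = 0;
    for (size_t i = 0; i < n_bytes; i++) {
        if (src[i] == '\0') {
            dst[j++] = (char)0xC0;   /* overlong encoding of U+0000 */
            dst[j++] = (char)0x80;
        } else {
            dst[j++] = src[i];
        }
    }
    dst[j] = '\0';                   /* the terminator stays unambiguous */
    return j;
}
```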