Closed ghost closed 6 months ago
Ok, why is this needed?
Objeck stores all characters as platform-specific Unicode (i.e., 32-bit or 16-bit) strings. All character string I/O is done with UTF-8 strings, which are internally converted to wchar_t, a platform-specific type.
Ok, why is this needed?
Strings on .NET and Qt are UTF-16. It's the same reasons why they added support for UTF-8 strings.
In the end, UTF-8 has become the standard character format for string exchange. When Objeck runs on Windows, strings are stored internally as UTF-16; on Linux and macOS, as UTF-32. What is missing is support for reading/writing UTF-16 or UTF-32 streams, which was a design choice in favor of supporting UTF-8.
I am going to close this, as the method below exists.
The ToString(..) method will convert a UTF-8 stream into the host's native character format. The assumption is that Objeck I/O deals with UTF-8 streams. I have not encountered a use case for UTF-16 or UTF-32 streams outside of "it is doable," as they are not standard data exchange formats.
ba := Byte->New[3];
ba[0] := 'p';
ba[1] := 'h';
ba[2] := 'i';
ba->ToString()->PrintLine();
What about extending the existing ByteArrayRef class like this?
u8str := ByteArrayRef->New("My UTF-8 string");
The code above is already supported to convert a UTF-8 byte array.
As far as I know, an array of bytes can be used for arbitrary binary data, too. Is calling the ->ToString() method (commented as a Unicode conversion in commit https://github.com/objeck/objeck-lang/commit/0e331fcadc0917dad927712f3e0e333114cd539a) the right thing to do?
P.S.: this is only a question.
Yes. It was my mistake not to look at the API doc first. There is already a String->ToByteArray() method. So, instead of:
u8str := ByteArrayRef->New("My UTF-8 string");
it will be:
u8str := ByteArrayRef->New("My UTF-8 string"->ToByteArray());
No, the prior code did a character-to-character assignment. The updated code translates a string to UTF-8 and then does a character-to-character mapping, an edge-case that was corrected.
Yes
String->ToByteArray() is not UTF-8!
Well, it's just like casting wchar_t* to char*. The result is an array of bytes, but it's not UTF-8.
use Collection;

class Test {
   function : Main(args : String[]) ~ Nil {
      str := "Здравствуйте";
      System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
      u8str := str->ToByteArray();
      each(i := u8str) {
         c : Int := i;
         c->Print();
         " "->Print();
      }
      ""->PrintLine();
      u16str := str->ToCharArray();
      each(i := u16str) {
         c : Int := i;
         c->Print();
         " "->Print();
      }
   }
}
417 434 440 430 432 441 442 432 443 439 442 435
is UTF-16 (I'm on Windows).
17 34 40 30 32 41 42 32 43 39 42 35
is not UTF-8.
This is the result when you cast the UTF-16 array above to char*:
17 04 34 04 40 04 30 04 32 04 41 04 42 04 32 04 43 04 39 04 42 04 35 04
C code:
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <string.h>

extern void print_hex_u8(const char* const);
extern void print_hex_u16(const wchar_t* const, size_t);

int main(void) {
    const wchar_t* str = L"Здравствуйте";
    print_hex_u16(str, wcslen(str));
    print_hex_u8((char*) str);
    return EXIT_SUCCESS;
}

void print_hex_u8(const char *const str) {
    for (size_t i = 0; i < strlen(str); ++i) {
        printf("%02x ", (unsigned char) str[i]);
    }
    printf("\n");
}

void print_hex_u16(const wchar_t *const str, size_t str_len) {
    for (size_t i = 0; i < str_len; ++i) {
        printf("%02x ", (unsigned) str[i]);
    }
    printf("\n");
}
You can clearly see the similarity.
This is UTF-8:
d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5
@objeck
This is fixed in the system library; the code was calling a Char[] ToString() routine that mapped a single character to a single byte. The code now calls ToBytes(), which does a Unicode conversion of Char[] to Byte[].
Program
class Test {
   function : Main(args : String[]) ~ Nil {
      str := "Здравствуйте";
      System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
      u8str := str->ToByteArray();
      each(i := u8str) {
         c : Int := i;
         c->Print();
         " "->Print();
      }
      ""->PrintLine();
      u16str := str->ToCharArray();
      each(i := u16str) {
         c : Int := i;
         c->Print();
         " "->Print();
      }
   }
}
Output
ffffffffffffffd0 ffffffffffffff97 ffffffffffffffd0 ffffffffffffffb4 ffffffffffffffd1 ffffffffffffff80 ffffffffffffffd0 ffffffffffffffb0 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff81 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff83 ffffffffffffffd0 ffffffffffffffb9 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb5
417 434 440 430 432 441 442 432 443 439 442 435
Thank you. Btw, how do I get rid of the ffffffffffffff in the output? It seems to be the sign. In C, I can simply cast to unsigned to remove it, but Objeck doesn't offer unsigned types.
Try the following: no cast is needed for a Byte. Casting a Byte to an Int widens both positive and negative values (i.e., the -128 to 127 range).
str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
   b->Print();
   " "->Print();
}
Then how do I get rid of the 00000000000000? They are as bad as the ffffffffffffff.
It's a byte; use the Byte type; the values are negative integers. Otherwise, convert the negative Byte values to positive Int values. For reference, try ToString(), and you will see the string representation of the bytes stored. As such, negative Int values printed as unsigned start with 0xffff… per two's complement.
Correct, Objeck does not convert from signed to unsigned values; the language does not support unsigned types. If you must, try the following (it's not ideal), or use the Byte type.
str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
   i := b->As(Int)->Abs()->ToString();
   i->ToInt()->PrintLine();
   " "->Print();
}
We are misunderstanding each other. I'm asking for a way to format the output, similar to C's printf. The magic is really the "%02x " format specifier of printf; Objeck has nothing equivalent. Yes, I know the ffffffffffffff and the 00000000000000 are always there, but I should be able to format the output so they are not shown.
Ok, opened enhancement #479 for that.
Try the following after building the latest code. There's a call to c->ToHexString() to upcast a Byte or Char using the absolute value of the Byte. From there, you can convert the hex string into a Char or Int if you want to ignore signs between types.
class Test {
   function : Main(args : String[]) ~ Nil {
      str := "Здравствуйте";
      System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
      u8str := str->ToByteArray();
      each(i := u8str) {
         c : Byte := i;
         c->ToHexString()->Print();
         " "->Print();
      }
      ""->PrintLine();
      u16str := str->ToCharArray();
      each(i := u16str) {
         c : Int := i;
         c->Print();
         " "->Print();
      }
   }
}
Correct, Objeck does not convert from signed to unsigned values. The language does not support unsigned values. If you must, try the following (it's not ideal), or use the Byte type.
str := "𤭢"; System.IO.Standard->SetIntFormat(System.Number->Format->HEX); u8str := str->ToByteArray(); each(b in u8str) { i := b->As(Int)->Abs()->ToString(); i->ToInt()->PrintLine(); " "->Print(); }
This code is wrong. It will not print the correct output.
Output of this code:
10 5c 53 5e
Correct UTF-8:
https://dencode.com/string/hex?v=%F0%A4%AD%A2&oe=UTF-8&nl=crlf&separator-each=&case=lower
ToHexString gives the same output as ToString:
-48 -105 -48 -76 -47 -128 -48 -80 -48 -78 -47 -127 -47 -126 -48 -78 -47 -125 -48 -71 -47 -126 -48 -75
-48 -105 -48 -76 -47 -128 -48 -80 -48 -78 -47 -127 -47 -126 -48 -78 -47 -125 -48 -71 -47 -126 -48 -75
Code:
class Test {
   function : Main(args : String[]) ~ Nil {
      str := "Здравствуйте";
      System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
      u8str := str->ToByteArray();
      each(i := u8str) {
         c : Byte := i;
         c->ToHexString()->Print();
         " "->Print();
      }
      ""->PrintLine();
      each(i := u8str) {
         c : Byte := i;
         c->ToString()->Print();
         " "->Print();
      }
   }
}
Update: From what I have read, the cast to unsigned char* in C simply reinterprets the same bits as unsigned; it does not drop the sign. So removing the sign by taking the absolute value, as you are doing, is wrong.
@iahung2, the code from yesterday needed to be completed.
I just wrapped it up, here's an example. Note changing the compiler and runtime takes time to implement and test.
Output:
Byte to Int
---
d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5
Byte to Hex String
---
ffffffffffffffd0 ffffffffffffff97 ffffffffffffffd0 ffffffffffffffb4 ffffffffffffffd1 ffffffffffffff80 ffffffffffffffd0 ffffffffffffffb0 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff81 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff83 ffffffffffffffd0 ffffffffffffffb9 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb5
Char to Int
---
417 434 440 430 432 441 442 432 443 439 442 435
My comment: I have lost track of the commits. I don't know what you have done with the Byte->ToInt() method to achieve this, but I don't think it's right. Byte is a signed type; Int is also a signed type. Casting from Byte to Int should preserve the sign. You hacked it to display the correct output for this very particular use case of mine, but it will fail for different use cases in the future.
Try, should work.
str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
   i := b->ToInt();
   i->PrintL();
   " "->Print();
}
So, you introduced a modified print method to make it print the expected output?
Update: There is no such method as PrintL. Compilation failed.
Wow, this was not a hack, and I am offended. The code does an unsigned upconversion from a byte (1 byte) => wchar_t (2 or 4 bytes) => int (8 bytes).
It's not a hack; try i->Print();
I'm sorry. I was only confused because I thought it was a hack for my particular use case that could affect other use cases in the future. It turned out that I was wrong, and I'm glad about that. I don't know why you are so sensitive about the word "hack," though.
I got the idea from QByteArray: https://doc.qt.io/qt-6/qbytearray.html
This is mainly to support UTF-8 strings on Objeck.
.NET also has ReadOnlySpan<byte> and the u8 suffix for UTF-8 strings.