objeck / objeck-lang

Objeck is a modern object-oriented programming language with functional features tailored for machine learning. It emphasizes expression, simplicity, portability, and scalability. The programming environment consists of a compiler, virtual machine, REPL shell, and command line debugger with IDE plugins.
https://objeck.org

Add `ByteArray` class #472

Closed ghost closed 6 months ago

ghost commented 7 months ago

I got the idea from QByteArray:

https://doc.qt.io/qt-6/qbytearray.html

This is mainly to support UTF-8 strings on Objeck.

.NET also has ReadOnlySpan<byte> and u8 suffix for UTF-8 strings.

objeck commented 7 months ago

Ok, why is this needed?

Objeck stores all characters as platform-specific Unicode (i.e., 32-bit or 16-bit character) strings. All character string I/O is done with UTF-8 strings, which are internally converted to wchar_t, a platform-specific type.

ghost commented 7 months ago

Ok, why is this needed?

Strings in .NET and Qt are UTF-16. It's for the same reason that they added support for UTF-8 strings.

objeck commented 7 months ago

Ok, why is this needed?

Strings in .NET and Qt are UTF-16. It's for the same reason that they added support for UTF-8 strings.

In the end, UTF-8 has become the standard character format for string exchange. When Objeck runs on Windows, strings are stored internally as UTF-16; on Linux and macOS, they are stored as UTF-32. What is missing is support for reading and writing UTF-16 or UTF-32 streams, which was a design choice made in favor of supporting UTF-8.

objeck commented 7 months ago

I am going to close this, as the method below exists.

The ToString(..) method will convert a UTF-8 stream into the host's native character format. The assumption is that Objeck I/O is dealing with UTF-8 streams. I have not encountered a use case for UTF-16 or UTF-32 streams outside of "it is doable," as they are not standard data exchange formats.

ba := Byte->New[3];
ba[0] := 'p';
ba[1] := 'h';
ba[2] := 'i';
ba->ToString()->PrintLine();

ghost commented 7 months ago

What about extending the existing ByteArrayRef class like this?

u8str := ByteArrayRef->New("My UTF-8 string");

objeck commented 7 months ago

The code above is already supported to convert a UTF-8 byte array.
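As a rough illustration of what converting a UTF-8 byte array into wider characters involves, here is a minimal C sketch. It is not Objeck's implementation: the name `utf8_decode` and its signature are hypothetical, and validation is deliberately basic (bad lead or continuation bytes are rejected, overlong forms and surrogates are not).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Objeck: decodes the UTF-8 bytes in `in`
   into Unicode code points in `out` (capacity `out_cap`).
   Returns the number of code points written, or -1 on malformed input
   or insufficient output space. Overlong forms are not rejected. */
long utf8_decode(const unsigned char *in, size_t in_len,
                 uint32_t *out, size_t out_cap) {
    size_t i = 0, n = 0;
    while (i < in_len) {
        unsigned char lead = in[i];
        uint32_t cp;
        size_t extra; /* number of continuation bytes that follow */
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return -1; /* stray continuation byte or invalid lead byte */
        if (extra > in_len - i - 1 || n >= out_cap) return -1;
        for (size_t k = 1; k <= extra; ++k) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) return -1; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return (long) n;
}
```

Passing the three bytes 'p', 'h', 'i' yields three code points, one per byte, because ASCII bytes map to themselves; multi-byte sequences collapse into a single code point each.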

ghost commented 7 months ago

As far as I know, an array of bytes can hold arbitrary binary data, too. Is calling the ->ToString() method (described as a Unicode conversion in commit https://github.com/objeck/objeck-lang/commit/0e331fcadc0917dad927712f3e0e333114cd539a) the right thing to do?

P.S. This is only a question.

ghost commented 7 months ago

The code above is already supported to convert a UTF-8 byte array.

Yes. It was my mistake not to look at the API docs first. There is already a String->ToByteArray() method. So, instead of:

u8str := ByteArrayRef->New("My UTF-8 string");

It will be:

u8str := ByteArrayRef->New("My UTF-8 string"->ToByteArray());

objeck commented 7 months ago

No, the prior code did a character-to-character assignment. The updated code translates a string to UTF-8 and then does a character-to-character mapping, an edge-case that was corrected.

objeck commented 7 months ago

Yes

ghost commented 6 months ago

String->ToByteArray() is not UTF-8!

It's just as if you cast wchar_t* to char*. The result is an array of bytes, but it's not UTF-8.

use Collection;

class Test {
    function : Main(args : String[]) ~ Nil {
        str := "Здравствуйте";
        System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
        u8str := str->ToByteArray();
        each(i := u8str) {
            c : Int := i;
            c->Print();
            " "->Print();
        }
        ""->PrintLine();
        u16str := str->ToCharArray();
        each(i := u16str) {
            c : Int := i;
            c->Print();
            " "->Print();
        }
    }
}

417 434 440 430 432 441 442 432 443 439 442 435 is UTF-16 (I'm on Windows).

17 34 40 30 32 41 42 32 43 39 42 35 is not UTF-8.

This is the result when you cast the UTF-16 array above to char*:

17 04 34 04 40 04 30 04 32 04 41 04 42 04 32 04 43 04 39 04 42 04 35 04

C code:

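#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <string.h>

void print_hex_u8(const char *str);
void print_hex_u16(const wchar_t *str, size_t str_len);

int main(void) {
    const wchar_t *str = L"Здравствуйте";
    /* Print each wchar_t code unit (UTF-16 on Windows). */
    print_hex_u16(str, wcslen(str));
    /* Reinterpret the same memory as bytes; this is not a UTF-8 conversion. */
    print_hex_u8((const char *) str);
    return EXIT_SUCCESS;
}

/* Prints bytes up to the first zero byte, so the whole string is shown only
   when no code unit contains a zero byte (true here with a 16-bit wchar_t). */
void print_hex_u8(const char *str) {
    const size_t len = strlen(str);
    for (size_t i = 0; i < len; ++i) {
        printf("%02x ", (unsigned char) str[i]);
    }
    printf("\n");
}

void print_hex_u16(const wchar_t *str, size_t str_len) {
    for (size_t i = 0; i < str_len; ++i) {
        printf("%02x ", (unsigned int) str[i]);
    }
    printf("\n");
}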

You can clearly see the similarity.

This is UTF-8:

d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5
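To show where bytes such as d0 97 come from, here is a minimal C sketch of encoding one Unicode code point as UTF-8. The function name `utf8_encode` is hypothetical and is not an Objeck or C library call; it follows the standard UTF-8 bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of code point `cp` into
   `out` (at least 4 bytes) and returns the number of bytes written,
   or 0 if `cp` is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {    /* surrogates are not encodable */
        return 0;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For the first character of the test string, U+0417, the upper five bits give 0xC0 | 0x10 = 0xd0 and the lower six bits give 0x80 | 0x17 = 0x97, which matches the first two bytes listed above.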

ghost commented 6 months ago

@objeck

objeck commented 6 months ago

This is fixed in the system library. The code was calling a ToString() routine on the Char[], which mapped each character to a single byte. The code now calls ToBytes(), which performs a Unicode conversion from Char[] to Byte[].

Program

class Test {
    function : Main(args : String[]) ~ Nil {
        str := "Здравствуйте";
        System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
        u8str := str->ToByteArray();
        each(i := u8str) {
            c : Int := i;
            c->Print();
            " "->Print();
        }
        ""->PrintLine();
        u16str := str->ToCharArray();
        each(i := u16str) {
            c : Int := i;
            c->Print();
            " "->Print();
        }
    }
}

Output

ffffffffffffffd0 ffffffffffffff97 ffffffffffffffd0 ffffffffffffffb4 ffffffffffffffd1 ffffffffffffff80 ffffffffffffffd0 ffffffffffffffb0 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff81 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff83 ffffffffffffffd0 ffffffffffffffb9 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb5
417 434 440 430 432 441 442 432 443 439 442 435

ghost commented 6 months ago

Thank you. By the way, how do I get rid of the ffffffffffffff in the output? It appears to come from the sign. In C, I can simply cast to unsigned to remove it, but Objeck doesn't offer unsigned types.
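For context, the leading ffffffffffffff is what sign extension produces when a negative 8-bit value is widened to 64 bits and printed in hexadecimal. The C sketch below reproduces the effect; the helper names are hypothetical and the code only mirrors the behavior discussed here, not Objeck's runtime.

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Formats `b` after sign-extending it to 64 bits, as happens when a
   signed byte is widened to a 64-bit signed integer. */
void hex_sign_extended(int8_t b, char *buf, size_t buf_len) {
    int64_t wide = b; /* -48 keeps its value, so the upper 56 bits become 1s */
    snprintf(buf, buf_len, "%" PRIx64, (uint64_t) wide);
}

/* Formats `b` after reinterpreting it as an unsigned byte (0..255),
   so the upper bits of the widened value are 0s. */
void hex_zero_extended(int8_t b, char *buf, size_t buf_len) {
    uint8_t u = (uint8_t) b; /* same eight bits, read as unsigned */
    snprintf(buf, buf_len, "%" PRIx64, (uint64_t) u);
}
```

With the byte value -48, the first helper produces the sign-extended form seen in the output above, and the second produces the two-digit form that a C cast to an unsigned type gives.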

objeck commented 6 months ago

Try the following; no cast is needed for a Byte. Casting a Byte to an Int widens the value with its sign preserved, so both positive and negative values (i.e., -128 to 127) carry over.

str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
    b->Print();
    " "->Print();
}

ghost commented 6 months ago

Then how do I get rid of the 00000000000000? They are as bad as the ffffffffffffff.

objeck commented 6 months ago

It's a byte; use the Byte type. The values are negative integers. Otherwise, convert the negative Byte values to positive Int values. For reference, try ToString(), and you will see the string representation of the stored bytes. Negative Int values printed in hexadecimal start with 0xffff because of two's complement.

https://onlinetools.com/unicode/convert-unicode-to-bytes

objeck commented 6 months ago

Correct, Objeck does not convert from signed to unsigned values. The language does not support unsigned values. If you must, try the following, though it's not ideal; otherwise, use the Byte type.

str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
    i := b->As(Int)->Abs()->ToString();
    i->ToInt()->PrintLine();
    " "->Print();
}

ghost commented 6 months ago

We are misunderstanding each other. I'm asking for a way to format the output, similar to C's printf. The magic is really the "%02x " format specifier of printf. Objeck has nothing equivalent. Yes, I know the ffffffffffffff and the 00000000000000 are always there, but I should be able to format the output to not show them.
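For reference, here is a small C sketch of the formatting being asked for. Note that `%02x` only sets a minimum width of two digits and never truncates, so in the C program earlier in the thread the clean output also depends on the `(unsigned char)` cast. The function name `bytes_to_hex` is hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes `len` signed bytes as space-separated,
   two-digit lowercase hex into `out`. `out` must hold at least
   3 * len characters (or 1 when len is 0), terminator included. */
void bytes_to_hex(const signed char *bytes, size_t len, char *out) {
    char *p = out;
    for (size_t i = 0; i < len; ++i) {
        if (i > 0) {
            *p++ = ' ';
        }
        /* The cast keeps only the eight bits of the byte, so a negative
           value is not sign-extended when it is promoted for printing. */
        p += sprintf(p, "%02x", (unsigned char) bytes[i]);
    }
    *p = '\0';
}
```

Given the signed bytes -48 and -105, this writes "d0 97", the form shown for the UTF-8 output earlier in the thread.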

objeck commented 6 months ago

Ok, opened enhancement #479 for that.

objeck commented 6 months ago

Try the following after building the latest code.

There's a call to c->ToHexString() to upcast a Byte or Char using the absolute value of the Byte. From there, you can convert the hex string into a Char or Int if you want to ignore signs between types.

class Test {
    function : Main(args : String[]) ~ Nil {
        str := "Здравствуйте";
        System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
        u8str := str->ToByteArray();
        each(i := u8str) {
            c : Byte := i;
            c->ToHexString()->Print();
            " "->Print();
        }
        ""->PrintLine();
        u16str := str->ToCharArray();
        each(i := u16str) {
            c : Int := i;
            c->Print();
            " "->Print();
        }
    }
}

ghost commented 6 months ago

Correct, Objeck does not convert from signed to unsigned values. The language does not support unsigned values. If you must, try the following, though it's not ideal; otherwise, use the Byte type.

str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
  i := b->As(Int)->Abs()->ToString();
  i->ToInt()->PrintLine();
  " "->Print();
}

This code is wrong. It will not print the correct output.

Output of this code:

10 5c 53 5e

Correct UTF-8:

https://dencode.com/string/hex?v=%F0%A4%AD%A2&oe=UTF-8&nl=crlf&separator-each=&case=lower

ghost commented 6 months ago

Try the following after building the latest code.

There's a call to c->ToHexString() to upcast a Byte or Char using the absolute value of the Byte. From there, you can convert the hex string into a Char or Int if you want to ignore signs between types.

class Test {
  function : Main(args : String[]) ~ Nil {
      str := "Здравствуйте";
      System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
      u8str := str->ToByteArray();
      each(i := u8str) {
          c : Byte := i;
          c->ToHexString()->Print();
          " "->Print();
      }
      ""->PrintLine();
      u16str := str->ToCharArray();
      each(i := u16str) {
          c : Int := i;
          c->Print();
          " "->Print();
      }
  }
}

ToHexString gives the same output as ToString:

-48 -105 -48 -76 -47 -128 -48 -80 -48 -78 -47 -127 -47 -126 -48 -78 -47 -125 -48 -71 -47 -126 -48 -75
-48 -105 -48 -76 -47 -128 -48 -80 -48 -78 -47 -127 -47 -126 -48 -78 -47 -125 -48 -71 -47 -126 -48 -75

Code:

class Test {
    function : Main(args : String[]) ~ Nil {
        str := "Здравствуйте";
        System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
        u8str := str->ToByteArray();
        each(i := u8str) {
            c : Byte := i;
            c->ToHexString()->Print();
            " "->Print();
        }
        ""->PrintLine();
        each(i := u8str) {
            c : Byte := i;
            c->ToString()->Print();
            " "->Print();
        }
    }
}

objeck commented 6 months ago

@iahung2, the code from yesterday needed to be completed.

I just wrapped it up, here's an example. Note changing the compiler and runtime takes time to implement and test.

ghost commented 6 months ago

Correct, Objeck does not convert from signed to unsigned values. The language does not support unsigned values. If you must, try the following, though it's not ideal; otherwise, use the Byte type.

str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
    i := b->As(Int)->Abs()->ToString();
    i->ToInt()->PrintLine();
    " "->Print();
}

This code is wrong. It will not print the correct output.

Output of this code:

10 5c 53 5e

Correct UTF-8:

https://dencode.com/string/hex?v=%F0%A4%AD%A2&oe=UTF-8&nl=crlf&separator-each=&case=lower

Update: From what I have read, the cast to unsigned char* in C simply reads the same bits as an unsigned value. So removing the sign by taking the absolute value, as you are doing, is wrong.
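The output 10 5c 53 5e above is consistent with taking the absolute value of each signed byte of f0 a4 ad a2 (the bytes in the dencode link). The C sketch below contrasts the two operations; the helper names are hypothetical.

```c
#include <stdlib.h>

/* Absolute value of a signed byte: yields a different bit pattern
   for negative inputs (for example, -16 becomes 16, which is 0x10). */
int byte_abs(signed char b) {
    return abs((int) b);
}

/* Unsigned view of the same byte: keeps all eight bits, which is the
   same as adding 256 to a negative value (-16 becomes 240, which is 0xf0). */
int byte_unsigned(signed char b) {
    return (int) (unsigned char) b;
}
```

Applied to -16, -92, -83 and -94, `byte_abs` gives 0x10, 0x5c, 0x53 and 0x5e, while `byte_unsigned` gives 0xf0, 0xa4, 0xad and 0xa2.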

ghost commented 6 months ago

@iahung2, the code from yesterday needed to be completed.

I just wrapped it up, here's an example. Note changing the compiler and runtime takes time to implement and test.

Permalink: https://github.com/objeck/objeck-lang/blob/fa736998e5eaf57a0639c3e84f4e4ba2f32255a5/programs/tests/prgm288.obs

Output:

Byte to Int
---
d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5

Byte to Hex String
---
ffffffffffffffd0 ffffffffffffff97 ffffffffffffffd0 ffffffffffffffb4 ffffffffffffffd1 ffffffffffffff80 ffffffffffffffd0 ffffffffffffffb0 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff81 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff83 ffffffffffffffd0 ffffffffffffffb9 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb5

Char to Int
---
417 434 440 430 432 441 442 432 443 439 442 435

My comment: I have lost track of the commits. I don't know what you have done with the Byte->ToInt() method to achieve this. But I don't think it's right. Byte is a signed type. Int is also a signed type. Casting from Byte to Int should preserve the sign. You hacked it to display the correct output for this very particular use case of mine. But it will fail for different use cases in the future.

objeck commented 6 months ago

Try, should work.

str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
  i := b->ToInt();
  i->PrintL();
  " "->Print();
}

ghost commented 6 months ago

Try, should work.

str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
  i := b->ToInt();
  i->Print();
  " "->Print();
}

So, you introduced a modified print method to make it print the expected output?

Update: There is no such method as PrintL. Compilation failed.

objeck commented 6 months ago

@iahung2, the code from yesterday needed to be completed. I just wrapped it up, here's an example. Note changing the compiler and runtime takes time to implement and test.

Permalink: https://github.com/objeck/objeck-lang/blob/fa736998e5eaf57a0639c3e84f4e4ba2f32255a5/programs/tests/prgm288.obs

Output:

Byte to Int
---
d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5

Byte to Hex String
---
ffffffffffffffd0 ffffffffffffff97 ffffffffffffffd0 ffffffffffffffb4 ffffffffffffffd1 ffffffffffffff80 ffffffffffffffd0 ffffffffffffffb0 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff81 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb2 ffffffffffffffd1 ffffffffffffff83 ffffffffffffffd0 ffffffffffffffb9 ffffffffffffffd1 ffffffffffffff82 ffffffffffffffd0 ffffffffffffffb5

Char to Int
---
417 434 440 430 432 441 442 432 443 439 442 435

My comment: I have lost track of the commits. I don't know what you have done with the Byte->ToInt() method to achieve this. But I don't think it's right. Byte is a signed type. Int is also a signed type. Casting from Byte to Int should preserve the sign. You hacked it to display the correct output for this very particular use case of mine. But it will fail for different use cases in the future.

Wow, this was not a hack and I am offended. The code does an unsigned upconversion: a Byte (1 byte) => wchar_t (2 or 4 bytes) => Int (8 bytes).

objeck commented 6 months ago

Try, should work.

str := "𤭢";
System.IO.Standard->SetIntFormat(System.Number->Format->HEX);
u8str := str->ToByteArray();
each(b in u8str) {
  i := b->ToInt();
  i->Print();
  " "->Print();
}

So, you introduced a modified print method to make it print the expected output?

Update: There is no such method as PrintL. Compilation failed.

It's not a hack; try i->Print();

ghost commented 6 months ago

Wow, this was not a hack and I am offended. The code does an unsigned upconversion: a Byte (1 byte) => wchar_t (2 or 4 bytes) => Int (8 bytes).

I'm sorry. I was confused because I thought it was a hack for my particular use case that could affect other use cases in the future. It turned out that I was wrong, and I'm glad about that. I don't know why you are so sensitive about the word hack, though.