uwiger / sext

Sortable Erlang Term Serialization
Apache License 2.0
105 stars 51 forks source link

Unicode atoms are not supported #40

Open seriyps opened 1 month ago

seriyps commented 1 month ago

I don't think it is a particularly popular feature, but would be nice to have it just for completeness:

> kdb_sext:encode('привет').
** exception error: bad argument
     in function  list_to_binary/1
        called as list_to_binary([1087,1088,1080,1074,1077,1090])
        *** argument 1: not an iolist term
     in call from mnesia_eleveldb_sext:encode_atom/1 (/Users/***/src/mnesia_eleveldb_sext.erl, line 374)

(note, this is our internal fork of sext, differences are minimal)

I believe smth like this should solve the problem? Can open a PR if you don't see any obvious problem with this.

diff --git a/src/mnesia_eleveldb_sext.erl b/src/mnesia_eleveldb_sext.erl
index 0a7409b..337e794 100644
--- a/src/mnesia_eleveldb_sext.erl
+++ b/src/mnesia_eleveldb_sext.erl
@@ -371,7 +371,7 @@ encode_ref(Name, Rest) ->
     <<?reference, NameEnc/binary, RestEnc/binary>>.

 encode_atom(A) ->
-    Bin = list_to_binary(atom_to_list(A)),
+    Bin = atom_to_binary(A, utf8),
     Enc = encode_bin_elems(Bin),
     <<?atom, Enc/binary>>.

@@ -760,7 +760,7 @@ partial_decode(Other) ->

 decode_atom(B) ->
     {Bin, Rest} = decode_binary(B),
-    {list_to_atom(binary_to_list(Bin)), Rest}.
+    {binary_to_atom(Bin, utf8), Rest}.

 decode_tuple(Sz, Elems) ->
     decode_tuple(Sz,Elems,[]).

At least it works for decoding/encoding for me with this patch. Didn't check if ordering is still correct, guess it should (utf-8):

> kdb_sext:decode(kdb_sext:encode('привет')).
'привет'
> kdb_sext:decode(kdb_sext:encode(hello)).
hello
uwiger commented 1 month ago

Given the core mission of sext (preserving sort order), the question is whether there is any one way to do it re. Unicode (I guess the general answer to that question for Unicode is 'no').

From the Erlang docs:

Note that UTF-8 is not compatible with bytewise representation for code points between 128 and 255, so a ISO-Latin-1 bytewise representation is not generally compatible with UTF-8.

Perhaps an option would need to be introduced ...