michalmuskala / jason

A blazing fast JSON parser and generator in pure Elixir.

Error decoding charlist with accentuated characters #128

Closed 3duard0 closed 3 years ago

3duard0 commented 3 years ago

Some accented characters seem to cause an error when decoding JSON passed as a charlist.

# This is the same as {"username":"ciuça"}
data = [123, 34, 117, 115, 101, 114, 110, 97, 109, 101, 34, 58, 34, 99, 105, 117, 231, 97, 34, 125]

# If I convert data to string there's no error while decoding
data |> to_string |> Jason.decode!()
# %{"username" => "ciuça"}

# If I try to decode without converting, the parser raises an error
data |> Jason.decode!()
# ** (Jason.DecodeError) unexpected byte at position 16: 0xE7
#    (jason 1.2.2) lib/jason.ex:78: Jason.decode!/2

If I remove the offending character (ç), it works:

# This is the same as {"username":"ciua"}
data = [123, 34, 117, 115, 101, 114, 110, 97, 109, 101, 34, 58, 34, 99, 105, 117, 97, 34, 125]

# If I convert data to string there's no error while decoding
data |> to_string |> Jason.decode!()
# %{"username" => "ciua"}

# If I decode without converting, it also works
data |> Jason.decode!()
# %{"username" => "ciua"}

3duard0 commented 3 years ago

There is also an error occurring in Jason.encode!(), but I could not reproduce it. I think it happens for similar reasons.

3duard0 commented 3 years ago

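It seems to be related to how some characters are encoded in strings. In a string, the character ç is encoded as the two UTF-8 bytes <<195, 167>>, whereas the charlist holds the single integer 231 (the Unicode code point of ç, which matches its Latin-1 value; it is outside the ASCII range).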
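The difference described above can be checked directly. This is a minimal sketch using only the Elixir standard library; the variable names are illustrative, and it assumes the source file is saved as UTF-8 (the Elixir default). Each pattern match raises if the stated value is wrong.

```elixir
# The code point of ç is 231 (U+00E7), above the ASCII range 0..127.
codepoint = ?ç
231 = codepoint

# In a UTF-8 encoded Elixir string, ç occupies two bytes.
utf8_bytes = :binary.bin_to_list("ç")
[195, 167] = utf8_bytes

# A charlist holds code points, not bytes, so ç is a single integer.
charlist = String.to_charlist("ç")
[231] = charlist
```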

michalmuskala commented 3 years ago

The input to Jason.decode! is defined as iodata encoding UTF-8. This means that in a list of raw integers, those integers represent bytes, not Unicode code points. Jason does not support decoding from chardata (where raw integers represent code points), because that format is not very common for on-the-wire communication.
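To illustrate the distinction between iodata and chardata, here is a small sketch that reads the same one-element list both ways. It uses only standard-library functions and does not call Jason; the variable names are illustrative.

```elixir
# The single-element list [231], read two different ways.
list = [231]

# As iodata, 231 is one raw byte. 0xE7 on its own is not valid UTF-8,
# which is consistent with the "unexpected byte ... 0xE7" error reported above.
as_bytes = IO.iodata_to_binary(list)
<<231>> = as_bytes
false = String.valid?(as_bytes)

# As chardata, 231 is the code point U+00E7, which UTF-8 encodes as two bytes.
as_chars = IO.chardata_to_string(list)
<<195, 167>> = as_chars
true = String.valid?(as_chars)
```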

You can handle chardata correctly by converting it to iodata first. The simplest ways to do that are List.to_string/1 or the polymorphic to_string/1. If you need to keep it as a list, there's probably a function in the :unicode module that could be used for the conversion (though internally Jason operates on binaries only anyway).
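A possible sketch of the conversions mentioned above, applied to the charlist from the original report. The choice of :unicode.characters_to_binary/1 and :binary.bin_to_list/1 is an assumption about which functions fit here, not something confirmed in this thread. The Jason calls are guarded so the snippet also runs where Jason is not installed.

```elixir
# The charlist from the report: code points for {"username":"ciuça"}
data = [123, 34, 117, 115, 101, 114, 110, 97, 109, 101, 34, 58, 34, 99, 105, 117, 231, 97, 34, 125]

# Option 1: convert the code points to a UTF-8 binary.
binary = List.to_string(data)
true = binary == ~s({"username":"ciuça"})

# The same conversion through the Erlang :unicode module.
true = :unicode.characters_to_binary(data) == binary

# Option 2: keep a list, but a list of UTF-8 bytes rather than code points.
byte_list = :binary.bin_to_list(binary)
[195, 167] = Enum.slice(byte_list, 16, 2)

# Only attempted when Jason is available in the running environment.
if Code.ensure_loaded?(Jason) do
  %{"username" => "ciuça"} = Jason.decode!(binary)
  %{"username" => "ciuça"} = Jason.decode!(byte_list)
end
```

Since Jason works on binaries internally, Option 1 is likely the simpler choice; Option 2 only matters if other code needs the data to stay a list.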