michalmuskala / jason

A blazing fast JSON parser and generator in pure Elixir.

Jason.decode crashes when encountering complex emoji codepoints #147

Closed mirtyl-wacdec closed 1 year ago

mirtyl-wacdec commented 2 years ago

I was parsing a string containing the Unicode codepoints U+FE0F (Variation Selector-16) and U+200D (Zero Width Joiner), and Jason.decode errored out.

It seems to be more of an Elixir issue with those complex codepoints, since String.at(string, <position given by Jason>) returns naked binaries, e.g. <<179>>.

Not sure what to do, please help. String.replace(string, ~r/\x{FE0F}/u, "") takes forever; it basically hangs.

michalmuskala commented 2 years ago

I can't really reproduce. Both of the ways of encoding those codepoints (escaped and not) work just fine for me:

iex(1)> "\"\\uFE0F\""
"\"\\uFE0F\""
iex(2)> Jason.decode!("\"\\uFE0F\"")
"️"
iex(3)> "\"\uFE0F\"" <> <<0>>
<<34, 239, 184, 143, 34, 0>>
iex(4)> Jason.decode("\"\uFE0F\"")
{:ok, "️"}

Can you provide concrete reproduction steps?

mirtyl-wacdec commented 2 years ago

Thanks for the prompt response during the holidays. My concrete steps were fetching JSON data from an API and then attempting to parse it. This is odd: I could replicate the error. Writing the string to a file and then running Jason.decode after reading that file produced the same error.

But manually re-saving that file in my editor (Ctrl+S in VS Code) and then running Jason.decode on the re-saved file fixed the problem. I guess VS Code re-encodes the file on save.

I can't legally paste the content here and it probably wouldn't help if the act of moving the text around fixes the encoding issue. Not sure how to debug this without giving you direct access to the API.

michalmuskala commented 2 years ago

It's likely that the data you're receiving is not encoded in UTF-8. Jason only processes JSON data encoded in UTF-8, as defined by the latest standards. You could check with String.chunk(data, :valid): if it returns more than one element, some of the data is not valid UTF-8.
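
The check described above can be sketched as a small helper that also reports where the first invalid run starts. This is only an illustration: the module name `Utf8Check`, the function `first_invalid/1`, and the sample payload (a JSON string with a stray byte 179, matching the `<<179>>` reported earlier) are hypothetical and do not come from Jason or from the actual API response.

```elixir
defmodule Utf8Check do
  # Hypothetical helper, not part of Jason or Elixir's standard library.
  # Returns :ok when the whole binary is valid UTF-8, otherwise
  # {:error, byte_offset, invalid_bytes} for the first invalid run.
  def first_invalid(data) when is_binary(data) do
    data
    # Split into alternating runs of valid and invalid UTF-8.
    |> String.chunk(:valid)
    |> Enum.reduce_while(0, fn chunk, offset ->
      if String.valid?(chunk) do
        # Valid run: advance the byte offset and keep scanning.
        {:cont, offset + byte_size(chunk)}
      else
        # Invalid run: stop and report where it starts.
        {:halt, {:error, offset, chunk}}
      end
    end)
    |> case do
      {:error, _offset, _bytes} = error -> error
      _total_bytes -> :ok
    end
  end
end

# Assumed sample data: a JSON string containing one byte that is not valid UTF-8.
bad = <<34, 97, 179, 98, 34>>
# The codepoints from the original report, properly encoded as UTF-8.
good = "\"\uFE0F\u200D\""

IO.inspect(Utf8Check.first_invalid(bad))
IO.inspect(Utf8Check.first_invalid(good))
```

Running the helper on the raw response body before calling Jason.decode should show whether the failure comes from the bytes themselves rather than from the emoji codepoints.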

It should be possible to either attach the file here, or you can always send me the file over email to michal at muskala dot eu.

michalmuskala commented 1 year ago

Given no way of reproducing the issue, I'm going to close this.