michalmuskala / jason

A blazing fast JSON parser and generator in pure Elixir.

JSON parsing issue #138

Closed: nacengineer closed this issue 3 years ago

nacengineer commented 3 years ago

When trying to parse the following JSON:

{"issue":{"description":"Lorem Ipsum\r\n\"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...\"\r\n\"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain...\"\r\nWhat is Lorem Ipsum?\r\nLorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.\r\n\r\nWhy do we use it?\r\nIt is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).\r\n\r\n\r\nWhere does it come from?\r\nContrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. 
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of \"de Finibus Bonorum et Malorum\" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, \"Lorem ipsum dolor sit amet..\", comes from a line in section 1.10.32.\r\n\r\nThe standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from \"de Finibus Bonorum et Malorum\" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham.\r\n\r\nWhere can I get some?\r\nThere are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc."}}

I got this text from a Lorem Ipsum generator page. Both Poison and Jason blow up when decoding it. I believe it's due to the Windows line endings?

I'm not exactly sure if this is a bug or something else.

I noticed Poison is kind of MIA w.r.t. its issue tracker, so I figured I'd post it here, as your lib seems much more responsive.

michalmuskala commented 3 years ago

I cannot reproduce it; it decodes for me just fine.

iex(1)> Mix.install([:jason])
:ok
iex(2)> Jason.decode!(File.read!("foo.json"))
%{
  "issue" => %{
    "description" => "Lorem Ipsum\r\n\"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...\"\r\n\"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain...\"\r\nWhat is Lorem Ipsum?\r\nLorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.\r\n\r\nWhy do we use it?\r\nIt is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).\r\n\r\n\r\nWhere does it come from?\r\nContrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. 
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of \"de Finibus Bonorum et Malorum\" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, \"Lorem ipsum dolor sit amet..\", comes from a line in section 1.10.32.\r\n\r\nThe standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from \"de Finibus Bonorum et Malorum\" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham.\r\n\r\nWhere can I get some?\r\nThere are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc."
  }
}

I added the concrete source I used to the gist: https://gist.github.com/michalmuskala/ddcd077cca58a01a0d2590c455399a45

nacengineer commented 3 years ago

hmm, could it be OS specific or iex/erl specific?

FWIW I'm on macOS newest patch with

Erlang/OTP 24 [erts-12.0.3] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit] [dtrace]

Interactive Elixir (1.12.2) - press Ctrl+C to exit (type h() ENTER for help)

The error I get is

iex(2)> json |> Jason.decode!()
** (Jason.DecodeError) unexpected byte at position 25: 0x4C ('L')
    (jason 1.2.2) lib/jason.ex:78: Jason.decode!/2

Also, I wonder if File.read cleared up the line endings for you, since I think they are what is causing this. I'm doing it in the console: this is a string pulled down from an API, so there is no intermediate file write (although that's probably a good workaround, now that I think about it).

This is how I try parsing it:

json = ~s|<jsonstring>|
json |> Jason.decode!()

nacengineer commented 3 years ago

Follow-up: I can confirm that if you paste the raw JSON into a file and save that file, the issue clears, presumably because the file operation fixes the line endings.

Keep in mind this is probably a pretty wacky corner case around escaping, as I basically copied this from this website and pasted it into a text box on Slack (Slack API spelunking) for upload to a Redmine API. Redmine didn't choke on it, and both Redmine and Postman parse the JSON fine. It's only when I try decoding it in Elixir that it blows up.

michalmuskala commented 3 years ago

I really don't think this has anything to do with line endings. The reported position, 25, falls in the string before any line ending. In general, JSON does not interpret line endings in any way, so this seems especially unlikely to be the issue.

michalmuskala commented 3 years ago

OK, I looked into it again. The issue is indeed the line endings. There was a bug in how errors were reported, fixed in https://github.com/michalmuskala/jason/commit/25fd65ed1a5cbb52d719145cdcfd622662521142.

The actual issue is not with Jason, though, but with how you paste the data into iex. ~s causes Elixir itself to interpret the escape sequences inside the string, so you end up with literal \n and \r bytes (actual line feed and carriage return characters) inside the JSON string, which the standard does not allow. What you want is ~S, which puts the raw, uninterpreted bytes into the string. That leaves the 2-byte escape sequences (\ and n, or \ and r) intact, and Jason later decodes them according to the standard.
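
The difference between ~s and ~S described above can be seen without Jason at all. The snippet below is a minimal sketch: the shortened payload (key "a", value "x\r\ny") and the variable names are made up for illustration and stand in for the original text.

```elixir
# ~s interprets escapes: "\r" and "\n" each collapse into a single control byte.
interpreted = ~s|{"a":"x\r\ny"}|

# ~S keeps the text verbatim: backslash + r and backslash + n stay two bytes each.
raw = ~S|{"a":"x\r\ny"}|

# The interpreted string is two bytes shorter (12 vs 14).
IO.inspect(byte_size(interpreted), label: "~s byte size")
IO.inspect(byte_size(raw), label: "~S byte size")

# Only the interpreted string contains real CR LF bytes (13, 10),
# which are not allowed unescaped inside a JSON string.
IO.inspect(String.contains?(interpreted, <<13, 10>>), label: "~s has raw CR LF")
IO.inspect(String.contains?(raw, <<13, 10>>), label: "~S has raw CR LF")
```

With Jason installed, passing `interpreted` to Jason.decode/1 should return an error tuple, while `raw` should decode to a map whose value contains the real carriage return and line feed characters.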

nacengineer commented 3 years ago

ah! so a me problem. Which explains why it happened both in Jason and Poison. Thanks for looking back into it and educating me! 🙂👍