"fast" vs. "safe" - Githubissues

jrudolph commented 9 years ago

Does the distinction really pulls its weight?

Arguments against:

the performance benefit of the safe API may be so small that it just isn't worth the additional logical overhead
having to convert things from fast to safe and vice versa puts a double cost on the safe variant: first, it's not fast by design, and secondly, you will often need to convert things from the fast variant
having two conflicting APIs makes it harder to provide full-featured APIs, would an implementation have to always accept both kinds of ASTs? Or, would it (automatically) fall back to calling toSafe?

mdedetrich commented 9 years ago

the performance benefit of the safe API may be so small that it just isn't worth the additional logical overhead

@sirthias Can provide more detailed feedback, but essentially when you are constructing JSON (i.e. you are parsing JSON), Array provides much better performance than any of the immutable data structures. @sirthias Provided benchmarks for this in the gitter channel

The tldr is, there was no datastructure that I could find which would make most people even remotely happy. The other solution here of course, is to not use a concrete API

having to convert things from fast to safe and vice versa puts a double cost on the safe variant: first, it's not fast by design, and secondly, you will often need to convert things from the fast variant

I believe the intention is that something like a JSON parser would use the fast variant, and then people would have a choice of whether to use the fast unsafe variant OR convert it to safe. There seems to be 2 use cases, which is what the fast/safe split is trying to represent. One is parsing JSON and representing it as an AST, and the other is querying/passing JSON around

having two conflicting APIs makes it harder to provide full-featured APIs, would an implementation have to always accept both kinds of ASTs? Or, would it (automatically) fall back to calling toSafe?

This is why .toSafe and .toFast methods were created, regardless of whether a library gives you a fast.JValue or a safe.JValue, the user is always able to jump between the 2, and library creators can't restrict this

From a historical perspective, trying to make a single API didn't make anyone happy, although I can definitely see merit in picking one (probably safe at this point), and let library authors use their own internal AST for speed. Issue here, of course is, there is less of a reason for them to support safe (if they use fast, safe is already given)

jrudolph commented 9 years ago

Maybe we should enumerate the additional safety the safe variant is supposed to provide:

JNumber.value is a String that points to a valid number
JObject.fields does not contain two fields with the same keys
JArray.elements returns an immutable and shareable data structure

Did I forget something?

IMO, we can hide all these insafeties if we make the AST abstract and just don't mention the optimizations. That way the optimized representation of the data can be private to the implementation.

In the normal user API and in the interaction with other libraries we don't lose any of the safety while an intra-library usage can still use the optimal representation to avoid costs if possible.

(To make things faster also between libraries - if that's a goal at all - we could still provide marker interfaces that would surface the underlying fast representation if we can agree on one.)

OlivierBlanvillain commented 9 years ago

One solution could be to use a super type of Vector[A] and Array[A] to leave the choice of the implementation to the user/parser. In the current stblib LUB(Vector[A], Array[A]) = Object, but WrappedArray[A] <: Seq[A] and Vector[A] <: Seq[A]. Since ASTs are ultimately about wrapping values into hierarchies of case classes I would say that wrapping Arrays is definetly worth having a single interface.

mdedetrich commented 9 years ago

Did I forget something?

Its also meant to be "safe" in regards to performance, that is, having lookups for basic use cases which are no worse than eC (effective constant) lookup time, this is given by using Vector and Map.

IMO, we can hide all these insafeties if we make the AST abstract and just don't mention the optimizations. That way the optimized representation of the data can be private to the implementation.

In the normal user API and in the interaction with other libraries we don't lose any of the safety while an intra-library usage can still use the optimal representation to avoid costs if possible.

That is true, would have to redo the design for this

(To make things faster also between libraries - if that's a goal at all - we could still provide marker interfaces that would surface the underlying fast representation if we can agree on one.)

The current fast one I think is one that @non and @sirthias agreed on, they cared about the performance aspects of the AST

One solution could be to use a super type of Vector[A] and Array[A] to leave the choice of the implementation to the user/parser. In the current stblib LUB(Vector[A], Array[A]) = Object, but WrappedArray[A] <: Seq[A] and Vector[A] <: Seq[A]. Since ASTs are ultimately about wrapping values into hierarchies of case classes I would say that wrapping Arrays is definetly worth having a single interface.

The problem with leaving it to the parser is that users don't have the gurantee that they are given a safe JValue (in much the same way that when we work with a String right now, we know its always a valid string. Weird stuff isn't going to happen at runtime because some parser created an invalid string, and they chose not to check the correctness of the string due to performance reasons)

OlivierBlanvillain commented 9 years ago

I make it more precise, the following AST can be used as JArray(List(JNull)), JArray(Vector(JNull)) and JArray(Array(JNull)):

sealed trait JValue
case object JNull extends JValue
case class JBool(value: Boolean) extends JValue
case class JNumber(value: String) extends JValue {
  def toDouble: Double = value.toDouble
}
case class JString(value: String) extends JValue
case class JArray(value: Seq[JValue]) extends JValue
case class JObject(value: Seq[(String, JValue)]) extends JValue

mdedetrich commented 9 years ago

Relevant discussion is also https://github.com/json4s/json4s-ast/issues/5 and https://github.com/json4s/json4s-ast/issues/9.

jrudolph commented 9 years ago

The problem with leaving it to the parser is that users don't have the gurantee that they are given a safe JValue (in much the same way that when we work with a String right now, we know its always a valid string. Weird stuff isn't going to happen at runtime)

I think we need to be very careful about our choices about being strictly safe versus being sensibly safe enough. We already accept that any reference may be null which is something we cannot annotate in the types. Considering the ultimate malicious parser implementor what unsafer thing could be done than packing nulls somewhere?

jrudolph commented 9 years ago

@OlivierBlanvillain I would go with immutable.IndexedSeq or even immutable.IndexedSeqOptimized which implies constant-time access. This is a somewhat clean interface where you only need to implement length and apply. This should be fast and can be backed by an array at which point there shouldn't be much of a performance cost at all.

jrudolph commented 9 years ago

I'll try to come up with a concrete suggestion tomorrow.

mdedetrich commented 9 years ago

I think we need to be very careful about our choices about being strictly safe versus being sensibly safe enough. We already accept that any reference may be null which is something we cannot annotate in the types.

I think nulls is an exception that we have to deal with, due to Java. If we had the choice, I think a very small minority of people would argue for null in its current form, so I am not sure its a completely valid comparison

From the people that are arguing for a safe version, the impression I got is that they wanted a JSON version of String, something that is immutable and sensibly correct. From my rudimentary knowledge of parsers, the odds of parsers that are purely focused on speed of creating something slightly wrong isn't a once in a blue moon kind of thing

Considering the ultimate malicious parser implementor what unsafer thing could be done than packing nulls somewhere?

There could be silent overflows, or things like Infinity appearing as a String in a JNumber. There are also the issues of duplicate keys which are being discussed in length elsewhere

At least personally I think the current level of safe is sensible. Its still not completely safe, BigDecimal will still swallow really large numbers, and Scala collections have a memory limit (so you can't story a really massive JSON).

@rossabaker , One of the current owners of json4s/http4s, was arguing heavily for Vector, so was @bryce-anderson. IndexedSeq defaults to Vector iirc anyways?

bryce-anderson commented 9 years ago

I don't recall one way or the other as to my past preference for Vector vs Whatever and withdraw any strong opinions other than it should be optimized for random access. There are strong cases for using Vector or IndexedSeq which have their respective strengths in different situations. If I had to testify under oath, I'd say Vector for simplicity/clarity and its actually (allegedly) pretty fast if you use the builders instead of :+ etc.

mdedetrich commented 9 years ago

Just letting you guys know, the default implementation of an IndexedSeq is a Vector

https://github.com/scala/scala/blob/v2.11.7/src/library/scala/collection/IndexedSeq.scala#L25-L29

I don't recall one way or the other as to my past preference for Vector vs Whatever and withdraw any strong opinions other than it should be optimized for random access.

I think it was mainly @rossabaker . Apologies if I was putting words into your mouth, wasn't intentional!

bryce-anderson commented 9 years ago

@mdedetrich, no worries, I very well could have said it, and I will say I do have a slight preference for Vector! However, as a counter point, using the abstract base of IndexSeq allows you do use whatever is most appropriate. Vector is used in the package object but that doesn't mean someone can't roll their own MyWrappedArray type etc if it is deemed appropriate for their use case. For some situations this could mean the difference between writing a wrapper class around an Array that can be kept private and avoiding a translation to Vector.

Ultimately the decision here comes down to some degree of personal preference/experience and specific use cases. Vector and IndexedSeq/IndexSeqOptimized would both be suitable IMO, my preference for Vector is grounded in knowing what I can expect performance wise and that it guarantees immutability (who says someone can't mutate their MyWrappedArray).

On the topic of the thread, I also find it difficult to swallow two AST's and if a choice needs to be made obviously the safe version is the most useful. I suspect someone who needs a "fast" AST might have their own specific needs anyway and even small deviations from their requirements will cause them to roll their own. _Note: this is just speculation on my part._

Btw @mdedetrich, I appreciate the effort you're putting into this. I noticed the SLIP. Its an important discussion to have whether the SLIP is accepted or not: it will help set a precedent for how future work in the standard library will be approached.

OlivierBlanvillain commented 9 years ago

On the topic of the thread, I also find it difficult to swallow two AST's and if a choice needs to be made obviously the safe version is the most useful. I suspect someone who needs a "fast" AST might have their own specific needs anyway and even small deviations from their requirements will cause them to roll their own.

I'm with you on this.

Regarding the choice of a type for JArray (assume no safe/fast separation) I really think that what matches best the design of the standard library in it's current state is Seq, or IndexedSeq if you really want to prevent people from using List (I don't really see a reason for that, but why not).

Note that while your fast design might be the fastest possible (without getting into manual memory management like they do in Netty/Spark), the safe is definitely not the safest, the following fails at runtime:

val v = Vector(1, 2)
println(v(2))

While this typo can be detected at compile time with shapeless:

val l = HList(1, 2)
println(l(2))

Scala has it's limitation, but the language and it's standard library were designed to be a compromise between speed and safety. I could be safer, could be faster, but I don't think a JSON AST should have to deal with these details by exposing two API. There is a way to do it in a reasonably safe way with basically no performance lost, it's a Seq/IndexedSeq backed by an Array.

json4s / json4s-ast

"fast" vs. "safe" #21