Open jrudolph opened 9 years ago
the performance benefit of the safe API may be so small that it just isn't worth the additional logical overhead
@sirthias Can provide more detailed feedback, but essentially when you are constructing JSON (i.e. you are parsing JSON), Array provides much better performance than any of the immutable data structures. @sirthias Provided benchmarks for this in the gitter channel
The tldr is, there was no datastructure that I could find which would make most people even remotely happy. The other solution here of course, is to not use a concrete API
having to convert things from fast to safe and vice versa puts a double cost on the safe variant: first, it's not fast by design, and secondly, you will often need to convert things from the fast variant
I believe the intention is that something like a JSON parser would use the fast variant, and then people would have a choice of whether to use the fast unsafe variant OR convert it to safe. There seems to be 2 use cases, which is what the fast/safe split is trying to represent. One is parsing JSON and representing it as an AST, and the other is querying/passing JSON around
having two conflicting APIs makes it harder to provide full-featured APIs, would an implementation have to always accept both kinds of ASTs? Or, would it (automatically) fall back to calling toSafe?
This is why .toSafe
and .toFast
methods were created, regardless of whether a library gives you a fast.JValue
or a safe.JValue
, the user is always able to jump between the 2, and library creators can't restrict this
From a historical perspective, trying to make a single API didn't make anyone happy, although I can definitely see merit in picking one (probably safe
at this point), and let library authors use their own internal AST for speed. Issue here, of course is, there is less of a reason for them to support safe
(if they use fast
, safe
is already given)
Maybe we should enumerate the additional safety the safe
variant is supposed to provide:
Did I forget something?
IMO, we can hide all these insafeties if we make the AST abstract and just don't mention the optimizations. That way the optimized representation of the data can be private to the implementation.
In the normal user API and in the interaction with other libraries we don't lose any of the safety while an intra-library usage can still use the optimal representation to avoid costs if possible.
(To make things faster also between libraries - if that's a goal at all - we could still provide marker interfaces that would surface the underlying fast representation if we can agree on one.)
One solution could be to use a super type of Vector[A]
and Array[A]
to leave the choice of the implementation to the user/parser. In the current stblib LUB(Vector[A], Array[A]) = Object
, but WrappedArray[A] <: Seq[A]
and Vector[A] <: Seq[A]
. Since ASTs are ultimately about wrapping values into hierarchies of case classes I would say that wrapping Arrays is definetly worth having a single interface.
Did I forget something?
Its also meant to be "safe" in regards to performance, that is, having lookups for basic use cases which are no worse than eC (effective constant) lookup time, this is given by using Vector
and Map
.
IMO, we can hide all these insafeties if we make the AST abstract and just don't mention the optimizations. That way the optimized representation of the data can be private to the implementation.
In the normal user API and in the interaction with other libraries we don't lose any of the safety while an intra-library usage can still use the optimal representation to avoid costs if possible.
That is true, would have to redo the design for this
(To make things faster also between libraries - if that's a goal at all - we could still provide marker interfaces that would surface the underlying fast representation if we can agree on one.)
The current fast one I think is one that @non and @sirthias agreed on, they cared about the performance aspects of the AST
One solution could be to use a super type of Vector[A] and Array[A] to leave the choice of the implementation to the user/parser. In the current stblib LUB(Vector[A], Array[A]) = Object, but WrappedArray[A] <: Seq[A] and Vector[A] <: Seq[A]. Since ASTs are ultimately about wrapping values into hierarchies of case classes I would say that wrapping Arrays is definetly worth having a single interface.
The problem with leaving it to the parser is that users don't have the gurantee that they are given a safe JValue
(in much the same way that when we work with a String
right now, we know its always a valid string. Weird stuff isn't going to happen at runtime because some parser created an invalid string, and they chose not to check the correctness of the string due to performance reasons)
I make it more precise, the following AST can be used as JArray(List(JNull))
, JArray(Vector(JNull))
and JArray(Array(JNull))
:
sealed trait JValue
case object JNull extends JValue
case class JBool(value: Boolean) extends JValue
case class JNumber(value: String) extends JValue {
def toDouble: Double = value.toDouble
}
case class JString(value: String) extends JValue
case class JArray(value: Seq[JValue]) extends JValue
case class JObject(value: Seq[(String, JValue)]) extends JValue
Relevant discussion is also https://github.com/json4s/json4s-ast/issues/5 and https://github.com/json4s/json4s-ast/issues/9.
The problem with leaving it to the parser is that users don't have the gurantee that they are given a safe JValue (in much the same way that when we work with a String right now, we know its always a valid string. Weird stuff isn't going to happen at runtime)
I think we need to be very careful about our choices about being strictly safe versus being sensibly safe enough. We already accept that any reference may be null
which is something we cannot annotate in the types. Considering the ultimate malicious parser implementor what unsafer thing could be done than packing nulls somewhere?
@OlivierBlanvillain I would go with immutable.IndexedSeq
or even immutable.IndexedSeqOptimized
which implies constant-time access. This is a somewhat clean interface where you only need to implement length
and apply
. This should be fast and can be backed by an array at which point there shouldn't be much of a performance cost at all.
I'll try to come up with a concrete suggestion tomorrow.
I think we need to be very careful about our choices about being strictly safe versus being sensibly safe enough. We already accept that any reference may be null which is something we cannot annotate in the types.
I think nulls is an exception that we have to deal with, due to Java. If we had the choice, I think a very small minority of people would argue for null in its current form, so I am not sure its a completely valid comparison
From the people that are arguing for a safe version, the impression I got is that they wanted a JSON version of String, something that is immutable and sensibly correct. From my rudimentary knowledge of parsers, the odds of parsers that are purely focused on speed of creating something slightly wrong isn't a once in a blue moon kind of thing
Considering the ultimate malicious parser implementor what unsafer thing could be done than packing nulls somewhere?
There could be silent overflows, or things like Infinity appearing as a String in a JNumber. There are also the issues of duplicate keys which are being discussed in length elsewhere
At least personally I think the current level of safe is sensible. Its still not completely safe, BigDecimal will still swallow really large numbers, and Scala collections have a memory limit (so you can't story a really massive JSON).
@rossabaker , One of the current owners of json4s/http4s, was arguing heavily for Vector, so was @bryce-anderson. IndexedSeq defaults to Vector iirc anyways?
I don't recall one way or the other as to my past preference for Vector
vs Whatever
and withdraw any strong opinions other than it should be optimized for random access. There are strong cases for using Vector
or IndexedSeq
which have their respective strengths in different situations. If I had to testify under oath, I'd say Vector
for simplicity/clarity and its actually (allegedly) pretty fast if you use the builders instead of :+
etc.
Just letting you guys know, the default implementation of an IndexedSeq
is a Vector
https://github.com/scala/scala/blob/v2.11.7/src/library/scala/collection/IndexedSeq.scala#L25-L29
I don't recall one way or the other as to my past preference for Vector vs Whatever and withdraw any strong opinions other than it should be optimized for random access.
I think it was mainly @rossabaker . Apologies if I was putting words into your mouth, wasn't intentional!
@mdedetrich, no worries, I very well could have said it, and I will say I do have a slight preference for Vector
! However, as a counter point, using the abstract base of IndexSeq
allows you do use whatever is most appropriate. Vector
is used in the package object but that doesn't mean someone can't roll their own MyWrappedArray
type etc if it is deemed appropriate for their use case. For some situations this could mean the difference between writing a wrapper class around an Array
that can be kept private and avoiding a translation to Vector
.
Ultimately the decision here comes down to some degree of personal preference/experience and specific use cases. Vector
and IndexedSeq
/IndexSeqOptimized
would both be suitable IMO, my preference for Vector
is grounded in knowing what I can expect performance wise and that it guarantees immutability (who says someone can't mutate their MyWrappedArray
).
On the topic of the thread, I also find it difficult to swallow two AST's and if a choice needs to be made obviously the safe version is the most useful. I suspect someone who needs a "fast" AST might have their own specific needs anyway and even small deviations from their requirements will cause them to roll their own. _Note: this is just speculation on my part._
Btw @mdedetrich, I appreciate the effort you're putting into this. I noticed the SLIP. Its an important discussion to have whether the SLIP is accepted or not: it will help set a precedent for how future work in the standard library will be approached.
On the topic of the thread, I also find it difficult to swallow two AST's and if a choice needs to be made obviously the safe version is the most useful. I suspect someone who needs a "fast" AST might have their own specific needs anyway and even small deviations from their requirements will cause them to roll their own.
I'm with you on this.
Regarding the choice of a type for JArray
(assume no safe/fast separation) I really think that what matches best the design of the standard library in it's current state is Seq
, or IndexedSeq
if you really want to prevent people from using List
(I don't really see a reason for that, but why not).
Note that while your fast design might be the fastest possible (without getting into manual memory management like they do in Netty/Spark), the safe is definitely not the safest, the following fails at runtime:
val v = Vector(1, 2)
println(v(2))
While this typo can be detected at compile time with shapeless:
val l = HList(1, 2)
println(l(2))
Scala has it's limitation, but the language and it's standard library were designed to be a compromise between speed and safety. I could be safer, could be faster, but I don't think a JSON AST should have to deal with these details by exposing two API. There is a way to do it in a reasonably safe way with basically no performance lost, it's a Seq
/IndexedSeq
backed by an Array
.
Does the distinction really pulls its weight?
Arguments against:
toSafe
?