sirthias / borer

Efficient CBOR and JSON (de)serialization in Scala
https://sirthias.github.io/borer/
Mozilla Public License 2.0
224 stars · 14 forks

Scala.js – How to decode JS byte array types? #675

Closed raquo closed 1 year ago

raquo commented 1 year ago

Hello, thank you for such a nice library! The docs are great too.

I am trying to use Borer in a Scala.js project, specifically to parse JSON API responses that were encoded with Borer on the backend. The best I can figure out is this:

  import io.bullet.borer.*
  import io.bullet.borer.derivation.MapBasedCodecs.*
  import org.scalajs.dom
  import java.nio.charset.StandardCharsets
  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global

  case class Foo(id: String)

  given Codec[Foo] = deriveCodec

  val result: Future[Foo] = dom.Fetch.fetch("/api/foo.json").toFuture
    .flatMap { (response: dom.Response) =>
      response.text().toFuture
    }.map { (str: String) =>
      Json.decode(str.getBytes(StandardCharsets.UTF_8)).to[Foo].value
    }

However, this is not ideal: it first decodes the raw UTF-8 response bytes into a UTF-16 JavaScript string (.text()), then re-encodes that string into an array of UTF-8 bytes (str.getBytes(StandardCharsets.UTF_8)), which is what Borer expects (...right?). I'm actually not sure that this round trip is strictly correct in terms of encodings, but it seems to work at first glance.

This seems inefficient, because the dom.Response class already provides a js.typedarray.ArrayBuffer, and I can build a js.typedarray.Uint8Array from it, which is roughly the JavaScript equivalent of scala.Array[Byte].

But, while Borer can decode scala.Array[Byte] thanks to the ByteAccess.ForByteArray implicit instance, there is no similar instance of ByteAccess[js.typedarray.Uint8Array]. I don't feel solid enough about all this to implement it myself, and I'm not sure it would even provide a performance benefit. As far as I can see, I would need to implement a toByteArray method that converts js.typedarray.Uint8Array to scala.Array[Byte] — and if I had such a conversion available, I could just call it on the js.typedarray.Uint8Array myself and use the ByteAccess.ForByteArray implicit instance, right? Like so:

  val result: Future[Foo] = dom.Fetch.fetch("/api/foo.json").toFuture
    .flatMap { (response: dom.Response) =>
      response.arrayBuffer().toFuture
    }.map { (arrayBuffer: js.typedarray.ArrayBuffer) =>
      val jsByteArray = new js.typedarray.Uint8Array(arrayBuffer)
      val numBytes = arrayBuffer.byteLength
      val byteArray = new Array[Byte](numBytes)
      var i = 0
      while (i < numBytes) {
        byteArray(i) = jsByteArray(i).toByte // values 0-255; .toByte reinterprets them as signed, preserving the raw bits
        i += 1
      }
      Json.decode(byteArray).to[Foo].value
    }

So I guess the issue is about adding some built-in way to use JS types, or at least documenting how to do that, because that's quite a bit more boilerplate than is needed for Scala and Java types.
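Edit: if I'm reading the Scala.js sources right, there may already be a converter in scala.scalajs.js.typedarray that would replace the manual loop. An untested sketch (assuming the int8Array2ByteArray converter exists with this shape; this only compiles under Scala.js):

```scala
import scala.scalajs.js.typedarray.{ArrayBuffer, Int8Array, int8Array2ByteArray}

// Sketch: view the ArrayBuffer through the signed Int8Array (the view that
// matches Array[Byte]'s element type) and convert it in one call.
def arrayBufferToBytes(buf: ArrayBuffer): Array[Byte] =
  int8Array2ByteArray(new Int8Array(buf))
```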

mushtaq commented 1 year ago

@raquo I had done something similar a long time ago and it looked like this:

import scala.scalajs.js.typedarray.*

val result: Future[Foo] = dom.Fetch.fetch("/api/foo.json").toFuture
  .flatMap { (response: dom.Response) =>
    response.arrayBuffer().toFuture
  }
  .map { (buf: ArrayBuffer) =>
    val byteBuffer = TypedArrayBuffer.wrap(buf)
    val bytes      = new Array[Byte](byteBuffer.remaining())
    byteBuffer.get(bytes)
    Json.decode(bytes).to[Foo].value
  }

mushtaq commented 1 year ago

I think even this should work:

val result: Future[Foo] = dom.Fetch.fetch("/api/foo.json").toFuture
  .flatMap { (response: dom.Response) =>
    response.arrayBuffer().toFuture
  }
  .map { (buf: ArrayBuffer) =>
    Json.decode(TypedArrayBuffer.wrap(buf)).to[Foo].value
  }

raquo commented 1 year ago

Thanks! I actually had an older version of scala-js checked out locally and didn't notice this new TypedArrayBuffer helper!

I looked into how byteBuffer.get(bytes) is implemented, and it seems to copy each element in a loop similar to my while loop, but with more abstraction. I haven't benchmarked anything, but I'll assume the raw JS while loop is a bit faster, or at least not slower.

TypedArrayBuffer.wrap(buf) looks promising, but somehow it does not work: Borer fails to parse the bytes. This:

  case class Foo(id: String)
  given Codec[Foo] = deriveCodec

  dom.Fetch.fetch("/foo.json").toFuture
    .flatMap { (response: dom.Response) =>
      response.arrayBuffer().toFuture
    }
    .map { (buffer: js.typedarray.ArrayBuffer) =>
      val tab = js.typedarray.TypedArrayBuffer.wrap(buffer)
      try {
        Json.decode(tab).to[Foo].value
      } catch {
        case err: Throwable => println(err)
      }
    }

When given foo.json with content {"id": "foo"} (+ newline before EOF), this prints out:

io.bullet.borer.Borer$Error$InvalidInputData: Expected ',' or '}' but got 'o' (input position 9)

I tried to debug this, but I'm not sure what's wrong: if I manually ask the typed array buffer for its contents (tab.get(ix)), it gives me the expected bytes: 123 34 105 100 34 58 32 34 102 111 111 34 125 10.

I'll stick to manual copying in a loop for now, I guess.

sirthias commented 1 year ago

When given foo.json with content {"id": "foo"} (+ newline before EOF), this prints out:

io.bullet.borer.Borer$Error$InvalidInputData: Expected ',' or '}' but got 'o' (input position 9)

I tried to debug this but not sure what's wrong, if I manually ask the typed array buffer for its contents (tab.get(ix)), it gives me the expected bytes: 123 34 105 100 34 58 32 34 102 111 111 34 125 10.

Hmm... this is strange. Can you post the debugging output that you get when you enable printLogging, like this?

Json.decode(tab).withPrintLogging().to[Foo].value
raquo commented 1 year ago

Oh, debug mode, nice! That's what it prints when the TypedArrayBuffer version fails:

    1| *{ 
    1|     "of" 
    1|     -> ""

And if I use my while logic, which does work, it gives me this:

    1| *{
    1|     "id"
    1|     -> "foo"
    1| }
    2| END

Seeing a hint of reversal ("of"), I thought on a whim that maybe a palindromic JSON payload ({"id":"di"}) would work, but no, that fails too.

sirthias commented 1 year ago

Ok, that looks like an encoding and/or byte order mismatch. I haven't really done much with Scala.js myself so far, so I'm afraid I won't be of much help. But these untyped number arrays the JS platform gives us sure are soo nice! They make things so much easier! No ugly typing issues to think about. Everything just magically works. NOT! </rant>

raquo commented 1 year ago

Well, I've reduced the reproduction down to:

    import io.bullet.borer.*
    import io.bullet.borer.derivation.MapBasedCodecs.*
    import org.scalajs.dom
    import java.nio.ByteBuffer
    import scala.scalajs.js
    import scala.scalajs.js.JSConverters.*

    case class Foo(id: String)

    given Codec[Foo] = deriveCodec

    println("---V1")

    val str = """{"id": "foo"}"""
    val strByteArray = str.getBytes

    val byteBuffer = ByteBuffer.wrap(strByteArray)
    Json.decode(byteBuffer).withPrintLogging().to[Foo].value // Works

    println("---V2")

    val x = strByteArray.toJSArray
    println(">>" + x) // all good
    val int8Array = new js.typedarray.Int8Array(x)
    dom.console.log(int8Array) // all good
    val buffer = js.typedarray.TypedArrayBuffer.wrap(int8Array.buffer)
    {
      // Optional block for debugging. Same result with or without it
      var s = ""
      while (buffer.hasRemaining) {
        val b = buffer.get()
        s += b.toChar
      }
      println("> manually iterated: " + s)
      buffer.rewind()
    }
    Json.decode(buffer).withPrintLogging().to[Foo].value // Fails

V1, which works, uses a HeapByteBuffer, whereas V2, which fails, uses a TypedArrayByteBuffer. Both are Java subclasses of ByteBuffer that are re-implemented in Scala.js without looking at the Java source code (see here).

If there is nothing wrong in Borer, I suspect there could be a bug in the Scala.js implementation of TypedArrayByteBuffer. I can't see a difference in the bytes the two buffers produce when just using the .get method, so it must be some other method.

Is there perhaps a simpler Borer call that I could give the Scala.js folks for debugging, instead of the macro-based Json.decode(buffer).withPrintLogging().to[Foo].value, to help them see which methods on ByteBuffer are being called by Borer?

sirthias commented 1 year ago

Ok, just looked into this. It's indeed a ByteOrder problem. If you change the final lines to this:

buffer.order(ByteOrder.BIG_ENDIAN)
Json.decode(buffer).withPrintLogging().to[Foo].value // Works now

things start working.

borer's FromByteBufferInput requires the ByteBuffer to have BIG_ENDIAN byte order but neither sets nor verifies this, which isn't good.

My current feeling is that it'd be best to simply set the byte order to BIG_ENDIAN before starting to parse. If it was LITTLE_ENDIAN before, then borer will have changed it, but one can simply reset it to LITTLE_ENDIAN manually afterwards if one really wants to reuse the buffer.
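To illustrate the failure mode (a minimal JVM-only sketch, independent of borer): an input reader that consumes multi-byte chunks sees each chunk reversed under the wrong byte order, which matches the reversed "of" in the debug log above:

```scala
import java.nio.{ByteBuffer, ByteOrder}

object ByteOrderDemo {
  // Render an Int's 4 bytes as characters, high byte first
  def chars(i: Int): String =
    (3 to 0 by -1).map(s => ((i >>> (8 * s)) & 0xff).toChar).mkString

  def main(args: Array[String]): Unit = {
    val bytes = "{\"id".getBytes("UTF-8") // 0x7B 0x22 0x69 0x64

    val big    = ByteBuffer.wrap(bytes).getInt()                                // wrap defaults to BIG_ENDIAN
    val little = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt() // same bytes, other order

    println(chars(big))    // {"id  -> bytes appear in stream order
    println(chars(little)) // di"{  -> each 4-byte chunk comes out reversed
  }
}
```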

sirthias commented 1 year ago

One could also question the fact that js.typedarray.TypedArrayBuffer.wrap creates a buffer with the native byte order while java.nio.ByteBuffer.wrap always creates a BIG_ENDIAN buffer. Maybe that's worth a ticket on the scala.js side.
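The java.nio half of that difference is easy to check on the JVM (sketch; the TypedArrayBuffer half only exists under Scala.js, so it appears here only as a comment):

```scala
import java.nio.{ByteBuffer, ByteOrder}

object WrapOrderDemo {
  def main(args: Array[String]): Unit = {
    // java.nio.ByteBuffer.wrap is specified to always yield a BIG_ENDIAN
    // buffer, regardless of the platform's native order:
    val buf = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4))
    println(buf.order())             // BIG_ENDIAN
    println(ByteOrder.nativeOrder()) // platform-dependent (e.g. LITTLE_ENDIAN on x86)
    // js.typedarray.TypedArrayBuffer.wrap, by contrast, reports nativeOrder()
  }
}
```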

raquo commented 1 year ago

Ah, I see, thanks for getting to the bottom of this!

I really don't know what the endianness should be, so I can't make a case either way. For all the classes and methods that exist on the JVM, Scala.js needs to match JVM behaviour for compatibility, and it does so for java.nio.ByteBuffer.wrap. But TypedArrayBuffer is a JS-only construct that is not available on the JVM, so it uses the native byte order for some reason. It's definitely intentional, as there are comments about it in the sources. Perhaps they need or want to match JS semantics for that one.

I don't feel too good about borer modifying the input's byte order if it won't reliably set it back to its original value at the end. Modifying the input's byte order could cause similar problems with other libraries that make different assumptions about endianness. Personally, I would rather have borer check the byte order and throw an exception describing the problem if it is wrong, with this requirement documented. That would be an improvement over the status quo without the need for bigger changes.

On my end I'll just set the order manually for now.

sirthias commented 1 year ago

I just released version 1.12.0 containing verification logic that the consumed ByteBuffer indeed has BIG_ENDIAN byte order configured. If not, it throws an IllegalArgumentException explaining the problem and possible resolutions. Thanks for reporting this issue!

raquo commented 1 year ago

Nice, thanks!