API design for version 1.0.0

mpizenberg / elm-js-typed-array

JavaScript TypedArray API in elm

Mozilla Public License 2.0

3 stars 1 forks source link

API design for version 1.0.0 #5

Open mpizenberg opened 6 years ago

mpizenberg commented 6 years ago

Currently, the API of this package is split in 4 modules:

JsArrayBuffer for the underlying JavaScript ArrayBuffer API.
JsTypedArray defines the type JsTypedArray a b and provides polymorphic functions operating on typed arrays.
JsUint8Array provides functions to create arrays of type JsTypedArray Uint8 Int
JsFloat64Array provides functions to create arrays of type JsTypedArray Float64 Float

The purpose of this issue is to discuss this API and its structure. For now, only two creation modules are provided (JsUint8Array and JsFloat64Array). If everything turns ok, I'll add the other creation modules:

JsInt8Array
JsUint8ClampedArray
JsInt16Array
JsUint16Array
JsInt32Array
JsUint32Array
JsFloat32Array

The full current API documentation is available here: doc-2573a08.txt

As an overview of the API, here is a list of the functions provided.

JsUint8Array and JsFloat64Array modules:

Creation from scratch: zeros, repeat, initialize
Creation from existing data: fromBuffer, fromArray, fromList, fromTypedArray
Decoding from JS Value: decode

JsTypedArray module:

Types: JsTypedArray a b, Uint8, Float64
Interoperability: toList, toArray, encode
Basic requests: length, getAt, unsafeGetAt, buffer, bufferOffset
Predicates: all, any, findIndex, filter, and their indexed... equivalent
Comparison: equal
Extraction and appending: extract, append
Transformations: replaceWithConstant, map, map2, reverse, sort, reverseSort, indexedMap, indexedMap2
Reductions: join, foldl, foldr, foldl2, foldr2, foldlr, and their indexed... equivalent

ianmackenzie commented 6 years ago

I'm sure you've thought about this quite a bit, but the two type parameters on JsTypedArray still seems a bit awkward...I'd be interested to see what the API would be like either going the fully-generic route or fully-specialized route. In the fully-generic route JsTypedArray would have only a single type parameter (Float or Int) and the underlying data type would be hidden; that would mean things like equals would have to handle arrays of different underlying types, but that seems doable (although a performance hit). In the fully-specialized route there would just be separate JsFloat64Array, JsFloat32Array, JsUint8Array etc. each with their own set of operations (including ways to convert between them). A lot of duplication, but very explicit!

Why did you choose to add the Js prefixes? If this is meant as a prototype for what typed array support in Elm would look like, then it shouldn't be explicitly tied to a particular compile target...better to figure out how typed arrays should behave in general, and just currently implement that using JavaScript's typed arrays.

For JsFloat64Array.decode etc., does "throws an error" mean that your program will actually crash if the value being decoded is not a Float64Array? Is there no way to catch this and just fail decoding instead?

mpizenberg commented 6 years ago

Thanks for the feedback @ianmackenzie , these are quite interesting questions!

I'd be interested to see what the API would be like either going the fully-generic route or fully-specialized route.

Actually, we have discussed this a little bit in issue #1 with Francisco. He was more in favor of the fully-generic route I think. I went for the double type parameter for the type safety and to keep a visual clue about the type used. A function like map (\n -> n + 1) will not return the same array if used on an Uint8Array or Int8Array for example. Also, in the generic approach, a function like map2 : (Int -> Int -> Int) -> TypedArray Int -> TypedArray Int -> TypedArray Int could take two different typed arrays and wouldn't know which underlying type to use for the returned array.

The fully specialized route would mean that we get rid of the JsTypedArray module.It could become an internal non-exposed module used with care by each specialized module. This would mean that there wouldn't be to much code duplication, but a lot more boilerplate, that's for sure!

I'm not sure how current fromTypedArray functions could be defined though If a shared type does not exist. Currently for JsUint8Array for example: JsUint8Array.fromTypedArray : JsTypedArray a b -> JsTypedArray Uint8 Int.

Another issue with the fully specialized approach is the ability to extend the API. If someone wants to create a new function for their use case, they would have to write it for every type since they can't use the Native trick.

Why did you choose to add the Js prefixes?

This is in reference to how Array will work in 0.19. Array is coded as an elm module relying on a thin wrapper of JavaScript Array API called JsArray: https://github.com/elm-lang/core/blob/master/src/Elm/JsArray.elm

I figured this module could have the same role regarding any data structure that would like to be based on typed arrays. For example, I'm trying to make an elm-tensor module on top of this, that would be a type for multi-dimentional arrays for scientific computing, based on JsTypedArray.

Due to performance requirements on operations with two matrices for my tests with elm-tensor (additions, multiplications, ...), I added functions to work on two typed arrays, (map2, foldl2, ...) and "indexed" functions. Meaning elm-js-typed-array is not as thin as I initially thought.

I'm genuinely interested in what other persons think of this too. Should it be kept as thin and equivalent to JS TypedArray API possible or more versatile?

For JsFloat64Array.decode etc., does "throws an error" mean that your program will actually crash if the value being decoded is not a Float64Array?

Yes, and I was wondering how to do it differently ... Current implementation is as follow:

decode : Decoder (JsTypedArray Float64 Float)
decode =
    Decode.map fromValue Decode.value

fromValue : Decode.Value -> JsTypedArray Float64 Float
fromValue =
    Native.JsFloat64Array.fromValue

And Native.JsFloat64Array.fromValue is implemented like this:

function fromValue(value) {
  if (!(value instanceof Float64Array)) {
    throw "This is not a Float64Array";
  }
  return value;
}

I couldn't figure another way of decoding using the Json.Decode API. Any better idea welcomed :)

ianmackenzie commented 6 years ago

I'm not sure how current fromTypedArray functions could be defined though If a shared type does not exist. Currently for JsUint8Array for example: JsUint8Array.fromTypedArray : JsTypedArray a b -> JsTypedArray Uint8 Int.

Yeah, in the fully-specialized case you'd have to just have a bunch of specialized conversions like JSUint8Array.fromJsInt8Array. Verbose but explicit...

This is in reference to how Array will work in 0.19. Array is coded as an elm module relying on a thin wrapper of JavaScript Array API called JsArray: https://github.com/elm-lang/core/blob/master/src/Elm/JsArray.elm

I figured this module could have the same role regarding any data structure that would like to be based on typed arrays. For example, I'm trying to make an elm-tensor module on top of this, that would be a type for multi-dimentional arrays for scientific computing, based on JsTypedArray.

I think this probably affects a lot of other design decisions - if this is mean only as a base for higher-level modules, then following JavaScript closely certainly makes sense. Does that imply that this package would only be used from other Elm kernel code? (I assume JsArray will not be visible to non-kernel code).

Also, in the generic approach, a function like map2 : (Int -> Int -> Int) -> TypedArray Int -> TypedArray Int -> TypedArray Int could take two different typed arrays and wouldn't know which underlying type to use for the returned array.

Good point - hadn't thought of that. Fully generic is probably not the way to go.

I couldn't figure another way of decoding using the Json.Decode API. Any better idea welcomed :)

Off the top of my head, how about having Native.JsFloat64Array.fromValue return a Maybe Float64Array instead of a Float64Array? Should be relatively straightforward by constructing the JS object with __ctor__ field etc....then fromValue could be something like

fromValue : Decode.Value -> JsTypedArray Float64 Float
fromValue =
    Native.JsFloat64Array.fromValue
        |> Decode.andThen
            (\maybeArray ->
                case maybeArray of
                    Just array ->
                        Decode.succeed array

                    Nothing ->
                        Decode.fail "Value was not a Float64Array"
            )

mpizenberg commented 6 years ago

Off the top of my head, how about having Native.JsFloat64Array.fromValue return a Maybe Float64Array instead of a Float64Array?

Ah good idea ^^. Corrected in d4802fd.

I think this probably affects a lot of other design decisions - if this is mean only as a base for higher-level modules, then following JavaScript closely certainly makes sense. Does that imply that this package would only be used from other Elm kernel code? (I assume JsArray will not be visible to non-kernel code).

Yes I also assume that JsArray will stay private. In this discussion on elm-dev about NumElm, Evan seemed to be considering to open the module if needed though.

My intention is indeed to make this a thin wrapper around JS TypedArray. I'm pretty sure we are far from any integration into Core. So my purpose is to keep it easy to use for exploration for Web APIs requiring typed arrays.

Regarding performances, the main issue is the update of values of an array, that must create a new array to respect immutability. Taking this into account, the functions that cannot be removed are map, reverse, sort. Functions replaceWithConstant, reverseSort and indexedMap could be reimplemented on top but with a huge performance hit so I'd keep them anyway I think.

Providing that unsafeGetAt is performant, transformation functions from two arrays (map2 and indexedMap2) and reductions functions of two arrays (foldl2, foldr2, foldlr) could actually be reimplemented with a small performance price I think from their indexed... equivalent on one array. Cost would mainly be due to function calls. I have noticed in benchmarks the cost that can bring those function calls but I haven't specifically benchmarked the difference between a native implementation and another on top of an indexed... one for these. I shall probably do it, to estimate real cost of removing those from the package.

Yeah, in the fully-specialized case you'd have to just have a bunch of specialized conversions like JSUint8Array.fromJsInt8Array. Verbose but explicit...

Actually, there could be an alternative that would be to introduce a JsTypedArray type playing the role of an interchange data type. It would appear only in two functions in each module:

-- JsUint8Array module
toTypedArray : JsUint8Array -> JsTypedArray
fromTypedArray : JsTypedArray -> JsUint8Array

Considering that there would be 9 different types, 2*9=18 is a lot more manageable than 8*8=64 functions ^^. The toTypedArray function would be an identity function hidden in a Native call.

The only two remaining drawbacks are the boilerplate (included documentation to maintain) and the difficulty to extend for users who'd like to write new functions (like map2 if it is removed from package).

For now I still prefer the version with two type parameters but I will try to prepare a summary of good/bad points when I'll prepare a post on discourse to ask for feedback more broadly, so don't hesitate to make other remarks on things that bother you, or that you like with one or the other approach.

ianmackenzie commented 6 years ago

It occurred to me that another thing to look at for inspiration is Python's buffer protocol, since it's been pretty successful at enabling stuff like NumPy: https://docs.python.org/3/c-api/buffer.html

mpizenberg commented 6 years ago

I didn't know that python buffer API already had builtin multidimentional aspect (shape, strides, ...). I thought it was only a numpy overlay. The JavaScript ArrayBuffer API is focus on linear representation, with an additional offset and length to enable slicing of buffers in constant time without copy.

I think the multidimentional aspect will only be useful for math/media-like manipulation, that's why I'm building elm-tensor on top of this. If it's not the case, I guess we shall see when trying to build those examples use cases #7 .

LaurinStrelow commented 6 years ago

You have not asked for this, but i hope i can comment on the JsDataView API. I think it would be better if the API follows more closely the JSON.Decoder API. So instead of the get* functions you have a JsDataView.Decoder a type and the following functions

offset: Int -> Decoder a -> Decoder a

decodeInt32 : Decoder Int
decodeInt16 : Decode Int
decodeInt8 : Decode Int

decodeUInt8 : Decode Int
...

decodeFloat : Decoder Float
....

along with the standard map, map2, ..., andThen, fail, succeed, oneOf, list, ... functions. With a function like decodeJsDataView: Decoder a -> JSDataView -> Result String a you can run the decoder. This function should automatically check whether we are decoding something out of bounds.

mpizenberg commented 6 years ago

You have not asked for this, but i hope i can comment on the JsDataView API.

Of course @LaurinStrelow all feedback is welcomed!

I think it would be better if the API follows more closely the JSON.Decoder API.

You are absolutely right on the fact that getters, with offset (and endianness that I didn't add yet for this first draft) don't fit very well with the other ways elm has to bring data from foreign environments i.e. decoders on JS values.

In issue #6, @ianmackenzie also proposed an alternative approach, the parser approach. Knowing that data is stored sequentially in a buffer array, and not organized like in JS objects with attributes, I think this is even better than the decoder approach. Using decoders (and getters!) you will always have to provide the offset, which can be pretty anoying for reading sequential data.

There are two reasons why I chose (for now) to go with getters.

That's the simplest thing I could do to quickly get something working. I don't have a lot of time for this project even if I'd really like to make something useful.
It fits well with the idea of the package, which is to be a minimalist wrapper of native JavaScript code. The less mixing of elm and JS in this project the better.

Point 2 (minimalist native package) was derived from the successful approach of Skinney/elm-array-exploration which splits its array implementation in two parts. First a minimal wrapper of JavaScript array in native code (Native/JsArray.js and Array/JsArray.elm), and second a pure elm implementation on top of it (Array/Hamt.elm).

My main concern about this separation is that the native code has to be minimalist, but also provide enough functionalities to compensate for the main issue of elm data structures, immutability (which is it's main advantage if we see the bigger picture). And I believe the only way to get it right is by trial and error. My ongoing experience with elm-tensor was exactly that. It's making me iterate a lot over TypedArray API, especially to provide creation or modification of arrays avoiding multiple passes.

The part of the DataView API that I couldn't figure out how to do in a minimalist way was the setters. In JavaScript, you can easily write on a buffer by modifying it. In elm, doing so using minimalist setters, in an immutable way, is a performance chasm. That's the reason why for now, I put setters in a MutableJsDataView module.

The parser approach, while being IMO superior for reading sequential data from buffer don't offer any solution for the writing of data. The Decoder/Encoder approach however, provides a coherent approach for writing and reading data.

Anyway, at some point, you have to read the data using native code, calling JS getUint8 etc.. I think you can just consider the corresponding elm functions as minimalist wrappers around those. Ideally, what would be very valuable would be if you could try to implement a decoder approach on top of this and report on the limitations you encountered.

Let me know what you think, thanks again!

LaurinStrelow commented 6 years ago

Thank your for your answer.

Anyway, at some point, you have to read the data using native code, calling JS getUint8 etc.. I think you can just consider the corresponding elm functions as minimalist wrappers around those. Ideally, what would be very valuable would be if you could try to implement a decoder approach on top of this and report on the limitations you encountered.

I have not tried to implement this, but I think it should be possible to implement Decoders on top of the get* functions. Somehow I think that in the best case the library should just expose one (familiar) way to decode the data. So the Decoder and the Parser approach should both be good but we should have just one of both. (With both approaches you should be able to rebuild the get* functions with a line of code.)

Using decoders (and getters!) you will always have to provide the offset, which can be pretty anoying for reading sequential data.

This should depend on the data you are trying to decode. If your data is not sequential (and may require a lot of jumps to different positions), you may need a lot of extra afford to decode it. I am not familiar with the binary data formats so this might be no problem.

The part of the DataView API that I couldn't figure out how to do in a minimalist way was the setters. In JavaScript, you can easily write on a buffer by modifying it. In elm, doing so using minimalist setters, in an immutable way, is a performance chasm. That's the reason why for now, I put setters in a MutableJsDataView module.

I think you have a really performance critical use case in mind. The average elm developer should not bother an additional copy of their data. I think the most algorithms can be implemented with initialise and map in a performant way. However there are certain algorithms which cannot be implemented in the fastest way in Elm. To work around this, I think, it is the best to implement this specific algorithm as a native function. I understand if this defeats your purpose but the setter functions should really be avoided.