Open tnielens opened 2 years ago
Some questions arise:
At the moment saddle provides two copies from many operations (eg max, max2) one omitting the missingness check the other respecting missigness. All code in linalg is oblivious to missingness.
would the binary operations execute the op in the missing positions (to omit the branching) then union the mask?
That's an approach indeed. So as to make sure to enable vectorization. To be validated with benchmarking. Also, the vector could have a field hasNAs
. If the field is false, the whole NA processing could be skipped.
what would be the return type of the method which extracts a value? (Now raw )
There are two variants if I understand correctly. The boxed one returning Scalar
. That one would remain untouched. The raw
one could also remain untouched as well with the note that if the corresponding position in the vector is a NA, the value returned has no meaning and could be anything.
is the mask sparse or dense?
I'd start with dense for simplicity? also along the hasNAs
idea above, the bool array could not be allocated if the constructor ensures there is no NA values. That said, since Vec is a mutable structure, that might not be good idea.
At the moment saddle provides two copies from many operations (eg max, max2) one omitting the missingness check the other respecting missigness. All code in linalg is oblivious to missingness.
These aspects of saddle are surprising imo. I assumed these 2 implementations were semantically equivalent.
I see. I think I concur with most of these.
My only remaining concern is that I believe there should be a way to extract a Double from a Vec[Double] without going through Scalar, except if we are sure that the VM eliminates the allocations the Scalar on branch less code path (e.g. vec.apply(0).get
). This is to enable tight loops on a Vec.
Currently, saddle uses one value of each primitive type to represent
NA
. For floating point numbers, this is straightforward as they already include such a value. For other types (Boolean
,Byte
,Int
, etc), it isn't straightforward and an arbitrary value must be used. Currently the minimum value is used (Byte.MinValue
,Short.MinValue
, etc).I think this approach has important drawbacks:
MinValue
would result in a missing value.if (tag.isMissing(v1)) v1 else v1 + 2
might prevent loop optimizations of the jvm to kick-in..raw(i)
-like api exposes unnecessary complexity to the user.An alternative approach would be to use a mask-based implementation for the integer-based
Vec[T]
s. That is, the vector stores a companionArray[Boolean]
indicating missing value. This approach is used by pandas.