pityka / saddle

SADDLE: Scala Data Library
https://pityka.github.io/saddle/
Apache License 2.0
38 stars 11 forks source link

[proposition] mask-based implementation for missing values #441

Open tnielens opened 2 years ago

tnielens commented 2 years ago

Currently, saddle uses one value of each primitive type to represent NA. For floating point numbers, this is straightforward as they already include such a value. For other types (Boolean, Byte, Int, etc), it isn't straightforward and an arbitrary value must be used. Currently the minimum value is used (Byte.MinValue, Short.MinValue, etc).

I think this approach has important drawbacks:

An alternative approach would be to use a mask-based implementation for the integer-based Vec[T]s. That is, the vector stores a companion Array[Boolean] indicating missing value. This approach is used by pandas.

pityka commented 2 years ago

Some questions arise:

At the moment saddle provides two copies from many operations (eg max, max2) one omitting the missingness check the other respecting missigness. All code in linalg is oblivious to missingness.

tnielens commented 2 years ago

would the binary operations execute the op in the missing positions (to omit the branching) then union the mask?

That's an approach indeed. So as to make sure to enable vectorization. To be validated with benchmarking. Also, the vector could have a field hasNAs. If the field is false, the whole NA processing could be skipped.

what would be the return type of the method which extracts a value? (Now raw )

There are two variants if I understand correctly. The boxed one returning Scalar. That one would remain untouched. The raw one could also remain untouched as well with the note that if the corresponding position in the vector is a NA, the value returned has no meaning and could be anything.

is the mask sparse or dense?

I'd start with dense for simplicity? also along the hasNAs idea above, the bool array could not be allocated if the constructor ensures there is no NA values. That said, since Vec is a mutable structure, that might not be good idea.

At the moment saddle provides two copies from many operations (eg max, max2) one omitting the missingness check the other respecting missigness. All code in linalg is oblivious to missingness.

These aspects of saddle are surprising imo. I assumed these 2 implementations were semantically equivalent.

pityka commented 2 years ago

I see. I think I concur with most of these.

My only remaining concern is that I believe there should be a way to extract a Double from a Vec[Double] without going through Scalar, except if we are sure that the VM eliminates the allocations the Scalar on branch less code path (e.g. vec.apply(0).get). This is to enable tight loops on a Vec.