rust-ndarray / ndarray

ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations
https://docs.rs/ndarray/
Apache License 2.0

Named Tensors #599

Open · Ricocotam opened this issue 5 years ago

Ricocotam commented 5 years ago

Hi! Recently, HarvardNLP published a blog post proposing Named Tensors. The idea is to name dimensions rather than referring to them by index. The repo is here.

Their second post shows how difficult it is to retrofit such a feature once a framework is mature. Here is my proposal: what do you think about adding this functionality to ndarray?

I think this is what future frameworks should aim for, and since this project is young, the timing seems right.

LukeMathWalker commented 5 years ago

Hi @Ricocotam! I read the blog post when it came out and found it quite interesting as well. Nonetheless, I don't think you can avoid index-based tensor access if you want to provide a foundational tensor data structure. Named tensors are one step higher in the abstraction chain, and it might be interesting to try to build them on top of ndarray, similarly to NumPy's structured arrays. I don't know, though, whether this should live in ndarray or in a different crate, at least to start experimenting with it.

Ricocotam commented 5 years ago

I agree with you that we shouldn't drop index-based tensors. But I feel that having named tensors natively supported would be a great addition.

As I said, I'm a complete beginner in Rust, so I can't help right now. You might be right that this should start in a different crate.

Edit: I just read the entire article on NumPy's structured arrays, and I don't think it's a good comparison (apart from the fact that it's also built on top of NumPy).

What I'm talking about (just to make it clear for other people reading this) is name-based tensor indexing. That way, when you do a dot product, you don't have to transpose or add/remove dimensions.

A quick example from neural networks (in Python-like pseudocode):

#####
# Index-based
batch_size, feature_size = 32, 300
x = Array(size=(batch_size, feature_size))
w = Array(size=(feature_size, 1))

# We would like to do w.dot(x)
w.dot(x)          # Error: dimension mismatch, (features, 1) x (batch, features)
w.t().dot(x.t())  # works, but both operands must be transposed
x.dot(w)          # works, but the order no longer matches the math

#####
# Name-based
x = NamedArray(size={"batch": batch_size, "features": feature_size})
w = NamedArray(size={"features": feature_size, "output": 1})

w.dot(x, index="features")  # No transpose, everything is crystal clear

jturner314 commented 5 years ago

Named tensors would be really nice. A couple of thoughts:

max-sixty commented 5 years ago

It is possible today to write a named-tensor wrapper around dynamic-dimensionality arrays (ArrayBase<S, IxDyn>) without any issues. This would work well in a separate crate.

Where would you store the dimension names? On the type, like `const DIMENSIONS: &'static [&'static str]`? Or defined at runtime?
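
For the compile-time option, I imagine something like this minimal sketch (the `NamedDims` trait, `BatchFeatures` type, and `StaticNamedArray` are hypothetical, not part of ndarray):

use std::marker::PhantomData;

use ndarray::{ArrayBase, IxDyn, RawData};

/// Hypothetical marker trait carrying axis names as an associated constant.
pub trait NamedDims {
    const DIMENSIONS: &'static [&'static str];
}

/// Example axis-name set: axis 0 is "batch", axis 1 is "features".
pub struct BatchFeatures;

impl NamedDims for BatchFeatures {
    const DIMENSIONS: &'static [&'static str] = &["batch", "features"];
}

/// A named array whose axis names are fixed at compile time.
pub struct StaticNamedArray<S: RawData, N: NamedDims> {
    array: ArrayBase<S, IxDyn>,
    names: PhantomData<N>,
}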

max-sixty commented 5 years ago

I'm starting to explore this, initially with dimensions declared at runtime. Something like:


use ndarray::{ArrayBase, RawData};

pub struct NamedArray<S, D>
where
    S: RawData,
{
    array: ArrayBase<S, D>,
    dimensions: Vec<String>,
}

...and then I intend to add logic to match dimensions across arrays on various operations.

Is this a reasonable approach?

I'm on the xarray team, so I know named tensors well, but I'm just starting out in Rust and would appreciate any guidance.

jturner314 commented 5 years ago

Yes, that's essentially what I'd recommend, except that I'd suggest using IxDyn instead of a generic D:

use ndarray::{ArrayBase, IxDyn, RawData};

pub struct NamedArrayBase<S>
where
    S: RawData,
{
    array: ArrayBase<S, IxDyn>,
    dimensions: Vec<String>,
}

This is essentially equivalent to a Python class containing a NumPy array and a list of dimension names. In the future, you may look into strategies to reduce the cost of cloning axis names. (For example, you could use integers instead of strings for axis names.) However, I wouldn't worry too much about that initially.
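
As one illustration of the integer-id idea, a minimal interning sketch (the `AxisRegistry` type is a hypothetical helper, not part of ndarray):

use std::collections::HashMap;

/// Hypothetical registry that interns axis names as integer ids.
#[derive(Default)]
pub struct AxisRegistry {
    ids: HashMap<String, usize>,
    names: Vec<String>,
}

impl AxisRegistry {
    /// Returns a stable integer id for `name`, allocating a new one if needed.
    pub fn intern(&mut self, name: &str) -> usize {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len();
        self.ids.insert(name.to_owned(), id);
        self.names.push(name.to_owned());
        id
    }

    /// Looks the name back up from its id.
    pub fn name(&self, id: usize) -> &str {
        &self.names[id]
    }
}

A named array would then store a Vec<usize> of axis ids instead of a Vec<String>, which is cheap to clone and compare.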

The reason for just using IxDyn everywhere instead of generic D is due to operations across multiple arrays. For example, let's say you want to add two NamedArray instances. The number of dimensions in the output array is equal to (number of dimensions of left array) + (number of dimensions of right array) - (number of dimensions with the same name). Since the names are known only at runtime, the number of dimensions in the result is known only at runtime (even if you know the number of dimensions in each of the input arrays).

I suppose there's nothing wrong with generic D, it's just that pretty much all operations would end up returning a named array with dimensionality IxDyn.

As I discussed in my earlier comment, I think it will be possible to handle axis names in the type system (at compile time) if desired in the future with a few more language features, but AFAICT, it's not possible to do so in current Rust, so we should just use IxDyn for the time being.

Actually writing the implementations for various operations isn't especially straightforward at the moment given only the methods ndarray provides, since ndarray doesn't provide methods for operating on multiple axes of a single array, or for matching arbitrary axes across multiple arrays. (For example, adding an array with names ["a", "b", "c"] to an array with names ["a", "d", "c"] would result in an array with names ["a", "b", "c", "d"], of which axes "a" and "c" are common between the two input arrays. Given only the methods available today, the most obvious implementation would be to get a view of each input array with `.view()`, re-order the axes using `.permuted_axes()`, and call `.broadcast()` on each view such that the resulting views correspond to axis names ["a", "b", "c", "d"]. The views could then be added together normally.)

To factor out the tedious and error-prone `.permuted_axes()` and `.broadcast()` combination, I'd recommend starting by implementing methods similar to the following:

use ndarray::{Data, OwnedRepr, RawData, ViewRepr};

pub type NamedArray<A> = NamedArrayBase<OwnedRepr<A>>;
pub type NamedArrayView<'a, A> = NamedArrayBase<ViewRepr<&'a A>>;
pub type NamedArrayViewMut<'a, A> = NamedArrayBase<ViewRepr<&'a mut A>>;

/// Error returned when arrays cannot be broadcast together.
#[derive(Clone, Copy, Debug)]
pub struct BroadcastError;

/// Returns the axis names and shape for the union of the axes in the arrays.
///
/// Returns `None` if the arrays have the same axis name but differing lengths
/// greater than 1 for that axis (i.e. if the arrays cannot be broadcast together).
pub fn shape_union<S1, S2>(
    arr1: &NamedArrayBase<S1>,
    arr2: &NamedArrayBase<S2>,
) -> Option<(Vec<String>, Vec<usize>)>
where
    S1: RawData,
    S2: RawData,
{
    // to be implemented
    unimplemented!()
}

pub fn broadcast_together<'a, 'b, S1, S2>(
    arr1: &'a NamedArrayBase<S1>,
    arr2: &'b NamedArrayBase<S2>,
) -> Result<(NamedArrayView<'a, S1::Elem>, NamedArrayView<'b, S2::Elem>), BroadcastError>
where
    S1: Data,
    S2: Data,
{
    // `shape_union` returns an `Option`, so convert `None` into the error type.
    let (names, shape) = shape_union(arr1, arr2).ok_or(BroadcastError)?;
    Ok((arr1.broadcast(names.clone(), &shape)?, arr2.broadcast(names, &shape)?))
}

impl<A, S> NamedArrayBase<S>
where
    S: RawData<Elem = A>,
{
    pub fn view(&self) -> NamedArrayView<'_, A>
    where
        S: Data,
    {
        NamedArrayView {
            array: self.array.view(),
            dimensions: self.dimensions.clone(),
        }
    }

    /// Broadcasts `self` such that the resulting view has the specified shape and axis names.
    ///
    /// For each axis name:
    ///
    /// * If the name is the name of an axis of `self`, the length of that axis
    ///   must be `1` (in which case the axis is lengthened) or equal to the
    ///   corresponding element of `shape` (in which case the axis is unchanged).
    ///
    /// * If the name is *not* the name of any axis of `self`, a new axis is created.
    ///
    /// **Errors** if `names` does not contain all the axis names of `self`
    /// or if any of the lengths specified in `shape` is incompatible with the existing shape.
    pub fn broadcast(&self, names: Vec<String>, shape: &[usize]) -> Result<NamedArrayView<'_, A>, BroadcastError>
    where
        S: Data,
    {
        // to be implemented using `permuted_axes` and `broadcast`
        unimplemented!()
    }
}
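
For what it's worth, here is one way the `shape_union` stub could be filled in. This is just a sketch, assuming the usual broadcasting rule that an axis shared by name must have equal lengths on both sides, or length 1 on one of them:

pub fn shape_union<S1, S2>(
    arr1: &NamedArrayBase<S1>,
    arr2: &NamedArrayBase<S2>,
) -> Option<(Vec<String>, Vec<usize>)>
where
    S1: RawData,
    S2: RawData,
{
    // Start from the axes of `arr1`, then merge in the axes of `arr2`.
    let mut names: Vec<String> = arr1.dimensions.clone();
    let mut shape: Vec<usize> = arr1.array.shape().to_vec();
    for (name, &len) in arr2.dimensions.iter().zip(arr2.array.shape()) {
        match names.iter().position(|n| n == name) {
            // Shared axis: lengths must match, or one of them must be 1.
            Some(i) => match (shape[i], len) {
                (a, b) if a == b => {}
                (1, b) => shape[i] = b,
                (_, 1) => {}
                _ => return None,
            },
            // Axis only present in `arr2`: append it to the union.
            None => {
                names.push(name.clone());
                shape.push(len);
            }
        }
    }
    Some((names, shape))
}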

Then, implementing addition for two arrays would be something like:

use std::ops::Add;

impl<'a, 'b, A1, A2, A3, S1, S2> Add<&'b NamedArrayBase<S2>> for &'a NamedArrayBase<S1>
where
    S1: Data<Elem = A1>,
    S2: Data<Elem = A2>,
    A1: Add<A2, Output = A3> + Clone,
    A2: Clone,
{
    type Output = NamedArray<A3>;

    fn add(self, rhs: &'b NamedArrayBase<S2>) -> NamedArray<A3> {
        // `self` and `rhs` are already references, so pass them through directly.
        let (lhs, rhs) = broadcast_together(self, rhs).expect("error message...");
        NamedArray {
            array: &lhs.array + &rhs.array,
            dimensions: lhs.dimensions,
        }
    }
}
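
To make the payoff concrete, hypothetical usage would look like this (the `from_elem` constructor is an assumption for illustration, not part of the sketch above):

// Axes are matched by name, not by position, so no transposing is needed.
let a: NamedArray<f64> = NamedArray::from_elem(&["batch", "features"], &[32, 300], 1.0);
let b: NamedArray<f64> = NamedArray::from_elem(&["features"], &[300], 2.0);
let sum = &a + &b; // axes ["batch", "features"], every element 3.0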

There really isn't a good way to handle operations that fold over multiple axes using the existing methods of ndarray. I'm working on a project to provide this functionality (https://github.com/jturner314/nditer). It's mostly complete but needs more tests before I feel comfortable publishing it.

Please feel free to reach out with questions. I'm personally very interested in named tensor functionality.

max-sixty commented 5 years ago

Thanks a lot @jturner314! I'll come back to you with questions. I diverted some time to making Clippy pass; hopefully my current spell of spare time for OSS will continue...

kazimuth commented 4 years ago

See also the xarray Python library. I started using it recently and I'm in love; never going back to plain NumPy. Would love to see something like it for Rust.

bluss commented 4 years ago

I've also been inspired by xarray, but it's hard to make it fit with Rust, at least if we try to emulate dynamic types. Something that was proposed (probably by @jturner314!?) would be similar to what we do when deriving serialization formats with serde in Rust: use a derive or similar construct to define which array columns/variables exist, with their names and types, and let code be generated for that kind of array structure. Then you have static types for a custom xarray-like thing.
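
A hand-written sketch of the kind of type such a derive might generate (all names here are hypothetical):

use ndarray::{Array2, ArrayView1, Axis};

/// Axis 0 is always "batch" and axis 1 is always "features", by construction.
pub struct Activations {
    data: Array2<f64>,
}

impl Activations {
    pub fn new(data: Array2<f64>) -> Self {
        Activations { data }
    }

    /// View of one batch element (a row along the statically known "batch" axis).
    pub fn batch(&self, i: usize) -> ArrayView1<'_, f64> {
        self.data.index_axis(Axis(0), i)
    }

    /// Length of the statically known "features" axis.
    pub fn n_features(&self) -> usize {
        self.data.len_of(Axis(1))
    }
}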

kazimuth commented 4 years ago

I was wondering about something along those lines as well.

You could also do something like the string approach above, but with #[derive]'d enums instead of string keys. With some macros you might get close to a more seamless experience.
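
A minimal sketch of the enum-keyed idea, assuming axis names are a user-defined enum rather than strings (the `Dim` and `EnumNamedArray` names are hypothetical):

use ndarray::{ArrayBase, IxDyn, RawData};

/// User-defined axis names; `Copy` + `Eq` makes matching cheap and typo-proof.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Dim {
    Batch,
    Features,
}

pub struct EnumNamedArray<S: RawData> {
    array: ArrayBase<S, IxDyn>,
    /// One entry per axis of `array`.
    dimensions: Vec<Dim>,
}

impl<S: RawData> EnumNamedArray<S> {
    /// Returns the index of the axis carrying the given name, if any.
    pub fn axis_of(&self, dim: Dim) -> Option<usize> {
        self.dimensions.iter().position(|&d| d == dim)
    }
}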

TobiasJacob commented 2 years ago

I think it would be awesome to avoid dynamic types and keep things static. If you catch invalid tensor operations at compile time, you save a lot of time that you would otherwise spend debugging your code.

I am also fairly new to Rust, but experienced with NumPy and PyTorch. Dynamic typing makes code extremely hard to develop for anything beyond toy projects. There are many cases where it is unclear how dimensions are ordered because the original programmer didn't keep the code clean (YoloV5's data loader, for example, has no comments at all about tensor shapes). Then, once you figure it out, you run the code or submit the job to the cluster (because the model is too large to execute on your dev machine), everything takes a while to load, and it crashes because of some other weird indexing error. Or, even worse, you accidentally broadcast something, the model doesn't converge, and you don't know why. (For example, accidentally broadcasting agent rewards so that they are applied to the whole batch instead of a single agent, because the shape was [BatchSize, 1] instead of [BatchSize]; this happened to me.) This would not happen with named tensors.

As shown in the named tensor notation repository, you don't even need many elementary functions to express a lot of state-of-the-art neural networks.

If Rust supported a differentiable version of xarray, ideally with GPU support (even though this is far in the future), with named dimensions checked at compile time and no runtime exceptions, I would immediately switch over from PyTorch!

TobiasJacob commented 2 years ago

One project with an implementation is tch-typed-tensor, and its follow-up: rust-typed-tensor.

bionicles commented 10 months ago

I would use this

Jasha10 commented 3 months ago

> I think it would be awesome to avoid dynamic types and keep things static. If you catch invalid tensor operations at compile time, you save a lot of time that you would otherwise spend debugging your code.

There is some prior art at https://github.com/JuliaArrays/AxisArrays.jl, which uses Julia's type system to keep track of the names of array axes.

bionicles commented 2 months ago

By the way, I got typed tensors to work last year, but the compile errors are pretty heinous with an HList of typenum types.

I recommend const generics instead, but then you are stuck on nightly.
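
For example, a minimal sketch of the const-generics direction (all names here are hypothetical): shapes, though not axis names, are checked at compile time. This much compiles on stable; the nightly requirement (generic_const_exprs) kicks in once output shapes need arithmetic on the const parameters, e.g. for concatenation.

/// A 2-D tensor whose shape is part of its type.
pub struct Tensor<const ROWS: usize, const COLS: usize> {
    data: [[f64; COLS]; ROWS],
}

impl<const ROWS: usize, const COLS: usize> Tensor<ROWS, COLS> {
    pub fn zeros() -> Self {
        Tensor { data: [[0.0; COLS]; ROWS] }
    }

    /// Matrix product: the inner dimension is enforced by the compiler,
    /// so a shape mismatch is a type error, not a runtime panic.
    pub fn dot<const K: usize>(&self, rhs: &Tensor<COLS, K>) -> Tensor<ROWS, K> {
        let mut out = Tensor::<ROWS, K>::zeros();
        for i in 0..ROWS {
            for j in 0..K {
                for c in 0..COLS {
                    out.data[i][j] += self.data[i][c] * rhs.data[c][j];
                }
            }
        }
        out
    }
}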