rust-ndarray / ndarray-stats

Statistical routines for ndarray
https://docs.rs/ndarray-stats
Apache License 2.0
201 stars 25 forks source link

Argmin argmax skipnan #33

Closed phungleson closed 5 years ago

phungleson commented 5 years ago

Implement argmin_skipnan and argmax_skipnan

LukeMathWalker commented 5 years ago

Considering the discussion we are having in #32 I think it makes sense to align this PR to it as well :+1:

phungleson commented 5 years ago

Cool thanks @LukeMathWalker I have updated the docs to reflect the changes.

Do you guys want to update from indexed_iter() to some sort of fold() to reflect that changes as well?

phungleson commented 5 years ago

Looks like with fold we can only access the value but not the pattern, do we have similar function that allows us to calculate the pattern?

LukeMathWalker commented 5 years ago

I don't think, right now, there is a performance difference between a for loop and fold, given that you are using the same iteration strategy indexed_iter - @jturner314 can correct me if I am wrong.

Could you add a property test for both functions similar to https://github.com/jturner314/ndarray-stats/blob/7df07289c3a378154be51642e7ce610312b286f0/tests/quantile.rs#L28?

phungleson commented 5 years ago

Hey mate, I have made the returns clearer and added quickcheck tests

phungleson commented 5 years ago

Also switch to use fold, let me know if it looks alright

LukeMathWalker commented 5 years ago

It looks good to me. I'll let it open another day to give @jturner314 the opportunity to chip in if he has any observation. Good work @phungleson!

phungleson commented 5 years ago

Cool thanks @LukeMathWalker let me know if you want to convert for to fold for arg_min/max as well if it makes any performance sense.

On a separate random thought, if ndarray stores elements in certain order then these kind of operation should be fast? is there a thing like that in ndarray already? is it an important feature? Sorry for the questions due to my lacks of understanding of ndarray here and in other languages.

jturner314 commented 5 years ago

Thanks for working on this @phungleson!

I think it would be nice to abstract out some of the logic with an indexed_fold_skipnan function (similar to the fold_skipnan function used by min/max_skipnan). Additionally, the quickcheck tests aren't quite right. I've created phungleson/ndarray-stats#1 to add the indexed_fold_skipnan function and fix the tests.

Edit: Whoops, it looks like I added back the map that @LukeMathWalker objected to. I just pushed another commit to phungleson/ndarray-stats#1 to take the map back out.

I don't think, right now, there is a performance difference between a for loop and fold

That's correct. IndexedIter doesn't specialize fold, so .indexed_iter().fold() falls back to the standard library's implementation of fold, which is implemented in terms of try_fold, which is implemented in terms of while let with .next(). for loops are essentially implemented this way, so I wouldn't expect to see any performance difference. However, it would be possible for IndexedIter to specialize fold in the future and possibly achieve somewhat better performance.

On a separate random thought, if ndarray stores elements in certain order then these kind of operation should be fast? is there a thing like that in ndarray already? is it an important feature?

You seem to be asking about iteration order in relation to memory layout. ndarray allows very flexible memory layout of array elements (arbitrary strides for each axis). There are advantages to iterating over elements in the same order they're laid out in memory, so some operations such as ArrayBase.fold adjust the order of iteration to maximize performance. That's why arr.fold() should generally be faster than arr.iter().fold(). There are some more optimizations that we could perform but aren't yet (e.g. merging axes and reordering more axes than we do currently). Does that answer your question?

phungleson commented 5 years ago

@jturner314 Thanks for the explanation, all good!

The implementation is also very good, thanks!

BTW, PR is probably not a right place to ask this (if it is inappropriate here that is all good too thanks!) but my head is still hurt with the lifetime here, it would be great if you have time for quick explanation or point me to some source to read.

Does this mean? indexed_fold_skipnan has a lifetime 'a that is the lifetime of &self and &A::NotNan in FnMut should also lives at least 'a period, why does A needs to live at least 'a, is it because of &A::NotNan? what happens if A doesn't live at least 'a?

    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
    where
        A: 'a,
        F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;
jturner314 commented 5 years ago

Does this mean? indexed_fold_skipnan has a lifetime 'a that is the lifetime of &self and &A::NotNan in FnMut should also lives at least 'a period

Yes. 'a is a lifetime parameter to indexed_fold_skipnan. self is borrowed for lifetime 'a. The element type A must live at least as long as 'a. F takes references to A::NotNan that are valid for lifetime 'a.

why does A needs to live at least 'a, is it because of &A::NotNan?

Yes, that is one of the reasons, but somewhat indirectly.

First, it's helpful to understand what exactly it means for a type to live at least as long as a lifetime. Most types, such as f64, Vec<i32>, String, etc., live for the 'static lifetime (so they meet an A: 'a bound regardless of what 'a is) because they don't contain any references. Types have a shorter lifetime when they contain references. For example, &'b i32 lives for lifetime 'b, and Cow<'c, [f32]> lives for lifetime 'c. If A is &'b i32, it meets an A: 'a bound if 'b is at least as long as 'a (i.e. 'b: 'a). See this section of the book.

It's also worth pointing out that A: 'a will always be true in practice in the implementation of MaybeNanExt for ArrayBase<S, D> because A is the element type of the array, and we're taking a reference &'a self to the array (so the elements must live at least as long as 'a).

If we remove the A: 'a bound, the compiler says:

error[E0309]: the associated type `<A as maybe_nan::MaybeNan>::NotNan` may not live long enough
   --> src/maybe_nan/mod.rs:248:5
    |
248 | /     fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
249 | |     where
250 | |         F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;
    | |______________________________________________________^
    |
    = help: consider adding an explicit lifetime bound `<A as maybe_nan::MaybeNan>::NotNan: 'a`...
note: ...so that the reference type `&'a <A as maybe_nan::MaybeNan>::NotNan` does not outlive the data it points at
   --> src/maybe_nan/mod.rs:248:5
    |
248 | /     fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
249 | |     where
250 | |         F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;
    | |______________________________________________________^

error[E0309]: the associated type `<A as maybe_nan::MaybeNan>::NotNan` may not live long enough
   --> src/maybe_nan/mod.rs:313:5
    |
313 | /     fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, mut f: F) -> B
314 | |     where
315 | |         F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B,
316 | |     {
...   |
323 | |         })
324 | |     }
    | |_____^
    |
    = help: consider adding an explicit lifetime bound `<A as maybe_nan::MaybeNan>::NotNan: 'a`...
note: ...so that the reference type `&'a <A as maybe_nan::MaybeNan>::NotNan` does not outlive the data it points at
   --> src/maybe_nan/mod.rs:313:5
    |
313 | /     fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, mut f: F) -> B
314 | |     where
315 | |         F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B,
316 | |     {
...   |
323 | |         })
324 | |     }
    | |_____^

so the problem is that without the A: 'a bound, the compiler can't guarantee that A::NotNan will live long enough. (Note that A::NotNan must live for at least 'a because we're creating a reference of life 'a to a value of type A::NotNan and passing it to the closure (the &'a A::NotNan parameter). The life of a reference must never be longer than the life of the type it's referencing.)

How does adding an A: 'a bound fix this? After all, we still haven't directly put a bound on A::NotNan like the compiler suggested. I think what's happening here is that an A: 'a bound does in fact guarantee that A::NotNan: 'a is also satisfied. If you look at the definition of the MaybeNan trait, you can see that it's not possible to implement the trait such that NotNan will have a shorter lifetime than Self. (Try to come up with such a case; you'll see it's not possible because any lifetime parameter of the NotNan type must also be a lifetime parameter of Self.¹) So, we're indirectly guaranteeing A::NotNan: 'a by specifying an A: 'a bound.

Fwiw, this is getting pretty into-the-weeds of lifetimes. I didn't think about this when I was writing the function. If we just follow the compiler's suggestions, starting from no bounds:

    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
    where
        F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;

the compiler says that A::NotNan may not live long enough and suggests adding an A::NotNan: 'a bound:

    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
    where
        A::NotNan: 'a,
        F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;

then the compiler says that A may not live long enough (this is the other reason why we need the A: 'a bound; the function signature may not require it, but the implementation does), and the compiler suggests adding an A: 'a bound:

    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
    where
        A: 'a,
        A::NotNan: 'a,
        F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;

This then satisfies the compiler, and we could stop here. We could try removing the A::NotNan: 'a bound to see if the compiler is still satisfied; it turns out that this works. I can justify in retrospect why it did, but I didn't need to think through the details when I was writing the code; I could just follow the compiler's suggestions, and if I tried removing a bound that I needed, the compiler would let me know.

what happens if A doesn't live at least 'a?

I can't think of a situation where A: 'a wouldn't be satisfied because we're taking &'a self, as explained earlier. If it was possible to create such a case, then if A didn't live at least 'a, we'd be creating references with a lifetime that would outlive their data (e.g. the data might contain references to already-freed memory or something). That's why the compiler requires that A: 'a.

On another note, it would be possible to define indexed_fold_skipnan this way:

    fn indexed_fold_skipnan<F, B>(&self, init: B, f: F) -> B
    where
        F: FnMut(B, (D::Pattern, &A::NotNan)) -> B;

but that's not as convenient for the caller because they get &A::NotNan references that only live for the body of the closure. This would make argmin_skipnan more difficult to implement because we couldn't store a reference to the current minimum between different calls to the closure. This is an example of a higher-ranked trait bound. See the nomicon and the reference for details.

¹ Technical note: It turns out that this observation (associated types can never have a shorter lifetime than Self) is true for all traits in current Rust. Once Rust has generic associated types, that will no longer be the case for all traits because it will be possible to have traits like trait Foo { type Assoc<'a>; } where the associated type has a lifetime parameter not part of Self. However, it will still be true that the NotNan type in the MaybeNan trait cannot have a shorter lifetime than Self because type NotNan doesn't have any type/lifetime parameters.

jturner314 commented 5 years ago

I've added a "Breaking changes" label because adding a method to MaybeNanExt is breaking for any implementers of that trait (which should basically be no one in practice).

phungleson commented 5 years ago

Thanks for your awesome explanation! @jturner314 😄

It probably will take some more time to have the concept type with lifetime into my head (it is quite a rust thing, isn't it? haven't thought about this in other languages that I know) but your explanation and examples are super clear, thanks!

I will probably read more about higher-ranked trait bound.

jturner314 commented 5 years ago

It probably will take some more time to have the concept type with lifetime into my head (it is quite a rust thing, isn't it? haven't thought about this in other languages that I know) but your explanation and examples are super clear, thanks!

You're welcome. I recommend reading the book if you haven't already; it's very good.

The concept of lifetimes exists in other languages too; they just aren't explicit and the compiler doesn't provide support for getting them correct. Rust's innovation is that the language provides explicit support for dealing with lifetimes, and compiler checks them for correctness. For example, I've worked on C++ projects in the past, and it wasn't uncommon to talk about lifetimes and ownership, because trying to access an object in C++ after it's been freed is undefined behavior. For example, the docs for unique_ptr talk about lifetimes; unique_ptr is for moving around objects with dynamic lifetime. In all but the simplest cases, we'd try to maintain some safety by using smart pointers such as unique_ptr or shared_ptr.

Many languages work around lifetime issues by using things like reference counting (e.g. shared_ptr in C++) or garbage collection (e.g. Python, Ruby, Java, Go, JavaScript). That's why you've probably not seen the idea of a "type with a lifetime" if you've worked with garbage collected languages in the past. Reference counting and garbage collection ensure at runtime that the data that references point to remains valid while the references exist. (This is like the runtime equivalent of the Rust compiler checking at compile time that if you have a reference &'a A, A will live as long as 'a.) The Rust standard library does provide reference counting if you need it, but since the compiler checks that you're using normal references correctly, you generally don't need reference counting.

The concept of a "type with a lifetime" exists in other languages too. For example, a span in C++ is similar to a slice &'a [T] in Rust (which has lifetime 'a). If you try to access elements in the span/slice after the underlying data has been freed, you'll get undefined behavior (in C++; the Rust compiler guarantees this won't happen). The lifetime of the data in a C++ span is implicit, and the programmer has to be careful not to hold onto the span longer than its lifetime, while the lifetime of the data in a Rust slice is explicit, and the compiler checks that it can't be accessed longer than its lifetime. A type similar to span/&'a [T] in Python is a NumPy view, but Python has garbage collection, so the data in a view is preserved until the view goes away (so you can't get undefined behavior).

phungleson commented 5 years ago

Thanks a lot @jturner314 your explanation is very clear, I have read them a few times and each time things seem clearer and clearer.

It is also easier to understand when I read the book now. Thanks!

phungleson commented 5 years ago

Anything left to consider for this one @jturner314

Thanks!

jturner314 commented 5 years ago

Looks good to me. Thanks for working on this!