Open mydogisbox opened 4 years ago
We cannot change the existing definition of scan
, but maybe adding another method makes sense, provided that this is useful enough.
Isn't map_with_state
just map
? I don't see what the difference is given that Rust has stateful closures. scan
takes an iterator and can arbitrarily change how many elements are yielded (which no other adapter does, and may be too powerful and unweildy). This makes it useful in some niche cases.
@KrishnaSannasi
can arbitrarily change how many elements are yielded
This is exactly the part of the existing scan
implementation which I think is problematic. Its non-standard to not yield an output element for each input element (see https://en.wikipedia.org/wiki/Prefix_sum for the generally accepted definition). This makes it less generally applicable than otherwise.
You can certainly implement the standard scan
definition using map
and a stateful closure, but its much less obvious what is happening and requires application developers to implement more logic. scan
is a standard function in most modern languages.
In my case I'm laying out lines of text with word wrapping so I need to keep track of the y-offset of the end of the previous line and yield the character offset of where to split each line. The wikipedia article I linked lists other applications.
Building on the naming in the wikipedia article, maybe we could add inclusive_scan
and exclusive_scan
and deprecate scan
.
If we look at the different map and fold methods that exist, we can infer how scan methods should behave and be named.
Regular mapping :
fn map<B, F>(self, f: F) -> Map<Self, F>
where
F: FnMut(Self::Item) -> B
Filter mapping (F returns an Option<B>
. When encountering None
, the value is skipped and the mapping goes on):
fn filter_map<B, F>(self, f: F) -> FilterMap<Self, F>
where
F: FnMut(Self::Item) -> Option<B>
While Mapping (F returns an Option<B>
. When encountering None
, the mapping ends):
fn map_while<B, P>(self, predicate: P) -> MapWhile<Self, P>
where
P: FnMut(Self::Item) -> Option<B>
Seeded folding:
fn fold<B, F>(self, init: B, f: F) -> B
where
F: FnMut(B, Self::Item) -> B
First-seeded folding:
fn fold_first<F>(self, f: F) -> Option<Self::Item>
where
F: FnMut(Self::Item, Self::Item) -> Self::Item
A scan is literally meant to map each step of a folding operation. As it's somehow a combination of both operations, the concept of "scanning" might be any combination of their variants. If we stick to consistency in the naming, the possible scanning methods would be :
Seeded | First as seed | |
---|---|---|
Regular |
~~~rust
fn scan |
~~~rust
fn scan_first |
Filter |
~~~rust
fn filter_scan |
~~~rust
fn filter_scan_first |
While |
~~~rust
fn scan_while |
~~~rust
fn scan_first_while |
(note that I'm not 100% certain that the _first
versions actually need to require a function returning a Self::Item
. They might be implemented requiring functions returning any B
with a Self::Item
accmulator internally, but it's non-trivial and I've done enough research one today ^^)
We can see from this table that the current behavior of scan is the behavior we could expect of scan_while()
. It's made even more obvious by the observation that map_while()
is just a particular case of the current implementation of scan()
(where you just ignore the accumulator) : https://github.com/rust-lang/rust/issues/68537#issuecomment-650792583.
It is indeed very hard to change the current definition of scan as it would completely break existing code. Changing the naming convention could be the way to get around the problem (and indeed depreciating scan()
). Additionally, if we want to keep some form of consistency, it would also be needed to change the name of fold()
(and depreciate the current one). Here's a proposal :
fold_seed() // currently: fold()
fold_first() // currently: fold_first(), no change here
scan_seed() // no method in rust
scan_first() // no method in rust
filter_scan_seed() // no method in rust
filter_scan_first() // no method in rust
scan_seed_while() // currently: scan()
scan_first_while() // no method in rust
Obviously, not all of them are not needed. Some of them can be easily expressed using other ones (see the Wikipedia article above for details). Some of them have very few practical use and might not need to be implemented in Rust.
Though, I strongly agree with the original concern of the issue : having as the only available implementation of scan the one we currently have is quite terrible : more often than not, the "end scanning when F returns None
" behavior is unrevelant for the current use case and sometimes it's even getting in the way of what's being done, resulting in quite unreadable code like the one described by @mydogisbox or even worst if you're trying to achieve a filter_scan()
.
PS : I'm not sure how "seeded" and "first-seeded" relate to "inclusive" and "exclusive" scanning/folding. It seems to me that they're different concepts, and the Wikipedia page seems to disagree with me but in a very unclear way, so I chose not to confuse the two. They might be the same thing, IDK
@Mayt38 great points!
I was trying to implement a function which accepts an iterator and reduces adjacent repeating elements into only one element. Say we have [1, 1, 2, 3, 4, 4, 5]
, and I want the output to be [1, 2, 3, 4, 5]
. I found that I had to resort to the nested Option
solution. I really think that it is a great point that we should have scan not fusing the iterator. The current implementation violates the Single Responsibility Rule and really leads to confusion in that it behaves different from the common scan
's as in other languages.
Besides the fact that the current scan
should actually be called scan_filter
, I don't like that the closure takes a mutable reference. Aren't iterators all about avoiding mutability?
The proposed scan
(without _while
) would be equivalent to
let mut state = init;
iter.map(move |x| f(state, x));
This is at least as simple and pretty as scan
(map
and its closure both take one argument while scan
and its closure require two arguments which might get ugly), and more importantly, it doesn't hide the fact that there is some kind of mutability. Maybe that's a reason why a "simple" scan never made it to the standard library. However, after introducing map_while
, the current scan
could also be replaced in this way. In my opinion, scan
becomes quite useless.
Let's consider Haskell's scanl
. It is very similar to folding but it preserves all intermediate accumulator states and creates a new list of it. Implementing this in Rust gets more complicated.
First try:
let mut state = init;
iter.map(move |x| {
state = f(state, x);
state
}))
...but this does only work for copy types and it differs from Haskell's scanl
: The first item of scanl
's resulting list is always the init
element (so that it has one more item than the original list).
My next try, now passing the state to f
by reference:
let mut state = init;
iter.map(move |x| {
let new_state = f(&state, x);
std::mem::replace(&mut state, new_state)
})
...but now the last item is missing and we cannot simply get it out of the closure's state. Unless anyone else has a better idea, you would have to create your own iterator struct to emulate the behaviour of Haskell.
Haskell is a well-thought-out language and I think that scanl
is useful, in contrary to Rust's current scan
. So if you consider deprecating and replacing scan
, you might as well take these thoughts into consideration.
@Johannesd3 I didn't even think about how odd it is to have the state being mutable. Good point.
Aren't iterators all about avoiding mutability?
I disagree with this, iterators are about being declarative instead of imperative, mutability is an orthogonal concern. Rust doesn't shy away from mutability like in more functional languages, because sometimes mutability is required for performance or making the algorithm easier to understand. Limiting mutability artificially makes apis harder to use in Rust.
For example: Vec::retain
could give you &mut T
, but it doesn't, it gives you a &T
. This makes it painful to use. So painful, that other Vec
-like abstractions changed their retain to give a &mut T
(SmallVec::retain
). This will eventually be "fixed" by drain_filter
, which does give you a &mut T
, whenever that stabilizes. Because of this, I don't believe chasing after immutability, when it is easy to allow mutability is useful.
All that said, I also agree that the current implementation of Scan
is pretty niche and not fit for std.
Let's consider Haskell's scanl. It is very similar to folding but it preserves all intermediate accumulator states and creates a new list of it. Implementing this in Rust gets more complicated.
scan
is useless in Rust precisely because Rist has mutability, and is useful in Haskell precisely because it lacks mutability. In Rust map
can do pretty much everything that scanl
does in Haskell, so it doesn't seem useful to have an additional scan
functionality. Mutability isn't bad, unconstrained mutability is bad, and Rust provides the tools to constrain mutability without making it useless.
I wasn't claiming that scanl
in Rust would be super useful. But I think it is at least more useful than the current scan
.
One example: Creating an iterator that yields factorials using num_bigint
.
Using scan
:
iter::once(one()).chain((1..).scan(one(), |state: &mut BigUint, x: usize| {
*state *= x;
Some(state.clone())
}))
Using map
makes this not worse in my opinion:
let mut fac = one();
(0..).map(move |n: usize| {
if n > 0 {
fac *= n;
}
fac.clone()
})
And using a Haskell-like myscan
:
myscan(1.., one(), |state: &BigUint, x: usize| x * state)
Link to playground, containg the implementation of myscan
.
But the last one has two disadvantages: 1. The next factorial is calculated one step too early. 2. It is noticeably slower, but I am not sure why.
So at least there seems to be a reason why the way with the mutable reference was chosen. (However, I am curious what exactly makes it so much slower, even though it works without clone
)
EDIT: Wrong link
@RustyYato the more general scan is useful in rust when you want to chain iterator functions like (actual example from my code):
font.layout(todo, *scale, point)
.enumerate()
.scan(0., |row_offset, (index, item)| {
let item_offset = item
.pixel_bounding_box()
.map_or_else(|| item.position().x, |x| x.max.x as f32);
if index == 0 {
Some(Some(0))
} else if item_offset - *row_offset <= (WIDTH as f32) {
Some(None)
} else {
*row_offset = *row_offset + item_offset;
Some(Some(index))
}
})
.flatten()
.chain(once(todo.len()))
.collect::<Vec<usize>>()
.windows(2)
.map(|indices| &todo[indices[0]..indices[1]])
.collect::<Vec<&'a str>>()
In this example using map
with a mutable collection would be much less readable than using scan
due to how it changes the scope of what is needed to understand a particular step in the chain.
retain
seems like a poor example for why mutability would be a good thing generally since its an inherently mutating operation in the first place, whereas map
, scan
etc are not.
I'm not against mutability, but I do think that separating mutable operations from non-mutable operations improves readability because expectations are more clear. If I wanted to both transform each element and update a collection I would just use a for
loop. This is how its done in other languages I've used that have mutability controls (e.g. F#).
Whether this is done conventionally or enforced by the standard library definitions is, however, a different question. Maybe there is a case to be made that people ought to be allowed to mutate at will for performance or other reasons.
@Johannesd3
The perf difference is probably because it's cheaper to do *=
on BigInt
then clone than to multiply &BigInt
by usize
.
This is because the former can reuse the allocation, while the latter cannot. This means that *=
on average does less reallocations, which is a perf improvement. The clone
is really fast because it only needs to do one allocation because the size is known ahead of time. As opposed to &BigInt * usize
, which lowers to big_int.clone() * usize
, which needs to reallocate more than *=
on average.
@mydogisbox
the more general scan is useful in rust when you want to chain iterator functions like (actual example from my code):
If every return of the Scan
is Some(_)
, then you can replace it with a map
, with no leakage into the outer scope like so:
I use this pattern when spawning threads, which is another place where you want some stuff stuffed into the closure, but not the containing scope. There are a number of ways to format this, I chose this way because it most closely mirrors scan
, but normally I put the let
binding on its own line.
retain seems like a poor example for why mutability would be a good thing generally since its an inherently mutating operation in the first place, whereas map, scan etc are not.
Sure, retain itself is a mutable operation, but most of the time you don't need to mutate the elements of the Vec
while the retain is in progress. But it does come up, and when it does the fact that retain
doesn't provide the more permissive API is frustrating.
I would argue that scan is a mutable operation, you're providing a state that is packaged with the iterator that will be mutated on each iteration. I can see map
as being normally immutable.
@RustyYato Instead of using fold
to determine the n-th factorial, would it be a faster to implement it with *=
? More generally: Are there cases in which fold
is not the fastest option? In that case, would it be fine to sacrifice some performance of scan
if it increases the consistency between fold
and scan
?
There was recently an internals thread on exactly this topic here. There it was also shown that for_each
can be more performant than fold
, because for_each
mutates in place, where as fold
needs to move things around, so even there minimizing copies is enough to get a perf improvement.
I don't know how you can change scan
without breaking back-compat.
I wasn't aware of that topic, but it is quite related. I'm not sure whether to continue the disscussion here or there.
However, one interesting point from the discussion: Map
still implements DoubleEndedIterator
, so it does not guarantee that it is consumed in the right order. That's a reason why scan
could make sense.
When I talk about "replacing" scan
I think of adding a new function similar to scan
with different name.
I wasn't aware of that topic, but it is quite related. I'm not sure whether to continue the disscussion here or there.
I would just continue here
However, one interesting point from the discussion: Map still implements DoubleEndedIterator, so it does not guarantee that it is consumed in the right order. That's a reason why scan could make sense.
Another option is to add a combinator that removes the DoubleEndedIterator
impl so that you can get that guarantee, but that doesn't seem worth it. But in general, I don't see how this can be a problem. In particular, either you consume the iterator where you create it, then you know when not to call rev
, or you return the iterator, which usually means returning impl Iterator<Item = ???>
, which prevents rev
anyways. Without specialization, there isn't a way to get around this, and with specialization, there are many other ways to break traits, so you need to be careful anyways.
When I talk about "replacing" scan I think of adding a new function similar to scan with different name.
Ahh, that makes more sense.
There was some suggestions about scan
replacement on internals: scan
is too complicated, suggestion for new iterator methods (There also was useful conversations and ideas). Here is the text from this topic:
AFAIK scan
iterator method is the only iterator method that provides a mechanism to share some mutable state across iterations. But I think this method is too complicated and wide, therefore it should be replaced by more convenient and simple methods.
Let's look at scan
usage example:
let a = [1, 2, 3];
let mut iter = a.iter().scan(1, |state, &x| {
// each iteration, we'll multiply the state by the element
*state = *state * x;
// then, we'll yield the negation of the state
Some(-*state)
});
assert_eq!(iter.next(), Some(-1));
assert_eq!(iter.next(), Some(-2));
assert_eq!(iter.next(), Some(-6));
assert_eq!(iter.next(), None);
Observations:
state
, we can't get it from the first iteration.state
is provided by &mut
reference, so we must change it by using *state = ...
syntax, not by returning a value. So, we always need curly brackets. But it's a powerful tool: we can compute a return value before changing state, or after.Some()
of next value. And by this we do as same as filter_map
. Because of this Scan
doesn't implement ExactSizeIterator
, so we can't use it to allocate exact space when using the collect
method.I think:
state
should be changed not by mutable reference, but by function Fn(State, &Item) -> State
, because it is more simple and don't require {}
in simple cases.filter_map
or map
or filter
behavior inside this iterator, it should be done by the next iterator adaptors.So, I suggest these new methods to replace scan
: state_before
, state_after
with equal signatures.
Example:
let a = [1, 2, 3];
assert_eq!(a.iter().state_before(None, |state, &item| state * item).eq(&[(1, 1), (2, 2), (6, 3)]));
assert_eq!(a.iter().state_before(Some(4), |state, &item| state * item).eq(&[(4, 1), (8, 2), (24, 3)]));
assert_eq!(a.iter().state_after(None, |state, &item| state * item).eq(&[(1, 1), (1, 2), (2, 3)]));
assert_eq!(a.iter().state_after(Some(4), |state, &item| state * item).eq(&[(4, 1), (4, 2), (8, 3)]));
None
, then the initial value is yielded from the first iteration. If it is Some(...)
, then it initialized by the inner value.Fn(State, &Item) -> State
. That function receives current state and current item by reference and it must return next state value.state
because these methods provide state across iterations.before
means that the state will be modified before returning the element. Suffix after
means that the state will be modified after returning the element.Item = (State, OldItem)
by analogy as enumerate
.ExactSizeIterator
.filter_map
next and get the same behavior as a scan
.What do you think?
I used scan_before
and scan_after
in my code for competitive programming to find maximum in prefix sum array:
let max1 = r.iter().scan_after(0, |acc, x| acc + *x).map(|(x, _)| x).max().unwrap();
Mostly it's useful for something like creating or searching in prefix sum/max arrays.
@optozorax IMHO using integers in your example is also a bit specialized. The abstraction provided by scan
should be general enough to play well with the overall ownership model.
What's annoying to me is that since scan
ning produces a list of (owned) partial sums, where do those ownerships come from? (In the current version of scan
you need to explicitly clone, which is quite clear; in fold
however, you are free from such problems since you're merging multiple ownerships into one).
Main problem with current scan or map with external state is that the mutation of state usually returns ()
which forces to use multiline closure with {}
block, which leads to ugly code. Simple processing like prefix sums mentioned above should be simple and elegant like iter.map_scan(0, |acc, x| acc + x)
, not 5 lines of curly braces and parenthesis. Otherwise it defeats the whole purpose of these helper functions, might as well just use for
everywhere, it has maximum possible flexibility (and hence unreadability).
scan
feels awkward to me:
scan
forces the closure to maintain mutable state, which always decreases code readability.scan
should have good composability, but it does too many things. It should return internal states and then caller could use map
to produce an iterator that returns results.I just want an auxiliary function with a signature as simple as fold
to generate prefix sums, but scan
is completely unsuitable.
Prefix sums, or more generally Haskell's scanl
, is easier to write using successors
instead of scan
but it would be nice to have it available on its own. It does seem to be relatively niche, so I have suggested it to the itertools crate here. See also another discussion at the users forum.
Here is the existing type signature for std::iter::scan:
fn scan<St, B, F>(self, initial_state: St, f: F) -> Scan<Self, St, F> where F: FnMut(&mut St, Self::Item) -> Option<B>,
If I were to use a verbose name for
scan
as implemented now, I would call itmap_with_state_until
, this means that if I just wantmap_with_state
, things can get really awkward using the existing scan if I want to return an option which doesn't terminate the Iterator forcollect
etc. e.g.vecter.iter().scan( |state, x| if x > 1 then Some(None) else Some(Some(x)) ).flatten()
If we instead have
F
just turnB
instead ofOption<B>
like:fn scan<St, B, F>(self, initial_state: St, f: F) -> Scan<Self, St, F> where F: FnMut(&mut St, Self::Item) -> B,
then that becomes a lot more simple:vecter.iter().scan( |state, x| if x > 1 then None else Some(x) )
and achieving the existing
scan
behavior is trivially achieved by simply addingfuse()
.This allows
scan
to have behavior more in-line with similar functions likemap
andfold
:and also brings it more in-line with how other languages define
scan
:(a -> b -> a) -> a -> [b] -> [a]
(Haskell),('State -> 'T -> 'State) -> 'State -> seq<'T> -> seq<'State>
(F#) etc.I think this also gives a clear way forward to addressing other issues which previously ended up unresolved like: https://github.com/rust-lang/rust/pull/14425
So I think there ought to be
scan
as defined above, andscan_until
which is the existing implementation.References: http://zvon.org/other/haskell/Outputprelude/scanl_f.html
https://msdn.microsoft.com/en-us/visualfsharpdocs/conceptual/seq.scan%5b't,'state%5d-function-%5bfsharp%5d