Vector#iterator behaves wrongfully when calling take then toVector or toSet on it

alexarchambault commented 1 year ago

Reproduction steps

Scala version: all 2.13 as of writing this (from 2.13.0 up to 2.13.11)

To reproduce, run the following Scala CLI app:

//> using scala 2.13.11
//> using dep "com.lihaoyi::pprint:0.8.1"

object Test {
  def main(args: Array[String]): Unit = {
    val vec = (1 to 10).toVector
    pprint.err.log(vec)
    val it = vec.iterator
    pprint.err.log(it.take(2).toVector)
    pprint.err.log(it.take(2).toVector)
    pprint.err.log(it.take(2).toVector)

    val it0 = vec.iterator
    pprint.err.log(it0.take(2).toSet[Int])
    pprint.err.log(it0.take(2).toSet[Int])
    pprint.err.log(it0.take(2).toSet[Int])
  }
}

$ scala-cli run Test.scala
Test.scala:7 vec: Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Test.scala:9 it.take(2).toVector: Vector(1, 2)
Test.scala:10 it.take(2).toVector: Vector(1, 2)
Test.scala:11 it.take(2).toVector: Vector(1, 2)
Test.scala:14 it0.take(2).toSet[Int]: Set(1, 2)
Test.scala:15 it0.take(2).toSet[Int]: Set()
Test.scala:16 it0.take(2).toSet[Int]: Set()

Problem

I would expect the same output as in 2.12.18, that looks correct:

$ scala-cli run Test.scala --scala 2.12.18
Test.scala:7 vec: Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Test.scala:9 it.take(2).toVector: Vector(1, 2)
Test.scala:10 it.take(2).toVector: Vector(3, 4)
Test.scala:11 it.take(2).toVector: Vector(5, 6)
Test.scala:14 it0.take(2).toSet[Int]: Set(1, 2)
Test.scala:15 it0.take(2).toSet[Int]: Set(3, 4)
Test.scala:16 it0.take(2).toSet[Int]: Set(5, 6)

Workaround

One can call .map(identity) on the iterator to recover an iterator returning correct results:

//> using scala 2.13.11
//> using dep "com.lihaoyi::pprint:0.8.1"

object Test {
  def main(args: Array[String]): Unit = {
    val vec = (1 to 10).toVector
    pprint.err.log(vec)
    val it = vec.iterator.map(identity)
    pprint.err.log(it.take(2).toVector)
    pprint.err.log(it.take(2).toVector)
    pprint.err.log(it.take(2).toVector)

    val it0 = vec.iterator.map(identity)
    pprint.err.log(it0.take(2).toSet[Int])
    pprint.err.log(it0.take(2).toSet[Int])
    pprint.err.log(it0.take(2).toSet[Int])
  }
}

$ scala-cli run Test.scala
Test.scala:7 vec: Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Test.scala:9 it.take(2).toVector: Vector(1, 2)
Test.scala:10 it.take(2).toVector: Vector(3, 4)
Test.scala:11 it.take(2).toVector: Vector(5, 6)
Test.scala:14 it0.take(2).toSet[Int]: Set(1, 2)
Test.scala:15 it0.take(2).toSet[Int]: Set(3, 4)
Test.scala:16 it0.take(2).toSet[Int]: Set(5, 6)

som-snytt commented 1 year ago

https://www.scala-lang.org/api/2.13.11/scala/collection/Iterator.html

It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext.

Welcome to Scala 2.13.11 (OpenJDK 64-Bit Server VM, Java 20.0.1).
Type in expressions for evaluation. Or try :help.

scala> 1.to(10).toVector
val res0: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> .iterator
val res1: Iterator[Int] = <iterator>

scala> .grouped(2)
val res2: res1.GroupedIterator[Int] = <iterator>

scala> .next()
val res3: Seq[Int] = ArraySeq(1, 2)

scala> res2.next()
val res4: Seq[Int] = ArraySeq(3, 4)

scala>

nafg commented 1 year ago

@som-snytt if someone wants to contribute an improvement wouldn't that be welcome?

som-snytt commented 1 year ago

@nafg I don't know how to improve the behavior of unspecified API, but anything deemed an improvement is usually welcome.

Sometimes a reviewer will say, Well, it's an improvement.

SethTisue commented 1 year ago

@nafg The problem is inherent to the API of Iterator — it's an API for one-shot iteration. So speaking broadly, we can't do anything. There are some specific situations where it's possible to be less surprising within the boundaries of what the API permits, but it isn't clear to me this is one of those. If someone thinks it is, then yes, pull request welcome, absolutely.

(Regardless, keeping the issue closed since this is already working as designed and documented.)

lrytz commented 1 year ago

IMO we should not spend time changing the code / reviewing changes to "improve" this. Client code that relies on using iterators multiple times is "wrong", it can break any time (new Scala version, changing to a different underlying collection).

nafg commented 1 year ago

If someone opened a pull request that made the behavior more consistent or less surprising you'd consider reviewing it a waste of time?

lrytz commented 1 year ago

I'd say so, yes. I mean we might review and merge it, but should it come with a unit test? IMO it doesn't make any sense to enforce / guarantee unspecified behavior.

sjrd commented 1 year ago

It's also important to understand that this lack of guarantee has a purpose: it allows Iterators to be efficiently implemented. Most efforts to make Iterators "more consistent" make them slower or take up more code. If you want something that you can use multiple times, use an Iterable. That's what the distinction is for.

nafg commented 1 year ago

Right, it shouldn't have a test, but if it doesn't break anything and enough people agree it's an improvement then why not.

I don't actually care about the behavior, I just feel like the fewer things there are for people to grumble about Scala on Twitter, the better. I know you people are really busy but closing tickets that curtly gives people the wrong taste in their mouth, and it's part of a general sentiment I've heard too many times. I know it's unfair, no one did anything wrong here, but I think a different message would be better for "PR" so to speak.

Anyway, I wish there was a way to make the dangers of iterators more obvious to anyone using them.

nafg commented 1 year ago

Most efforts to make Iterators "more consistent" make them slower or take up more code.

That's a good point.

(Although, is "take up more code" a bad price to pay if someone else were to write it? I guess it depends how much...)

sjrd commented 1 year ago

(Although, is "take up more code" a bad price to pay if someone else were to write it? I guess it depends how much...)

It still has to be reviewed and maintained. The "someone else"s rarely maintain stuff they contribute, unless they are active contributors in a repo.

Also, it's bad for Scala.js users. ;)

som-snytt commented 1 year ago

for people to grumble about Scala on Twitter

checking my feed

That would have been legit for Javascript, but for Scala, seriously?

I take it this is French for my code is wrong.

Calling 'take' on an iterator can screw the iterator however it wants

We can update the Scaladoc with that explanation.

This project has a history of remaining bugward compatible, but there is also a virtue in surfacing bugs.

make the dangers of iterators more obvious

A lint would be very helpful, as this is a timeless issue.

eed3si9n commented 1 year ago

I wonder if the compiler can prevent this by shadowing it after the first use, like a poor person's borrow checker.

Ichoran commented 1 year ago

I suppose there could be an annotation that warns on subsequent use. @terminalOperation or somesuch (to use Java's Stream lingo). I suppose you might also need an @validWhenTerminated annotation also on some methods, and/or @uncheckedTermination(x) to suppress the warning.

But I really don't think that multi-use iteration should be a goal. In general it can't possibly work, and in the cases where it kind of works, it often slows down performance and gives the wrong idea about whether it works in general.

It might be nice for Scala 3 at least if we had some helper methods to consume parts of a live iterator, leaving the iterator in whatever state you'd expect if you'd done the consuming manually.

For instance,

val b = Vector.newBuilder
for i <- 0 until 5 if itr.hasNext do b += itr.next
b.result

is a totally reasonable thing to do, and it's pretty unreasonable that you have to write that down if you want your vector live afterwards. Whether we do this as Vector.buildFrom(it, 5) or it.consumeTo[Vector](n) doesn't really matter, but having Iterator's interior mutability be so unpleasant to use is unfortunate.

But I don't think we should make take(n) work like splitAt(n). That's why we have splitAt(n). It's just kinda awkward to use and not very efficient.

(It would be less awkward if Scala had match-interior binding and var-assignment. (val xs, itr) = itr.splitAt(n) still has iffy ergonomics, but it's not nearly as bad as val (xs, itr2) = itr.splitAt(n) // please never touch itr again!!!)

SimY4 commented 1 year ago

Hi, sorry I just saw this issue in my feed.

just wanted to double-check why do you think this is behaviour is undefined?

isn’t Iterator’s usage contacts revolve around next and hasNext methods when one should guarantee to fail with NoSuchElementException when the latter returns true?

Isn’t this constraint broken here? Or anywhere where calling finaliser second time didn’t result in exception?

nafg commented 1 year ago

isn’t Iterator’s usage contacts revolve around next and hasNext methods when one should guarantee to fail with NoSuchElementException when the latter returns true?

Yes

Isn’t this constraint broken here?

No. The question is about expectations about scala collections methods on Iterator, about when they call next/hasNext.

Or anywhere where calling finaliser second time didn’t result in exception?

What do you mean?

Ichoran commented 1 year ago

@SimY4 - The behavior is undefined because there is no contract for how take utilizes next and hasNext. The only contract is that take results in a new iterator, and the old iterator is invalidated for further use.

As a practical matter, the API tends to make a best-effort to make calls like that lazy, but even that isn't guaranteed unless the API specifically says so. So a perfectly reasonable implementation of take(n) for an iterator is

def take(n: Int): Iterator[A] =
  val b = List.newBuilder[A]
  var i = 0
  while i < n && hasNext do
    b += next 
    i += 1
  b.result.iterator
// Original iterator has advanced n places when we return

But this is too:

// in Iterator
def take(n: Int): Iterator[A] =
  new TakeIterator(this, n)
  // Original iterator unchanged--packaged into a new class now!

// TakeIterator
class TakeIterator[+A](underlying: Iterator[A], private var n: Int) extends Iterator[A] {
  private var i = 0
  def hasNext: Boolean = i < n && underlying.hasNext
  def next: A =
    if i >= n then throw NoSuchElementException(s"Already consumed $n elements")
    else
      i += 1
      it.next
  override def take(m: Int): Iterator[A] =
    n = n min m
    this  // Look!  We mutated and returned OURSELVES!  This is totally cool!
}

The only guarantee is that once you call itr.take(n), you no longer have any guarantees about itr

SimY4 commented 1 year ago

@nafg just thought about it, it finalisers should probably return an empty container because hasNext should be called prior to next.

The two things around iterators that are documented explicitly:

next method throwing exception
Iterators being mutable

so, collecting an Iterator should traverse it until exhausted and any subsequent collectors should result in empty

SimY4 commented 1 year ago

@Ichoran shouldn’t iterators be cursors for underlying iterable? Taking this shortcut is unexpected since I could’ve had an iterable over an input stream.

also, does this meant that calling transformers on an Iterator could result in side effects?

Ichoran commented 1 year ago

@SimY4 - A Cursor would be a cursor for an underlying collection; they're not part of the standard library, though. Maybe they should be, if we could come up with a powerful and simple general-purpose interface.

If the Iterator is itself side-effecting (e.g. Iterator.iterate(0){ i => println(i); i+1 }) then yes, you might get side-effects.

Iterator is meant to be the most general possible interface for visiting data at most once. If you want additional guarantees, you need to enforce them manually through careful use of next and hasNext, or use something else. Collections methods make a best effort to evaluate lazily, but they don't guarantee it. If you want something that promises to be lazy, use LazyList.

SimY4 commented 1 year ago

@Ichoran I'll elaborate what I meant. Yes, iterators can be side effecting but if you're not calling to next or hasNext I have an intuition that no side effects will be performed. I.e. in your example:

val it = Iterator.iterate(0){ i => println(i); i+1 } // this should never print anything to the console
// but
it.hasNext // might print to the console ONCE

Iterator is meant to be the most general possible interface for visiting data at most once

Yes, I agree. And even though, I can have a hot iterable that produces elements on the go, and an iterator contract for next and hasNext should respect that, don't you think? I.e.

val it = producingIterator: Iterator<A> = ??? // like reading from a file that another process writes to

val firstSlice = it.toList // first slice of running iterator
val secondSlice = it.toList // second slice of running iterator starting from where the first one ended.

In this case all elements are seen only once because the iterator never revisits the ones that have already been processed. This would be the major difference between Iterators and LazyLists

Ichoran commented 1 year ago

@SimY4 - I don't understand your point.

Iterator.iterate produces a lazily instantiated iterator. It is part of the contract of iterate that the iterator generates is lazy. There is no contract that Iterator.iterate(z)(f).take(n) won't immediately create the n elements. It wouldn't be as useful if it did, and it turns out that in practice it doesn't, but it's not actually part of the contract. Maybe it should be.

next and hasNext can be used repeatedly; everything else is only guaranteed to work once. You could implement all kinds of different behaviors, including your producingIterator. The only thing you must do is make sure that if hasNext returns true, then the next call to next must return a value.

In general, though, the weirder the iterator the less likely it is to work properly with all code that might make assumptions about, for instance, getting the same answer on successive calls to hasNext.

SimY4 commented 1 year ago

My points are:

Iterators behaviours can be clearly defined - its weird to say that reuse of an Iterator is a UB.
OP showcased a few issues with iterators implementation where Iterators behaviours are inconsistent.

I think its worth spending some time on streamlining this since Iterators are more or less at the core of Scala standard library.

som-snytt commented 1 year ago

@SimY4 your points are incorrect in the current understanding of the API.

For your example of "taking a snapshot", there are methods such as span, which returns a prefix and suffix iterator.

My previous example was, if you want to take two at a time, use grouped(2).

Iterator behavior is clearly defined: it's specified that you can't reuse the iterator after any method but hasNext and next.

It would be nice if the compiler told you if you violate that contract.

I would compare iterators with futures -- a brittle API which might enjoy a better foundation.

So I disagree with your second bullet item. As with any API, you must not rely on implementation details that are not specified.

I don't mean to rule out some different behavior yet to be invented. But I think it is unfair to characterize the current API as broken when it is working as advertised.

Edit: I did not mean to say it's unfair to say an API is broken if it is unsatisfactory by design.

eed3si9n commented 1 year ago

@Ichoran wrote:

I suppose there could be an annotation that warns on subsequent use.

I have two PRs exploring this idea.

@onceop at method level - https://github.com/scala/scala/pull/10427
@affine at class level - https://github.com/scala/scala/pull/10428

scala / bug