scala / bug

Scala 2 bug reports only. Please, no questions — proper bug reports only.
https://scala-lang.org
232 stars 21 forks source link

TreeSet.map losing its ordering when used as Iterable #11987

Open jrudolph opened 4 years ago

jrudolph commented 4 years ago

(Seems like this has to be a well-known issue/puzzler but I couldn't find one with superficial searching)

It seems that TreeSet is not working as expected when it is used as an Iterable.

Welcome to Scala 2.13.1 (OpenJDK 64-Bit Server VM, Java 1.8.0_252).

scala> import scala.collection.immutable.TreeSet
import scala.collection.immutable.TreeSet

// as expected
scala> TreeSet[Long](5,3,2,12,38,2,6,2).map(identity).mkString(", ")
res0: String = 2, 3, 5, 6, 12, 38

// unexpected
scala> (TreeSet[Long](5,3,2,12,38,2,6,2): Iterable[Long]).map(identity).mkString(", ")
res1: String = 5, 6, 38, 2, 12, 3

The underlying assumption is that for all i: Iterable[T] where the implementing collection is sorted this property should be valid:

i.toList == i.map(identity).toList

Without the constraint "where the implementing collection is sorted", it is not expected to be true because Iterable can be implemented for unsorted collections.

Above I used the vague phrase "not working as expected" because the behavior is not explicitly specified (TreeSet does not specify how it implements Iterable which is a common problem of inheritance) and in absence of a specification it should do the least surprising thing.

Was observed here: https://github.com/spray/spray-json/issues/329

swsnr commented 4 years ago

I'd like to point out that i.toList == i.map(identity).toList wouldn't help with the spray-json issue you linked, because the affected code in spray-json maps into another domain (JSON values) so it's not really .map(identity).

For spray json you'd actually need something along the lines of "for all pure functions f: A => B and all i:

i.toList.map(f) == i.map(f).toList

In other words map should preserve the ordering of the source domain A and not order by the ordering of the target domain B.

I'm not sure whether that's better than the current behaviour: In terms of B the collection would still be unordered, just in a different way than it is now.

That said having (TreeSet[Long](5,3,2,12,38,2,6,2): Iterable[Long]).map(f) move to a hashed Set which has no real order at all is somewhat surprising :slightly_smiling_face:

jrudolph commented 4 years ago

I'd like to point out that i.toList == i.map(identity).toList wouldn't help with the spray-json issue you linked, because the affected code in spray-json maps into another domain (JSON values) so it's not really .map(identity).

There are two map methods on TreeSet, one that takes an ordering of the codomain to return another TreeSet. Then the one inherited from Iterable. I'm only talking about the one inherited from Iterable.

There's no way to tell the map method inherited from Iterable what the ordering of the codomain should be so it returns a HashSet.

I guess the reason it was done that way was to keep compatibility with the previous CanBuildFrom construction from Scala 2.12 and before. The point of CanBuildFrom was to return "the right thing implicitly" (which was hotly debated).

There are a few reasons that the current behavior might never change:

The workaround for Sets is to mostly ignore that Sets inherit from Iterable but in particular the domain changing methods like map and rather the use the methods from the iterator it returns. I guess there could be something like a Set.toSafeIterable that returns a plain Iterable on which methods will work without surprises.

def toSafeIterable[T](s: Set[T]): Iterable[T] = new Iterable[T] { def iterator: Iterator[T] = s.iterator }
SethTisue commented 4 years ago

in the same vein: https://github.com/scala/scala-parallel-collections/issues/101 (but the thing lost is parallelism instead of ordering)

swsnr commented 4 years ago

This problem exists on 2.12 and 2.13; we hit the original spray issue with both Scala versions.

The linked parallel collections issue is 2.13 only.

Jasper-M commented 4 years ago

IIUC the only fix for this would be to add overloads of map to Iterable just for this use case. Simplified:

def map[B](f: A => B): Iterable[B]
def map(f: A => A)(implicit d: DummyImplicit): Iterable[A]

Otherwise you can never distinguish between the case where you can keep the ordering and where you can't. If you want to keep getting a TreeSet back even for the A => B case you'd have to add the overload that takes an Ordering to Iterable which would make even less sense.

julienrf commented 4 years ago

The problem comes from the fact that TreeSet has several overloads of map:

Screenshot from 2020-05-11 09-25-33

When you call map, the compiler selects the overload that returns a TreeSet, because it is “more specfiic” by the overload resolution rules.

However, if you upcast the TreeSet to Iterable, then the compiler sees only one map operation, which returns an unsorted set.

On Scala 2.12, the situation was different but the result was the same (the source collection type was part of the CanBuildFrom argument to compute the resulting collection).

Unfortunately, I don’t see how we could fix this. I think we should live with this issue but maybe document it better?

jrudolph commented 4 years ago

There's also this issue:

scala> (TreeSet(5,3,2,12,38,2,6,2): Iterable[Int]).map(_ => 5).toList
res1: List[Int] = List(5)

So, there we have an it: Iterable[T] where this isn't true for all f:

it.size == it.map(f).size
julienrf commented 4 years ago

So, there we have an it: Iterable[T] where this isn't true for all f:

it.size == it.map(f).size

This property is not true for all iterables, only for seqs.

swsnr commented 4 years ago

Shouldn't iteration be size- and order-preserving? At least that's my perhaps naive intuitive expectation :shrug:

julienrf commented 4 years ago

This is not the case on Iterable, because sets and maps don’t necessarily have a specific iteration order.

swsnr commented 4 years ago

Still, I find it surprising if order gets lost for types which are ordered initially, and I'm also surprised that it.size == it.map(f).size doesn't hold for all iterables. Looks like I assumed a lot of things about iterables which aren't true in the general case; I'd love to see this documented in the scaladoc for Iterable.

Jasper-M commented 4 years ago

I find it surprising if order gets lost for types which are ordered initially, and I'm also surprised that it.size == it.map(f).size doesn't hold for all iterables

Though it would be impossible for both those expectations to hold. You can't have it.map(_ => 5) return a TreeSet and always preserve the size.

jrudolph commented 4 years ago

This is not the case on Iterable, because sets and maps don’t necessarily have a specific iteration order.

Technically, that's probably right because descriptions of the operators are vague enough to allow that behavior:

E.g. Iterable.map:

Builds a new iterable collection by applying a function to all elements of this iterable collection.

It only says that the function is called for every element of the source collection and that a new collection is built but not what the exact relationship between source and target collections are.

For Iterable.collect on the other hand, it does say that the order is preserved:

a new iterable collection resulting from applying the given partial function pf to each element on which it is defined and collecting the results. The order of the elements is preserved.

This is just false but shows how unclear the semantics of Iterable are even to the authors. Incidentally, filter and filterNot also disagree about whether ordering should be preserved or not.

That makes Iterable a pretty useless abstraction. It means you mustn't accept Iterable as arguments in your APIs or things will break eventually.

Once again, look at that implementation of toSafeIterable which contains the essence of iteration:

def toSafeIterable[T](s: Iterable[T]): Iterable[T] =
  new Iterable[T] { def iterator: Iterator[T] = s.iterator }

This looks like a no-op but it isn't and that tells a lot.

To add a possible explanation of the status quo: since Scala collections are eager, the problem with the generic methods that need to return another collection type are:

So for now, the safe option is to

julienrf commented 4 years ago

For Iterable.collect on the other hand, it does say that the order is preserved:

a new iterable collection resulting from applying the given partial function pf to each element on which it is defined and collecting the results. The order of the elements is preserved.

This is just false but shows how unclear the semantics of Iterable are even to the authors. Incidentally, filter and filterNot also disagree about whether ordering should be preserved or not.

Ouch, this is a problem, indeed!

That makes Iterable a pretty useless abstraction. It means you mustn't accept Iterable as arguments in your APIs or things will break eventually.

I disagree. As a user, you can still use fold-like operations on iterables without issues. Also, you can flatten a Set[List[Int]] into a Set thanks to the fact that both share a common collection type (IterableOnce, in this case).

I disagree with your conclusion. I think we should fix this problem by better documenting the behavior of Iterable (and fixing the errors in the current documentation), to clearly state that transformation operations on Iterable don’t preserve the order of elements.

jrudolph commented 4 years ago

I disagree. As a user, you can still use fold-like operations on iterables without issues. Also, you can flatten a Set[List[Int]] into a Set thanks to the fact that both share a common collection type (IterableOnce, in this case).

Maybe I wasn't accurate enough. Iterable may have some useful properties or features in its implementation. In that way it is useful. What I meant that accepting an Iterable as a library writer is not useful because it gives so little guarantees that I cannot safely interface with it. Mixing some useful with some dangerous operations doesn't help in that regard.

you can flatten a Set[List[Int]] into a Set

That's another example where the features of the concrete types (that happen to be implemented by Iterable) can be useful, but if you happen to come across an instance of such a type disguised as an Iterable[Iterable[Int]] or similar and you call flatten on this in a library you will be up for surprises.

NthPortal commented 4 years ago

What I meant that accepting an Iterable as a library writer is not useful

That tends to be correct, yes. Generally, you'll either want an IterableOnce (since you only need to iterate over it once to do whatever you're doing) or a more specific collection type (SeqMap, Set, ect.) because you want more specific properties. It is uncommon that the only property you care about is that it's iterable more than once.