knownSize vs hasDefiniteSize

MasseGuillaume commented 6 years ago

I was working on scala-collection-compat to add an extension method for knownSize and I notice a couple of inconsistencies.

This tables summarize a bunch of collection containers, their knownSize in 2.13.X, their hasDefiniteSize in 2.12.X if available (? otherwise).

For example:

Some(1) has knownSize = -1, I expect knownSize = 1. Ibid for cBitSet, etc
c.Iterator(1) has knownSize = 1, I expect knownSize = -1, this would be consistent with hasDefiniteSize = false

Short name	knownSize	hasDefiniteSize	Source
some	-1	true	Some(1)
none	0	true	None
option	-1	true	Option(1)
array	1	true	Array(1)
string	1	true	"a"

cBitSet	-1	true	c.BitSet(1)
cBufferedIterator	-1	false	scala.io.Source.fromString("a").buffered
cIndexedSeq	1	true	c.IndexedSeq(1)
cIndexedSeqVie	1	true	c.IndexedSeq(1).view
cIterable	-1	true	c.Iterable(1)
cIterableOnce	-1	?	(c.Iterable(1): c.IterableOnce[Int])
cIterator	1	false	c.Iterator(1)
lazyZip2	-1	?	(List(1) lazyZip List(2))
lazyZip3	-1	?	(List(1) lazyZip List(2) lazyZip List(3))
lazyZip4	-1	?	(List(1) lazyZip List(2) lazyZip List(3) lazyZip List(4))
cLinearSeq	-1	true	c.LinearSeq(1)
cMap	1	true	c.Map(1 -> 1)
cMapView	1	true	c.Map(1 -> 1).view
cSeq	-1	true	c.Seq(1)
cSeqView	-1	true	c.Seq(1).view
cSet	1	true	c.Set(1)
cSortedMap	-1	true	c.SortedMap(1 -> 1)
cSortedSet	-1	true	c.SortedSet(1)
cView	-1	?	(c.Seq(1).view: c.View[Int])

iArraySeq	1	?	i.ArraySeq(1)
iBitSet	-1	true	i.BitSet(1)
iChampHashMap	1	?	i.ChampHashMap(1 -> 1)
iChampHashSet	1	?	i.ChampHashSet(1)
iHashMap	-1	true	i.HashMap(1 -> 1)
iHashSet	-1	true	i.HashSet(1)
iIndexedSeq	1	true	i.IndexedSeq(1)
iIntMap	-1	true	i.IntMap(1 -> 1)
iIterable	-1	true	i.Iterable(1)
iLazyList	-1	?	i.LazyList(1)
iLinearSeq	-1	true	i.LinearSeq(1)
iList	-1	true	i.List(1)
iListMap	-1	true	i.ListMap(1 -> 1)
iListSet	-1	true	i.ListSet(1)
iLongMap	-1	true	i.LongMap(1L -> 1)
iMap	1	true	i.Map(1 -> 1)
iNil	0	true	i.Nil
iNumericRange	1	true	1L to 1L
iQueue	-1	true	i.Queue(1)
iRange	1	true	1 to 1
iSeq	-1	true	i.Seq(1)
iSet	1	true	i.Set(1)
iSortedMap	-1	true	i.SortedMap(1 -> 1)
iSortedSet	-1	true	i.SortedSet(1)
iStream	-1	false	i.Stream(1)
iTreeMap	-1	true	i.TreeMap(1 -> 1)
iTreeSet	-1	true	i.TreeSet(1)
iVector	1	true	i.Vector(1)
iVectorIterator	1	false	i.Vector(1).iterator
iWrappedString	1	?	i.WrappedString('a')

mAnyRefMap	-1	true	m.AnyRefMap(Nil -> 1)
mArrayBuffer	1	true	m.ArrayBuffer(1)
mArrayBufferView	1	true	m.ArrayBuffer(1).view
mArrayDeque	1	?	m.ArrayDeque(1)
mArraySeq	1	?	m.ArraySeq(1)
mBitSet	-1	true	m.BitSet(1)
mBuffer	1	true	m.Buffer(1)
mHashMap	-1	true	m.HashMap(1 -> 1)
mHashSet	-1	true	m.HashSet(1)
mIndexedSeq	1	true	m.IndexedSeq(1)
mIterable	1	true	m.Iterable(1)
mLinkedHashMap	-1	true	m.LinkedHashMap(1 -> 1)
mLinkedHashSet	-1	true	m.LinkedHashSet(1)
mListBuffer	1	true	m.ListBuffer(1)
mListMap	-1	true	m.ListMap(1 -> 1)
mLongMap	-1	true	m.LongMap(1L -> 1)
mMap	-1	true	m.Map(1 -> 1)
mMultiMap	-1	true	new m.HashMap[Int, m.Set[Int]] with m.MultiMap[Int, Int]
mOpenHashMap	-1	true	m.OpenHashMap(1 -> 1)
mPriorityQueue	-1	true	m.PriorityQueue(1)
mQueue	1	true	m.Queue(1)
mSeq	1	true	m.Seq(1)
mSet	-1	true	m.Set(1)
mSortedMap	-1	true	m.SortedMap(1 -> 1)
mSortedSet	-1	true	m.SortedSet(1)
mStack	1	true	m.Stack(1)
mStringBuilder	0	true	new m.StringBuilder()
mTreeMap	-1	true	m.TreeMap(1 -> 1)
mTreeSet	-1	true	m.TreeSet(1)
mUnrolledBuffer	-1	true	m.UnrolledBuffer(1)
mWeakHashMap	-1	true	m.WeakHashMap(1 -> 1)

bufferedSource	-1	false	scala.io.Source.fromFile(new java.io.File("../../../build.sbt"))
source	-1	false	scala.io.Source.fromString("hello")

tuple2Zipped	-1	true	(List(1), List(2)).zipped
tuple3Zipped	-1	true	(List(1), List(2), List(3)).zipped
zippedIterable2	-1	?	(tuple2Zipped: runtime.ZippedIterable2[Int, Int])
zippedIterable3	-1	?	(tuple3Zipped: runtime.ZippedIterable3[Int, Int, Int])
systemProperties	-1	?	sys.props

julienrf commented 6 years ago

Thanks @MasseGuillaume for the investigation! Could you add an “expected” column so that we know what actions should be performed?

szeiger commented 6 years ago

Regarding the examples: We need to check the path taken by converting Some(1) to Iterable, it's reasonable to expect a known size.

OTOH, it makes perfect sense that c.Iterator(1) has knownSize = 1, this an improvement over the previous state where you would get hasDefiniteSize = false.

I'm rescheduling this for RC1. Better known sizes are a performance improvement that can still be done after M5.

Ichoran commented 6 years ago

There's a correctness issue as well--hasDefiniteSize no longer obeys its old contract of reporting when the collection is unbounded (i.e. potentially infinite). Shall I open another issue for that one?

The old contract also states that Iterator is to report it is potentially unbounded even when it is taken from a finite collection.

joshlemer commented 6 years ago

some | -1 | true | Some(1)

Option[A] only has knownSize through implicit conversion to Iterable[A] which in implementation, creates a List[A].

How about changing ::.knownSize to if (tail == Nil) 1 else -1?

julienrf commented 6 years ago

Option[A] only has knownSize through implicit conversion to Iterable[A] which in implementation, creates a List[A].

Then we have lost this when we migrated to scala/scala. The goal was to avoid the creation of an intermediate List when we are only interested in an Iterator (which is the case in flatMap, for instance). We could re-implement it like this, btw:

implicit def optionToIterableOnce[A](maybeA: scala.Option[A]): IterableOnce[A] =
  if (maybeA.isEmpty) Iterator.empty else Iterator.single(maybeA.get)

Iterator.single should override knownSize, BTW.

joshlemer commented 6 years ago

I'm gonna open a PR for this, I also noticed there's two implementations of single-element Iterators, which I'll combine.

szeiger commented 6 years ago

@Ichoran You're right that hasDefiniteSize is not identical to knownSize >= 0. Would it make more sense to add concrete hasDefiniteSize implementations in 2.13 for compatibility instead of using the simple but not quite correct version we have now?

I don't see a reason for keeping the old semantics around In the long run. Quoting the 2.12 scaladocs:

Note: many collection methods will not work on collections of infinite sizes. The typical failure mode is an infinite loop. These methods always attempt a traversal without checking first that hasDefiniteSize returns true. However, checking hasDefiniteSize can provide an assurance that size is well-defined and non-termination is not a concern.

In other words: Even though the standard library provides this method it does not actually make any use of it. How useful is this method for users if it's not even useful for other collection methods?

The old contract also states that Iterator is to report it is potentially unbounded even when it is taken from a finite collection.

From the 2.12 scaladocs again:

Non-empty Iterators usually return false even if they were created from a collection with a known finite size.

Note the "usually", there is no guarantee. My guess is that this was the simplest way to implement it and nobody wanted to put in the extra effort because they knew it wouldn't be useful anyway.

Ichoran commented 6 years ago

I think we should add concrete implementations of hasDefiniteSize for compatibility. I am agnostic about whether we should retain the method in the long run, or simply deprecate it. On the one hand, there isn't any other way to check for possibly unbounded collections (knownSize < 0 doesn't mean it's possible); on the other, unbounded collections are not very common, and extending the collections isn't very common.

NthPortal commented 6 years ago

I think I'm in favour of deprecating hasDefiniteSize and removing it in the long run, as it has poor semantics (both in terms of what is documented and what is possible), and is also not very actionable.

Poor semantics:

The old contract also states that Iterator is to report it is potentially unbounded even when it is taken from a finite collection.

This is not really a desirable property, as it means that the method is unnecessarily conservative, and may prevent the use of collections which actually have a definite size.

My guess is that this was the simplest way to implement it and nobody wanted to put in the extra effort because they knew it wouldn't be useful anyway.

I understand why it's significantly simpler to implement, and don't think it should be done otherwise; however, it still undermines the value of the method.
Some collections (e.g. LazyList) are unable to report that they have a definite size, even when it is known to the programmer. Creating a LazyList from the lines of a file will always yield one of finite size (unless you manage to create an infinitely large file?), but there's no way for the LazyList instance to know or relay this information. Many of the generators for LazyList (e.g. fill, tabulate) create fixed-size LazyLists, but they too cannot relay this information.

Stream in 2.12 scans through itself as long as its tails are defined (while checking for cycles), and returns true if all elements are evaluated. This strategy would work for LazyList as well, but unfortunately, it runs in O(n) time in the case where it returns true (for both Stream and LazyList), which seems undesirable.

Not actionable:

What are you supposed to do if you call hasDefiniteSize and it returns false? Throw an exception? java.lang.IllegalArgumentException: the collection passed in was not definitely finite in size oh great thanks

scala / bug

knownSize vs hasDefiniteSize #11065