scala / bug

Scala 2 bug reports only. Please, no questions — proper bug reports only.
https://scala-lang.org
232 stars 21 forks source link

optimize Vector concatenation (Vector ++ Vector) #4442

Closed scabug closed 1 year ago

scabug commented 13 years ago

currently scala.collection.immutable.Vector.++ simply inherits its implementation from a supertrait (through the stubs added in r24227), so if you concatenate two vectors, it doesn't take advantage that that both are the same structure, but just treats the second vector like any traversable.

in theory this could be O(log n), not O(n), yes? Tiark seems to confirm it in this comment on #3724: "Unfortunately, for implementing a fast concat operation (we hope to do that for 2.8.1 or 2.9) heterogenous Arrays seem to be necessary (we'll be storing Int arrays next to the sub nodes). We might rethink this however, and try to stick to homogeneous Arrays."

I just thought there should be enhancement ticket on this as it would still be nice to have someday.

scabug commented 13 years ago

Imported From: https://issues.scala-lang.org/browse/SI-4442?orig=1 Reporter: @SethTisue See #3724

scabug commented 13 years ago

@TiarkRompf said: I've been having this on my todo list for quite a while now. The problem is that it requires pervasive changes to the current internal structure and those are hard to get right without slowing down other uses of Vector. Performance will most likely be amortized log n, not worst case. It's good to have a ticket for this, though.

scabug commented 13 years ago

@SethTisue said: Is Phil working on this? (the latest Scala meeting notes seem to say so)

scabug commented 12 years ago

@TiarkRompf said (edited on Mar 2, 2012 9:10:30 PM UTC): Here's a link to the paper: http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf?version=1

Still need to resolve integration with the current Vector implementation. Maybe it's better to have a separate CatenableVector as a first step.

scabug commented 11 years ago

@pchiusano said: Related to this, (1 to N).foldLeft(Vector[Int]())(_ :+ _) runs in linear time, but (1 to N).foldLeft(Vector[Int]())((acc,a) => acc ++ List(a)) is quadratic. This is extremely surprising (I spent several hours tracking down a performance bug in scalaz-stream that ended up being caused by this). Since Vector has a constant time snoc operation, it seems like the default implementation of ++ should be repeated calls to :+, which doesn't require any internal changes to Vector.

scabug commented 11 years ago

@SethTisue said: Paul: I'm extremely glad you noticed this. I don't think anyone realized the situation was that bad! I opened a separate ticket on it at #7725.

scabug commented 10 years ago

@adriaanm said: was this fixed along with SI-7725?

scabug commented 10 years ago

@Ichoran said: This was not fixed--this is about preserving structural sharing across the entire tree, while the other was to at least start from the bigger of two collections to get at least some sharing.

scabug commented 10 years ago

@SethTisue said: reassigning to "Backlog" on the grounds that the research on how it would even work hasn't been done yet

scabug commented 8 years ago

@Blaisorblade said: Shouldn't RRB-Vector be a potential fix for this, according to the ICFP'15 paper describing it?

https://github.com/nicolasstucki/scala-rrb-vector https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf http://dl.acm.org/citation.cfm?id=2784739

joshlemer commented 6 years ago

These are very relevant! paper describing a potentially new data structure for Vectors: https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf implementation: https://github.com/nicolasstucki/scala-rrb-vector

V-Lamp commented 5 years ago

Is this addressed by any chance in the new collections in 2.13? 🙂

joshlemer commented 5 years ago

@V-Lamp the Vectors in 2.13 will not be the new data structure (RRB Vectors), but their concatenation may be improved enormously by https://github.com/scala/scala/pull/7588 which ensures that during concatenation of vectors, the structure of the left Vector is used rather than copying it into a new Vector. This essentially means that Vector concatenation will be constant-time on the size of the left vector (but still O(n) where n = size of right vector).

szeiger commented 4 years ago

https://github.com/scala/scala/pull/7588 was reverted due to https://github.com/scala/bug/issues/11636 but the new Vector implementation that I am working on for 2.14 also has this optimization (appending n elements to a Vector of size m is O(n + log(m+n))). I don't know if I'll find the time to implement it but the same is possible for prepending. The combination of the two would allow concatenation depending on the size of the shorter slice.

SethTisue commented 3 years ago

Note that Vector was completely redone for 2.13.2: https://github.com/scala/scala/pull/8534

Has anyone studied what the current situation is w/r/t to this ticket?

SethTisue commented 3 years ago

I see that on https://github.com/scala/scala/pull/8534 @szeiger wrote,

I did not get around to implementing an optimized concatenation where the right-hand side vector determines the alignment. Like most other algorithms for this data structure it is quite easy to visualize how it works and assure yourself that it is possible but actually implementing it correctly is much harder. This would allow concatenation of two vectors of sizes m and n in O(min(n, m) + log(n+m)) time.

So that sounds like we're still where we before: doing concatenation as efficiently as possible remains future work.

ansvonwa commented 2 years ago

For the record: I've implemented concatenation in O(min(n, m) + log(n+m)) time. (If you're really lucky and the Vector structures are perfectly aligned, it's even only O(log(n+m)).) It's not yet PR-ready, but I hope I'll be able to polish it in a month or so when I'm back from vacation.