`inits` is not asymptotically optimal. Unlike for lists, useful internal sharing is possible. By concatenating buffers we can use larger chunks. An implementation that doesn't care to be properly lazy can achieve $O(n)$ easily: `inits = map TL.fromStrict . TS.inits . TL.toStrict`. But with proper laziness I expect $O(n \log(numChunks))$ is optimal.

> `map` is lazy and doesn't evaluate the earlier prefixes.

`map`ping has to traverse the spine of the layer below, even if it's not evaluating the elements of that layer below. So we generate the entire $O(i \times numChunks)$ `inits` structure before the first element of the last prefix becomes available. Gross. (The set-up in `LazyByteString.inits` avoids this by not building intermediate lists, but still suffers from the `map (take 1)` problem below.) The `SnocBuilder` stuff in `List.inits` is there to ensure that uses like `map (take 1) (inits xs)` need only $O(n)$ time instead of $O(n^2)$.

> An implementation that doesn't care to be properly lazy can achieve $O(n)$ easily
Sure, I considered it a given that we want to preserve laziness. More specifically, we preserve the chunk boundaries in the result. Also, for now, I haven't distinguished between length and chunk length for simplicity. In the worst case they are equal. Of course, we want to consider this if it matters between competing implementations.
> But with proper laziness I expect $O(n \log(numChunks))$ is optimal.

How?
> Evaluating the i-th text in the result only takes $O(i)$ time, because `map` is lazy and doesn't evaluate the earlier prefixes.
`map` still takes time to reach the required position in the list. The $O(i^2)$ time can be demonstrated, of course.
```
lazy text
  inits:                 OK
    27.7 ms ± 1.1 ms
  last . inits:          OK
    3.07 ms ± 235 μs
  take 1 . last . inits: OK
    3.00 ms ± 202 μs
  map (take 1) . inits:  OK
    3.05 ms ± 176 μs
list
  inits:                 OK
    14.4 ms ± 1.3 ms
  last . inits:          OK
    14.7 μs ± 1.1 μs
  take 1 . last . inits: OK
    8.05 μs ± 456 ns
  map (take 1) . inits:  OK
    107 μs ± 10 μs
```
All of these except for `inits` can be $O(n)$ but are $O(n^2)$ for lazy text.
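The report above has the shape of tasty-bench output; a minimal harness producing it might look like the sketch below. The input construction via `TL.replicate` and its size are my assumptions, not the original benchmark code.

```haskell
import qualified Data.Text.Lazy as TL
import Test.Tasty.Bench

-- Hypothetical input, intended to consist of many small chunks
-- (the worst case for the current inits).
input :: TL.Text
input = TL.replicate 1000 (TL.pack "0123456789")

main :: IO ()
main = defaultMain
  [ bgroup "lazy text"
      [ bench "inits"                 $ nf TL.inits input
      , bench "last . inits"          $ nf (last . TL.inits) input
      , bench "take 1 . last . inits" $ nf (TL.take 1 . last . TL.inits) input
      , bench "map (take 1) . inits"  $ nf (map (TL.take 1) . TL.inits) input
      ]
  ]
```

The `list` group would be the analogous set of benchmarks run over an ordinary list instead of a lazy `Text`.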
> More specifically, we preserve the chunk boundaries in the result.
Well, if you insist that the chunk boundaries in the i-th output `Text` agree with the chunk boundaries within the first i characters of the input `Text`, then we cannot do better. But that's very specific!
> > But with proper laziness I expect $O(n \log(numChunks))$ is optimal.
>
> How?
As mentioned, the basic idea is to reduce the total number of chunks in the output by concatenating input chunks. I haven't worked through all of the details, but here's a (rather long, sorry) sketch, with a small code illustration after the outline:
Upper bound:
Lower bound:
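To make the concatenation idea concrete, here is a minimal binary-counter sketch of my own (not the author's proof; `snocChunk` and `initsChunks` are made-up names). It works at chunk granularity only, ignoring prefixes that end mid-chunk, and shows why each prefix can be kept to $O(\log(numChunks))$ chunks:

```haskell
import qualified Data.Text as TS

-- Each node holds the concatenation of some input chunks, tagged with
-- how many input chunks it absorbed. Nodes with equal counts merge
-- like carries in a binary counter, so the counts along the list are
-- distinct powers of two and a prefix has O(log numChunks) nodes.
snocChunk :: [(Int, TS.Text)] -> TS.Text -> [(Int, TS.Text)]
snocChunk nodes c = carry (1, c) nodes
  where
    carry (k, t) ((k', t') : rest)
      | k == k' = carry (k + k', t' `TS.append` t) rest  -- merge, carry on
    carry node rest = node : rest

-- All prefixes at chunk boundaries, largest node first. Each input
-- byte is copied once per merge level, i.e. O(log numChunks) times in
-- total, and later prefixes share their big merged nodes with earlier
-- ones.
initsChunks :: [TS.Text] -> [[TS.Text]]
initsChunks = map (reverse . map snd) . scanl snocChunk []
```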
> `map` still takes time to reach the required position in the list.
Yeah, I remembered the problem of `map`'s stupid intermediate lists a little bit after I posted. (The trickier `map (take 1)` problem is still the worst-case behavior when just squashing the `map` layers together like I did in the bytestring PR.)
A better implementation of `inits` would be welcome! The `Data.List` implementation seems like a good starting point.
I am unsure about an implementation that concatenates chunks to produce later prefixes faster. For one thing, it can have a negative effect on lazy texts consisting of few big chunks, where the work of re-chaining chunks together is negligible compared to concatenating them. That is arguably the common case.
Is using a queue or a difference list as in OldList and bytestring worth it? Instead, isn't it sufficient to keep only the length as an accumulator, so each prefix can simply be generated as `take n`? Am I missing something?
```haskell
inits :: [a] -> [[a]]
inits xs = [take n xs | n <- [0 .. length xs]] -- the [0 .. length xs] list should be fused away
```
That seems much simpler! It needs to be modified to support infinite lists, however.
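One way to make it cope with infinite lists (a sketch; `inits'` is a placeholder name) is to pace the prefix lengths against the input itself instead of computing `length` up front:

```haskell
-- Zipping against xs forces the spine only as far as prefixes are
-- demanded, so this also works on infinite lists.
inits' :: [a] -> [[a]]
inits' xs = [] : zipWith (\n _ -> take n xs) [1 ..] xs
```

For example, `take 3 (inits' [1 ..])` terminates and yields `[[],[1],[1,2]]`.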
I traced the OldList implementation to https://gitlab.haskell.org/ghc/ghc/-/issues/9345. This `take` version was considered (search for `initsT`) but the current version was found to be better.
The same might not apply to lazy text though, so it is worth evaluating.
Thanks for finding that thread!
@treeowl might you remember why the queue implementation of `inits` was chosen in the end instead of the simpler one relying on `take`? It is mentioned briefly (as `initsT`) in the GHC issue https://gitlab.haskell.org/ghc/ghc/-/issues/9345, but then "`initsQ2` really is better".
I also dug up the libraries thread https://mail.haskell.org/pipermail/libraries/2014-July/023328.html
@Lysxia, it was chosen because it was (surprisingly to me) faster. As I recall, someone who knows more than I do speculated that this had to do with the way GHC represents closures.
I was also quite surprised for a while, but I found a simple explanation. I was wrongly assuming that Okasaki's banker's queue was somehow incurring unnecessary overhead. In fact, `take` does more work per element than the banker's queue's `toList` function: `take` has to do one comparison on `Int#` and one match on a cons cell for every element, whereas `toList` is a sequence of `(++)` and `reverse` (each one producing half of the output list), which only have to match on a cons cell at every step (no `Int#` comparison).
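A minimal banker's-queue sketch of that mechanism (my own names, not the `Data.OldList` internals): the invariant that the rear never outgrows the front means `toListQ` rebuilds each prefix from one `(++)` and one `reverse` over halves of it, matching only cons cells.

```haskell
-- Queue with a front list, a reversed rear list, and their lengths.
-- Invariant: length rear <= length front.
data Q a = Q !Int [a] !Int [a]

emptyQ :: Q a
emptyQ = Q 0 [] 0 []

snocQ :: Q a -> a -> Q a
snocQ (Q lf f lr r) x
  | lr + 1 > lf = Q (lf + lr + 1) (f ++ reverse (x : r)) 0 []  -- rotate
  | otherwise   = Q lf f (lr + 1) (x : r)

toListQ :: Q a -> [a]
toListQ (Q _ f _ r) = f ++ reverse r

-- Successive prefixes share the already-rotated fronts, which is what
-- keeps uses like map (take 1) (initsQ xs) linear overall.
initsQ :: [a] -> [[a]]
initsQ = map toListQ . scanl snocQ emptyQ
```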
To confirm this, I managed to write an even faster `inits` by replacing the banker's queue with a mutable array, whose `toList` avoids the work of consuming a list.
Implementation and benchmarking: https://gist.github.com/Lysxia/3b733d87da8ea056c4e4a27fc047e170
Benchmark results (`initsA` is the new implementation with arrays, `initsQ = Data.List.inits`, and `initsT` uses `take`):
Plain text output:
```
benchmarking initsA
time                 216.7 ms   (210.6 ms .. 221.5 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 214.4 ms   (210.5 ms .. 216.7 ms)
std dev              4.450 ms   (1.862 ms .. 7.362 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking initsQ
time                 249.9 ms   (243.1 ms .. 259.4 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 246.9 ms   (245.5 ms .. 248.6 ms)
std dev              2.080 ms   (1.056 ms .. 3.051 ms)
variance introduced by outliers: 16% (moderately inflated)

benchmarking initsT
time                 259.8 ms   (252.5 ms .. 266.8 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 262.6 ms   (260.7 ms .. 264.4 ms)
std dev              2.408 ms   (1.423 ms .. 3.065 ms)
variance introduced by outliers: 16% (moderately inflated)
```
`initsA` isn't lazy like the others, so it's hard to make a meaningful comparison.
It is lazy like the others. It uses unsafe primitives, but that is entirely invisible to a user of `inits`, however they choose to force it.
My apologies. I missed one of the `unsafePerformIO` invocations.
Ah, right... $O(n)$ total cost is possible in spite of my $\Omega(n \log(numChunks))$ lower bound proof above, by cheating a little and returning slices of not-yet-fully-initialized buffers.
#558 made me notice that while the implementation of `inits` is overall optimal ($O(n^2)$), it has an awkward property: evaluating the i-th text in the result takes $O(i^2)$ time. Ideally this would just take $O(i)$ time, since it involves reaching the i-th element in a list, and that element is a text of length i.

I believe `Data.List.inits` has solved this problem using a queue: https://hackage.haskell.org/package/base-4.19.0.0/docs/src/Data.OldList.html#inits

The same could be adopted here.