infinity0 opened 3 years ago
Are you 100% sure that the thunks are not due to how Seq works, i.e. that it has thunks by design to make the structure fast?
I will double-check this soon, although the current documentation only says "strict in its length" and does not mention strictness in the values.
I'm okay with this idea. I would be interested in seeing the code that's leaking; if there's a bug here, I'd like to fix it. There will be thunks in sequences; if they build up, that's likely a problem.
The culprit is indeed Data.Sequence - forcing the item before using |> fixes the issue for me (see the commit linked just above this comment). I can still see a couple of thunks in ghc-heap-view, but nowhere near as many as before, and I guess these are the "permissible thunks" that are part of Seq's design.
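For reference, the fix amounts to forcing each element to WHNF before it goes into the sequence. A minimal sketch (the function names here are mine, not from the linked commit):

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))

-- Leaky version: `f n` is stored as an unevaluated thunk inside the Seq,
-- and such thunks accumulate for the lifetime of the structure.
pushLazy :: (Int -> Int) -> Seq Int -> Int -> Seq Int
pushLazy f xs n = xs |> f n

-- Fixed version: force the element before snoc'ing it on, so the Seq
-- only ever holds evaluated values.
pushStrict :: (Int -> Int) -> Seq Int -> Int -> Seq Int
pushStrict f xs n = let v = f n in v `seq` (xs |> v)
```

Both produce the same sequence; the difference is only in what is retained on the heap until the elements are demanded.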
The fully-running code is a bit long, but I can post it if you really want; I don't think it's that interesting though - effectively a data structure that stores a bunch of Map k (Seq a) values and lives for a long time. Certainly I can see how laziness might benefit similar data structures that live for shorter lifetimes, in a more "control"-like context rather than a "data"-like one.
The fix you linked to indicates that the bug is in your code and not Data.Sequence. The advantage of a strict sequence module is that it makes such bugs less likely to occur. The internal laziness in sequences is of a very careful nature, and users should almost never have to worry about it at all.
> The fix you linked to indicates that the bug is in your code and not Data.Sequence. The advantage of a strict sequence module is that it makes such bugs less likely to occur.
I don't mean to play the blame game here - just to point out that the guarantees around space usage given by the current Data.Sequence are not the most obvious for users to reason about. It is not a particularly hard concept to grasp, but even after one understands it, one needs to spend extra mental effort when using it, which would not be needed with Data.Sequence.Strict - I'd rather spend my mental effort on the higher-level application logic, i.e. what I wanted to be doing in the first place. As I understand it, this is one of the top complaints about Haskell amongst newcomers dabbling with it.
An even simpler API would be to have separate strict & lazy data types in the first place (where the fields storing values are annotated with ! so they can never be lazy), which is what we were discussing on haskell-strict/strict#22.
We have strict interfaces to Data.Map and Data.IntMap. It's unclear to me why we wouldn't want a strict interface for Data.Sequence too. Is there a counterpoint that someone can elaborate on?
@tomjaguarpaw Data.Sequence's time complexities depend on the laziness.
Read http://www.staff.city.ac.uk/~ross/papers/FingerTree.pdf
> The same bounds hold in a persistent setting if subtrees are suspended using lazy evaluation. This ensures that transformations deep in the spine do not take place until a subsequent operation needs to go down that far. Because of the above properties of safe and dangerous digits, by that time enough cheap shallow operations will have been performed to pay for this expensive evaluation.
If you need a TL;DR: Data.Sequence wouldn't work as advertised if it were made strict.
It can be made strict in its values without being strict in its structure, and therefore without defeating the time complexities. And that is indeed what the OP asked for:
> The solution that is most friendly to users of containers such as myself, would be to provide a Data.Sequence.Strict that forces the values before storing them inside the container.
>
> -- https://github.com/haskell/containers/issues/752#issue-726329570
> It can be made strict in its values without being strict in its structure, and therefore without defeating the time complexities. And that is indeed what the OP asked for:
No, it cannot. Consider map: to force all the values, it would need to traverse and force the whole structure. Insertion can be made strict, but a strict map is impossible without forcing the spine.
I see. I think there is a difference between how you and I understand "strict in values". I don't know what @infinity0 intended, but I mean that when the "cell holding the value" (whatever that is) is evaluated then the value is evaluated too.
So, I would call
data ValueStrictList a = Cons !a (ValueStrictList a) | Nil
"strict in values". Perhaps you would not.
In any case, that kind of behaviour is what I'm suggesting we need in a putative Data.Sequence.Strict, although in practice perhaps I support the use of Data.Strict.Sequence more strongly.
I think the way that would match the rest of the package would use something equivalent to
map f = getSolo . traverse (\x -> Solo $! f x)
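Spelled out as a compilable sketch (I define a local Solo here so the example doesn't depend on Data.Tuple.Solo, which only exists in recent versions of base; strictMap is my name for the hypothetical operation):

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq)

-- A one-element container. Unlike the newtype Identity, Solo is a proper
-- data constructor, so pattern matching on it forces the `Solo $! f x`
-- thunk, which in turn forces `f x`.
data Solo a = Solo a

getSolo :: Solo a -> a
getSolo (Solo a) = a

instance Functor Solo where
  fmap f (Solo a) = Solo (f a)

instance Applicative Solo where
  pure = Solo
  Solo f <*> Solo a = Solo (f a)

-- Hypothetical value-strict map for Seq: each result is forced to WHNF
-- as the traversal rebuilds the structure.
strictMap :: (a -> b) -> Seq a -> Seq b
strictMap f = getSolo . traverse (\x -> Solo $! f x)
```

The key point is that the spine of the result is still built by the ordinary traversal machinery, so the structural laziness (and hence the amortized bounds) is not disturbed.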
The
data ValueStrictList a = Cons !a (ValueStrictList a) | Nil
will contain value thunks:
Cons 1 <thunk of Cons (some-computation 2) ...>
Lazy spine but strict values doesn't make sense to me.
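The point can be made concrete: the pending computation does survive inside the suspended spine cell, but it is forced the moment that cell is evaluated. A small demo, assuming nothing beyond base:

```haskell
import Control.Exception (ErrorCall, evaluate, try)

data ValueStrictList a = Cons !a (ValueStrictList a) | Nil

main :: IO ()
main = do
  -- The tail is a thunk; the `error` hides inside it for now.
  let xs = Cons (1 :: Int) (Cons (error "value thunk") Nil)
  Cons h t <- evaluate xs   -- forcing the first cell also forces its value
  print h                   -- prints 1
  r <- try (evaluate t) :: IO (Either ErrorCall (ValueStrictList Int))
  case r of
    Left _  -> putStrLn "forcing the cell forced its value"
    Right _ -> putStrLn "value was never forced"
```

So "lazy spine, strict values" here means: a value thunk can hide inside an unevaluated spine cell, but it can never outlive that cell's evaluation.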
Data.Strict.Sequence is fine. However, don't trust the time complexities: they are amortized ones, and in a strict structure there is no such thing as an amortized time complexity. That said, if you just concatenate and split and don't really map/modify elements (as Seq is often used), then it's probably fine (at least I failed to quickly construct an example where it would be visible) - however there is no proof that something advertised as O(1) doesn't become O(log n) or even O(n).
It would be nice to have e.g. the Kaplan & Tarjan (Purely Functional, Real-Time Deques with Catenation) deque, where the bounds are worst-case and the structure is spine-strict (as far as I understand the paper). (EDIT: AFAIU it would differ from Data.Sequence for some operations, e.g. no efficient indexing?)
> Now we turn to recent related work. In work independent of ours, Okasaki [1996; 1998] has devised a confluently persistent implementation of catenable stacks (or steques). His implementation is not real-time but gives constant amortized time bounds per operation. It is also not purely functional, but uses memoization. Okasaki uses rooted trees to represent the stacks. Elements are popped using a memoized version of the path reversal technique previously used in a data structure for the disjoint set union problem [Tarjan and Van Leeuwen 1984]. Though Okasaki’s solution is neither real-time nor purely functional, it is simpler than ours. Extending Okasaki’s method to the case of deques is an open problem. After seeing an early version of our work [Kaplan and Tarjan 1996], Okasaki [1996; 1998] observed that if amortized time bounds suffice and memoization is allowed, then all of our data structures can be considerably simplified. The idea is to perform fixes in a lazy fashion, using memoization to record the results. This avoids the need to maintain the “stack of stacks” structures in our representations, and also allows the buffers to be shorter. Okasaki called the resulting general method “implicit recursive slow-down.”
So in summary, read the literature, implement proper data structures, don't just sprinkle bangs around.
> Lazy spine but strict values doesn't make sense to me.
Interesting. It makes sense to me. I wonder why the difference.
@phadej, it's entirely reasonable to talk about a strict version of Data.Sequence. Yes, there will be thunks in the structure, but if you do things right, none of those thunks will produce an element value; they'll only perform restructuring work. For example,
(!x) Strict.<| xs = x Lazy.<| xs
xs Strict.>< ys = xs Lazy.>< ys
Strict.adjust = Lazy.adjust'
mapM f = fmap (map id) . traverse f
-- as above
map f = getSolo . traverse (\a -> Solo $! f a) -- This is probably not the best implementation; you'd really want a custom one.
The amortized bounds will all be perfectly fine. Mapping, traversal, zipping, etc., will be more expensive in practice because they allocate monolithic structures, but that's allowed by the amortized bounds.
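The delegation scheme sketched above can be written out as a small compilable fragment (the function names here are mine; only the boundary operations force values, so the underlying finger tree's amortized bounds are untouched):

```haskell
import qualified Data.Sequence as L
import Data.Sequence (Seq)

-- Strict cons/snoc: force the element to WHNF, then delegate to the
-- lazy operations.
cons' :: a -> Seq a -> Seq a
cons' x xs = x `seq` (L.<|) x xs

snoc' :: Seq a -> a -> Seq a
snoc' xs x = x `seq` (L.|>) xs x

-- adjust' (in containers since 0.5.8) already forces the updated
-- element before installing it, so it can be reused as-is.
adjustStrict :: (a -> a) -> Int -> Seq a -> Seq a
adjustStrict = L.adjust'
```

Concatenation and splitting need no wrapping at all: they never introduce new element thunks, only restructuring thunks.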
@treeowl, if you say so, then please make Data.Sequence.Strict asap and close that issue.
I'll be very happy to be wrong here.
> map f = getSolo . traverse (\a -> Solo $! f a)

This is a very good observation. I somehow got locked into thinking that a strict map should be implemented using the lazy map (and that's impossible AFAIK), but with traverse you can indeed force the values.
@phadej, yeah, only the thunks along the spine matter for amortization. I'll try to do it soon, unless someone else gets there first.
I actually already implemented a value-strict version of Sequence in strict-containers, here: https://hackage.haskell.org/package/strict-containers-0.1/docs/Data-Strict-Sequence.html
Like everything else there, Data.Strict.Sequence is auto-generated from Data.Sequence in this package, using this patch. I'd be happy to drop it there in favour of putting it in this package.
> don't just sprinkle bangs around.
I think "sprinkling bangs" is fine if you understand the purpose of each bang/lazy part. I wrote the patch mentioned above after specifically considering the part of the paper that says:
From the paper: "In a strict language that provides a lazy evaluation primitive, we need only suspend the middle subtree of each Deep node, so only Θ(log n) suspensions are required in a tree of size n. Even in a lazy language, some space could be saved in practice by ensuring that these were the only suspensions in the tree, for example by using Haskell’s strictness annotations." That is, making everything else strict preserves the performance bounds.
I didn't change the fmap implementation, so I think the lazy behaviour should be preserved - as long as you don't access all the values, the strictness of values is hidden underneath the laziness of the rest of the data structure. (Likewise for other utilities)
@infinity0 good to hear. Could you add that clarification to https://hackage.haskell.org/package/strict-containers-0.1/docs/Data-Strict-Sequence.html? I was wrongly assuming (I'm sorry) that Data.Strict.Sequence is spine-strict as well, but there are still thunks in there.
Also, the fmap behaviour may surprise people.
Sure, I've filed haskellari/strict-containers#4, which I'll get to next time I get into Haskell stuff in more depth. Regarding the fmap behaviour, you are right: it would be partially lazy, which is not great and is surprising for a value-strict data structure, so that does need to be tweaked for Data.Strict.Sequence.
I suppose the contract for a strict container would be, if the top-level structure is in WHNF then the values should also be in WHNF. So fmap and other utilities should be aiming for that.
> I suppose the contract for a strict container would be, if the top-level structure is in WHNF then the values should also be in WHNF. So fmap and other utilities should be aiming for that.
I agree that is the contract that I would expect.
I'm fairly sure that utilities that act on a single specific item in a Data.Strict.Sequence.Seq would already satisfy that contract, since (in the process of creating the new structure) they have to traverse the old structure and get to the value, thereby forcing it. But non-item-specific utilities including whole-container utilities like fmap currently do not, and need further work.
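That contract is already observable in the existing strict maps: forcing the result of a strict insert also forces the stored value. A small check, assuming only base and containers:

```haskell
import qualified Data.Map.Strict as M
import Control.Exception (ErrorCall, evaluate, try)

main :: IO ()
main = do
  -- Data.Map.Strict.insert forces the value before storing it, so the
  -- error surfaces as soon as the insertion itself is evaluated.
  r <- try (evaluate (M.insert (1 :: Int) (error "unevaluated value") M.empty))
         :: IO (Either ErrorCall (M.Map Int Int))
  case r of
    Left _  -> putStrLn "strict insert forced the value"
    Right _ -> putStrLn "value slipped in lazily"
```

A Data.Sequence.Strict honouring the same contract would behave identically under this kind of test.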
Data.Sequence is only strict in its length and not its values - analogous to how Data.Map.Lazy is strict in its keys but not its values. This is causing a space leak in my program, and ghc-heap-view clearly shows unevaluated thunks within the Seq structure. The solution that is most friendly to users of containers such as myself would be to provide a Data.Sequence.Strict that forces the values before storing them inside the container.