haskell / containers

Assorted concrete container types
https://hackage.haskell.org/package/containers

Add a Data.Sequence.Strict #752

Open infinity0 opened 3 years ago

infinity0 commented 3 years ago

Data.Sequence is only strict in its length and not its values - analogous to how Data.Map.Lazy is strict in its keys but not its values.

This is causing a space leak in my program, and ghc-heap-view clearly shows unevaluated thunks within the Seq structure. The solution most friendly to users of containers such as myself would be to provide a Data.Sequence.Strict that forces the values before storing them inside the container.
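
A minimal sketch of the leak pattern (illustrative names, not the original program): every appended element is stored as an unevaluated thunk, and the Seq keeps all of them alive.

import Data.Foldable (foldl')
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Even with a strict fold over the accumulator, each stored element is an
-- unevaluated (i + 1) application until something finally demands it.
leaky :: Int -> Seq Int
leaky n = foldl' (\acc i -> acc |> (i + 1)) Seq.empty [1 .. n]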

phadej commented 3 years ago

Are you 100% sure that the thunks are not due to how Seq works, i.e. that it has thunks by design to keep the structure fast?

infinity0 commented 3 years ago

I will double-check this soon, although the current documentation only says "strict in its length" and does not mention strictness in the values.

treeowl commented 3 years ago

I'm okay with this idea. I would be interested in seeing the code that's leaking; if there's a bug here, I'd like to fix it. There will be thunks in sequences; if they build up, that's likely a problem.

infinity0 commented 3 years ago

The culprit is indeed Data.Sequence - forcing the item before using |> fixes the issue for me (see the commit linked just above this comment). I can still see a couple of thunks in ghc-heap-view, but nowhere near as many as before; I guess these are the "permissible thunks" that are part of Seq's design.

The full running code is a bit long, but I can post it if you really want; I don't think it's that interesting though - effectively a data structure that stores a bunch of Map k (Seq a) values and lives for a long time. I can certainly see how laziness might benefit similar data structures with shorter lifetimes, used in a more "control"-like context rather than a "data"-like one.
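
The fix amounts to forcing each element before it goes in; as a sketch (snocStrict is an illustrative name, and a hypothetical Data.Sequence.Strict would do this on the caller's behalf):

{-# LANGUAGE BangPatterns #-}

import Data.Sequence (Seq, (|>))

-- Force the element to WHNF before handing it to (|>), so no thunk is
-- ever stored in the sequence.
snocStrict :: Seq a -> a -> Seq a
snocStrict xs !x = xs |> x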

treeowl commented 3 years ago

The fix you linked to indicates that the bug is in your code and not Data.Sequence. The advantage of a strict sequence module is that it makes such bugs less likely to occur. The internal laziness in sequences is of a very careful nature, and users should almost never have to worry about it at all.

infinity0 commented 3 years ago

that the bug is in your code and not Data.Sequence. The advantage of a strict sequence module is that it makes such bugs less likely to occur

I don't mean to play the blame game here - just to point out that the guarantees around space usage given by the current Data.Sequence are not the easiest for users to reason about. It is not a particularly hard concept to grasp, but even after one understands it, one has to spend extra mental effort when using it - effort that would not be needed with a Data.Sequence.Strict. I'd rather spend that effort on the higher-level application logic, i.e. what I wanted to be doing in the first place. As far as I can tell, this is one of the top complaints about Haskell among newcomers dabbling with it.

An even simpler API would be to have separate strict and lazy data types in the first place (where the fields storing values are annotated with ! so they can never be lazy), which is what we were discussing in haskell-strict/strict#22.

tomjaguarpaw commented 2 years ago

We have strict interfaces to Data.Map and Data.IntMap. It's unclear to me why we wouldn't want a strict interface to Data.Sequence too. Is there a counterpoint someone can elaborate on?
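
For comparison, the existing strict Map interface differs from the lazy one only in when values are forced:

import qualified Data.Map.Lazy as ML
import qualified Data.Map.Strict as MS

-- Data.Map.Lazy stores the value thunk as-is; Data.Map.Strict evaluates
-- the value to WHNF before storing it.
lazyOne :: ML.Map Int Int
lazyOne = ML.insert 1 (error "hidden thunk") ML.empty   -- harmless until the value is demanded

strictOne :: MS.Map Int Int
strictOne = MS.insert 1 (error "hidden thunk") MS.empty -- throws as soon as the map is forced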

phadej commented 2 years ago

@tomjaguarpaw Data.Sequence's time complexities depend on laziness.

Read http://www.staff.city.ac.uk/~ross/papers/FingerTree.pdf

The same bounds hold in a persistent setting if subtrees are suspended using lazy evaluation. This ensures that transformations deep in the spine do not take place until a subsequent operation needs to go down that far. Because of the above properties of safe and dangerous digits, by that time enough cheap shallow operations will have been performed to pay for this expensive evaluation.

If you need a TL;DR: Data.Sequence wouldn't work as advertised if it were made strict.
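
For context, the laziness in question is localized. A simplified sketch of the finger-tree shape (close to, but not exactly, containers' internals): the recursive middle subtree of each Deep node is the deliberately lazy field the paper is describing, and a value-strict variant would additionally put bangs on the element fields, leaving only these Θ(log n) suspensions per tree.

data FingerTree a
  = EmptyT
  | Single a
  | Deep !Int !(Digit a) (FingerTree (Node a)) !(Digit a)
    -- the unbanged middle subtree is the suspension the bounds rely on

data Digit a = One a | Two a a | Three a a a | Four a a a a

data Node a = Node2 !Int a a | Node3 !Int a a a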

tomjaguarpaw commented 2 years ago

It can be made strict in its values without being strict in its structure, and therefore without defeating the time complexities. And that is indeed what the OP asked for:

The solution most friendly to users of containers such as myself would be to provide a Data.Sequence.Strict that forces the values before storing them inside the container.

-- https://github.com/haskell/containers/issues/752#issue-726329570

phadej commented 2 years ago

It can be made strict in its values without being strict in its structure, and therefore without defeating the time complexities. And that is indeed what the OP asked for:

No, it cannot. Consider map: to force all the values, it would need to traverse and force the whole structure. Insertion can be made strict, but a strict map is impossible without forcing the spine.

tomjaguarpaw commented 2 years ago

I see. I think there is a difference between how you and I understand "strict in values". I don't know what @infinity0 intended, but I mean that when the "cell holding the value" (whatever that is) is evaluated then the value is evaluated too.

So, I would call

data ValueStrictList a = Cons !a (ValueStrictList a) | Nil

"strict in values". Perhaps you would not.

In any case, that kind of behaviour is what I'm suggesting we need in a putative Data.Sequence.Strict, although in practice perhaps I support the use of Data.Strict.Sequence more strongly.
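
The two notions can be written side by side:

-- Strict spine, lazy values: the whole backbone is forced, but the
-- elements may still be thunks.
data SpineStrictList a = SCons a !(SpineStrictList a) | SNil

-- Lazy spine, strict values: the backbone unfolds on demand, but any cell
-- that does exist holds an element in WHNF.
data ValueStrictList a = VCons !a (ValueStrictList a) | VNil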

treeowl commented 2 years ago

I think the way that would match the rest of the package would use something equivalent to

map f = getSolo . traverse (\x -> Solo $! f x)
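
Spelled out as a self-contained sketch (using a local Solo-style box instead of base's Solo, to avoid depending on a particular base version; strictMap is an illustrative name):

import Data.Sequence (Seq)

-- A data (not newtype) box: pattern matching on it in fmap and (<*>) is
-- what fires the ($!) below and forces each mapped element.
data Box a = Box a

getBox :: Box a -> a
getBox (Box x) = x

instance Functor Box where
  fmap f (Box x) = Box (f x)

instance Applicative Box where
  pure = Box
  Box f <*> Box x = Box (f x)

-- Forces every f x to WHNF while traverse rebuilds the spine, so the
-- result carries no element thunks once it is demanded.
strictMap :: (a -> b) -> Seq a -> Seq b
strictMap f = getBox . traverse (\x -> Box $! f x)

The data box matters: with a newtype like Identity, the pattern matches compile away to coercions and the ($!) would never fire during the traversal.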
phadej commented 2 years ago

The

data ValueStrictList a = Cons !a (ValueStrictList a) | Nil

will contain value thunks:

Cons 1 <thunk of Cons (some-computation 2) ...>

Lazy spine but strict values doesn't make sense to me.


Data.Strict.Sequence is fine. However, don't trust the time complexities. They are amortized ones, and in a strict structure there is no such thing as an amortized time complexity. That said, if you just concatenate and split and don't really map/modify elements (as Seq is often used), then it's probably fine (at least I failed to quickly construct an example where the difference would be visible) - however, there is no proof that something advertised as O(1) doesn't become O(log n) or even O(n).

It would be nice to have e.g. the Kaplan & Tarjan deque (Purely Functional, Real-Time Deques with Catenation), where the bounds are worst-case and the structure is spine-strict (as far as I understand the paper). (EDIT: AFAIU it would differ from Data.Sequence for some operations, e.g. no efficient indexing?)

Now we turn to recent related work. In work independent of ours, Okasaki [1996; 1998] has devised a confluently persistent implementation of catenable stacks (or steques). His implementation is not real-time but gives constant amortized time bounds per operation. It is also not purely functional, but uses memoization. Okasaki uses rooted trees to represent the stacks. Elements are popped using a memoized version of the path reversal technique previously used in a data structure for the disjoint set union problem [Tarjan and Van Leeuwen 1984]. Though Okasaki’s solution is neither real-time nor purely functional, it is simpler than ours. Extending Okasaki’s method to the case of deques is an open problem. After seeing an early version of our work [Kaplan and Tarjan 1996], Okasaki [1996; 1998] observed that if amortized time bounds suffice and memoization is allowed, then all of our data structures can be considerably simplified. The idea is to perform fixes in a lazy fashion, using memoization to record the results. This avoids the need to maintain the “stack of stacks” structures in our representations, and also allows the buffers to be shorter. Okasaki called the resulting general method “implicit recursive slow-down.”


So in summary, read the literature, implement proper data structures, don't just sprinkle bangs around.

tomjaguarpaw commented 2 years ago

Lazy spine but strict values doesn't make sense to me.

Interesting. It makes sense to me. I wonder why the difference.

treeowl commented 2 years ago

@phadej, it's entirely reasonable to talk about a strict version of Data.Sequence. Yes, there will be thunks in the structure, but if you do things right, none of those thunks will produce an element value; they'll only perform restructuring work. For example,

-- Strict.* is the proposed strict module; Lazy.* is today's Data.Sequence.
(!x) Strict.<| xs = x Lazy.<| xs
xs Strict.>< ys = xs Lazy.>< ys
Strict.adjust = Lazy.adjust'
mapM f = fmap (map id) . traverse f
-- map as above:
map f = getSolo . traverse (\a -> Solo $! f a) -- This is probably not the best implementation; you'd really want a custom one.

The amortized bounds will all be perfectly fine. Mapping, traversal, zipping, etc., will be more expensive in practice because they allocate monolithic structures, but that's allowed by the amortized bounds.
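
For concreteness, a compilable rendering of that sketch (the module name is hypothetical; this is not an actual containers API):

{-# LANGUAGE BangPatterns #-}

-- Hypothetical value-strict wrappers over today's Data.Sequence.
module Data.Sequence.StrictSketch ((<|), (><), adjust) where

import Data.Sequence (Seq)
import qualified Data.Sequence as Lazy

infixr 5 <|
(<|) :: a -> Seq a -> Seq a
(!x) <| xs = x Lazy.<| xs  -- force the new element, then cons lazily

infixr 5 ><
(><) :: Seq a -> Seq a -> Seq a
(><) = (Lazy.><)  -- concatenation touches no element values; reuse as-is

adjust :: (a -> a) -> Int -> Seq a -> Seq a
adjust = Lazy.adjust'  -- Data.Sequence already ships a value-forcing adjust'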

phadej commented 2 years ago

@treeowl, if you say so, then please make Data.Sequence.Strict ASAP and close this issue.

I'll be very happy to be wrong here.

map f = getSolo . traverse (\a -> Solo $! f a)

is a very good observation. I somehow got locked into thinking that a strict map should be implemented using the lazy map (and that's impossible AFAIK), but with traverse you can indeed force the values.

treeowl commented 2 years ago

@phadej, yeah, only the thunks along the spine matter for amortization. I'll try to do it soon, unless someone else gets there first.

infinity0 commented 2 years ago

I actually already implemented a value-strict version of Sequence in strict-containers, here: https://hackage.haskell.org/package/strict-containers-0.1/docs/Data-Strict-Sequence.html

Like everything else there, Data.Strict.Sequence is auto-generated from Data.Sequence in this package, using this patch. I'd be happy to drop it there in favour of putting it in this package.

infinity0 commented 2 years ago

don't just sprinkle bangs around.

I think "sprinkling bangs" is fine if you understand the purpose of each bang/lazy part. I wrote the patch mentioned above after specifically considering the part of the paper that says:

From the paper: "In a strict language that provides a lazy evaluation primitive, we need only suspend the middle subtree of each Deep node, so only Θ(log n) suspensions are required in a tree of size n. Even in a lazy language, some space could be saved in practice by ensuring that these were the only suspensions in the tree, for example by using Haskell's strictness annotations."

In other words, making everything else strict preserves the performance bounds.

I didn't change the fmap implementation, so I think the lazy behaviour should be preserved - as long as you don't access all the values, the strictness of the values is hidden underneath the laziness of the rest of the data structure. (Likewise for the other utilities.)
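
The preserved laziness is easy to see with the ordinary fmap (shown here on plain Data.Sequence; the generated module inherits the same behaviour):

import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Even over fully forced input, each output element is an unevaluated
-- (+ 1) application until something demands it.
lazyAgain :: Seq Int
lazyAgain = fmap (+ 1) (Seq.fromList [1, 2, 3])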

phadej commented 2 years ago

@infinity0 good to hear. Could you add that clarification to https://hackage.haskell.org/package/strict-containers-0.1/docs/Data-Strict-Sequence.html? I was wrongly assuming (I'm sorry) that Data.Strict.Sequence is spine-strict as well, but there are still thunks in there.

Also, the fmap behavior may surprise people.

infinity0 commented 2 years ago

Sure, I've filed haskellari/strict-containers#4, which I'll get to the next time I dig into Haskell in more depth. Regarding the fmap behaviour, you are right: it would be partially lazy, which is both suboptimal and surprising for a value-strict data structure, so it does need to be tweaked for Data.Strict.Sequence.

I suppose the contract for a strict container would be: if the top-level structure is in WHNF, then the values should also be in WHNF. fmap and the other utilities should be aiming for that.

tomjaguarpaw commented 2 years ago

I suppose the contract for a strict container would be: if the top-level structure is in WHNF, then the values should also be in WHNF. fmap and the other utilities should be aiming for that.

I agree that is the contract that I would expect.
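
Data.Map.Strict already behaves this way, which makes the contract easy to spot-check with an error canary (illustrative only):

import qualified Data.Map.Strict as MS
import Control.Exception (ErrorCall, evaluate, try)

main :: IO ()
main = do
  -- If the contract holds, the canary must fire no later than the moment
  -- the container itself reaches WHNF.
  r <- try (evaluate (MS.insert (1 :: Int) (error "canary") MS.empty))
         :: IO (Either ErrorCall (MS.Map Int Int))
  putStrLn (either (const "value forced with the map")
                   (const "thunk survived") r)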

infinity0 commented 2 years ago

I'm fairly sure that utilities that act on a single specific item in a Data.Strict.Sequence.Seq already satisfy that contract, since (in the process of creating the new structure) they have to traverse the old structure and reach the value, thereby forcing it. But non-item-specific utilities, including whole-container ones like fmap, currently do not, and need further work.