ocaml-batteries-team / batteries-included

Batteries Included project
http://ocaml-batteries-team.github.com/batteries-included/hdoc2/

Wrong complexity in BatDequeue #192

Closed v-gb closed 12 years ago

v-gb commented 12 years ago

The main operations of BatDeque, front and rear, should be constant time (and are documented as being amortized constant time), but they are not.

When you take a look at the implementation, you can see that if you build a deque of size n by adding all the elements at the front (with cons), and then repeatedly ask for the rear element, each call to rear reverses the whole front list, so the total cost is quadratic.

This could be solved by changing front so that, when the front list is empty, it moves only half of the rear list to the front instead of the whole rear list (with the symmetric change in rear).

Even with this, the complexity would be amortized constant time only for 'single-threaded' usage of the structure. You would need some memoization, as described in Okasaki's Purely Functional Data Structures, to get amortized constant time complexity in a persistent setting. If the current behaviour is the desired one, I think the documentation should say so.
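The proposed fix can be sketched as follows. This is a minimal illustration of the half-split idea, not BatDeque's actual code; the record type and helper names are made up for this sketch:

```ocaml
(* A deque as a pair of lists (front, rear), rear stored reversed.
   When [front] is empty, move only half of [rear] to the front
   instead of the whole list; the change in [rear] is symmetric. *)
type 'a deque = { front : 'a list; rear : 'a list }

let empty = { front = []; rear = [] }

let cons x d = { d with front = x :: d.front }
let snoc d x = { d with rear = x :: d.rear }

let rec take n = function
  | x :: xs when n > 0 -> x :: take (n - 1) xs
  | _ -> []

let rec drop n = function
  | _ :: xs when n > 0 -> drop (n - 1) xs
  | l -> l

let front d =
  match d.front with
  | x :: f -> Some (x, { d with front = f })
  | [] ->
    (match d.rear with
     | [] -> None
     | r ->
       (* Keep the newest half in [rear]; the oldest half, reversed,
          becomes the new [front].  It is always non-empty here. *)
       let k = List.length r / 2 in
       let d = { front = List.rev (drop k r); rear = take k r } in
       (match d.front with
        | x :: f -> Some (x, { d with front = f })
        | [] -> assert false))
```

Splitting in half (rather than moving everything) is what keeps alternating front/rear accesses from degenerating: each rebalance leaves both ends with work proportional to what was just paid for.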

gasche commented 12 years ago

No, it's not. If you are willing to lend a hand in fixing this issue, patches would be welcome. But if the bookkeeping overhead of really-amortized structures¹ is too high, we'll consider keeping the current implementation and documenting that it is only amortized for one-directional usage patterns.

¹: or even worst-case constant time? I remember Okasaki has a real-time queue data structure; maybe the same feat can be performed on deques?

v-gb commented 12 years ago

So I compared a really amortized queue and a not-really-amortized queue (like the one in Batteries): the really amortized version is 4 times slower. With deques, you would probably end up with the same kind of slowdown. And yes, I think he also has worst-case O(1) deques, but they are probably even slower.

So, should I simply patch Batteries' deque to make it amortized constant time under the ephemeral usage pattern?

Out of curiosity, I took a look at Jane Street's Core library, and their functional queue has the same 'not really amortized' complexity problem.

gasche commented 12 years ago

the really amortized version is 4 times slower

Could you put the code somewhere? Maybe there is some optimization left to be done.

So, should I simply patch Batteries' deque to make it amortized constant time under the ephemeral usage pattern?

I'm not sure what you mean: making the documentation explicit without changing the code (in which case the usage pattern is not just "ephemeral" but moreover one-directional; your example above is also ephemeral), or changing the code to the "really amortized" version?

v-gb commented 12 years ago

So I put the code here: https://gist.github.com/1506003 (first file).

I don't understand the difference you mean between the ephemeral usage pattern and the one-directional usage pattern. What I meant by patching the deque is changing the implementation of rear and front so that the example in the second file of the gist has complexity linear in Sys.argv.(1) instead of quadratic.

thelema commented 12 years ago

Benchmark added as benchsuite/deque.ml that covers both the linear comparison between not-quite-amortized and really amortized and the quadratic behavior of Deque when pulling from alternate ends. This benchmark shows on my computer:

Time to grow and deconstruct at the opposite end a deque of 10K elements
Not-really amortized Deque (2.61 ms) is 56.9% faster than
Really amortized Deque (6.06 ms)

It also shows a per-element cost for alternate removal in batDeque ranging from 142 ns down to 7.19 ns for deques of size 10 to 1000.

gasche commented 12 years ago

Here's what I think could be tried with the existing BatDeque implementation: on a front when the front list is empty, or a rear when the rear list is empty, instead of flipping the whole list of elements from rear to front (or vice versa), flip only half of them.

v-gb commented 12 years ago

Oops, sorry about the closing/reopening thing.

Yes, reversing only half of the lists instead of the whole lists is what I suggested in the initial comment. There would still be a choice between keeping that version or making it really amortized (which would also make the following example linear):

let () =
  let rec loop q = function
    | 0 -> q
    | j -> loop (Deque.cons j q) (j - 1) in
  let n = int_of_string Sys.argv.(1) in
  let q = loop Deque.empty n in
  for i = 1 to n do
    ignore (Deque.rear q)
  done

After seeing thelema's comment, I rechecked the benchmarks. The slowdown depends on the size of the queue: 4x slower for 1k elements, 2x slower for 10k elements, 1.5x slower for 100k elements. With 100k elements, though, 85% of what is being measured is GC time.

v-gb commented 12 years ago

So I tested a few different implementations:

The 2levelqueue is really a queue and not a deque, but it was fast, so I wanted to compare it anyway.

I benchmarked these structures on 4 pieces of code and with different sizes of queues, as shown in the graphs here: http://postimage.org/gallery/52ke86w/c1ec435a/. The thick lines are for data structures that are fully amortized, the thin lines are for structures that are partially amortized, and the dotted line is for the current queue in Batteries, which is neither.

The graph called "queue snoc front rear" is the example with the quadratic behaviour shown before. The graph called "queue snoc front" is the other bench shown in one of the pastes above. In these two benches, fully amortized implementations are at a disadvantage because the deques are used in a case where partially amortized implementations work anyway.

The graph called "queue snoc front persist 0,5" is a bench where every element of the first half of the deque is looked at twice. The graph called "queue snoc front persist 0,1" is a bench where every element of the first 10% of the deque is looked at ten times. In these benches, fully persistent implementations are in a better position.

The conclusion seems to be that having fully amortized data structures is useless. Granted, in "queue snoc front persist 0,1", both versions of batdeque sort of blow up, but I can't see any case where this kind of behaviour would happen in real code. All in all, the part of the current issue where I complain about the not-really-amortized implementation seems pretty useless, a posteriori.

One thing to note, though, is that the finger tree 'none' (without any laziness) is systematically faster than everything else for large deques, and never far from the fastest structure for small deques. As a plus, operations on finger trees are O(log n) in the worst case (instead of O(n) for batdeque, for instance), which means that even if you use the structure in a way that breaks all the amortization, you get something O(n log n) and not O(n²). I think finger trees should replace the current deques.
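To give a flavor of the structure being discussed, here is a heavily simplified, measure-free finger tree sketch (only cons and front, digits of one to three elements, no laziness). This is an illustration of the idea only, not the gist's code and not the full Hinze-Paterson structure, which uses one-to-four digits and lazy middle trees:

```ocaml
type 'a digit = One of 'a | Two of 'a * 'a | Three of 'a * 'a * 'a

type 'a t =
  | Empty
  | Single of 'a
  | Deep of 'a digit * ('a * 'a) t * 'a digit
  (* the middle tree stores pairs, hence the polymorphic recursion below *)

let rec cons : 'a. 'a -> 'a t -> 'a t = fun x t ->
  match t with
  | Empty -> Single x
  | Single y -> Deep (One x, Empty, One y)
  | Deep (One a, m, sf) -> Deep (Two (x, a), m, sf)
  | Deep (Two (a, b), m, sf) -> Deep (Three (x, a, b), m, sf)
  | Deep (Three (a, b, c), m, sf) ->
    (* prefix overflow: push a pair into the middle tree *)
    Deep (Two (x, a), cons (b, c) m, sf)

let rec front : 'a. 'a t -> ('a * 'a t) option = fun t ->
  match t with
  | Empty -> None
  | Single x -> Some (x, Empty)
  | Deep (Three (a, b, c), m, sf) -> Some (a, Deep (Two (b, c), m, sf))
  | Deep (Two (a, b), m, sf) -> Some (a, Deep (One b, m, sf))
  | Deep (One a, m, sf) ->
    (match front m with
     | Some ((b, c), m') -> Some (a, Deep (Two (b, c), m', sf))
     | None ->
       (* middle tree empty: rebuild from the suffix digit *)
       (match sf with
        | One x -> Some (a, Single x)
        | Two (x, y) -> Some (a, Deep (One x, Empty, One y))
        | Three (x, y, z) -> Some (a, Deep (Two (x, y), Empty, One z))))
```

The depth of the tree is logarithmic in the number of elements because each level stores pairs of the level below, which is where the O(log n) worst-case bound comes from.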

And while I am here, I might as well give some feedback: the interface of BatDeque, with front and rear returning an option, is fine when writing things like the benchmarks, but when writing other data structures you also want an interface with functions that raise exceptions (otherwise you end up writing assert false in the None case of your code, which is no better).
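Such an exception-raising variant can be layered on top of the option-returning one. A minimal sketch, where the name front_exn and the higher-order wrapping are made up for illustration (front stands for any 'a t -> ('b * 'a t) option accessor such as BatDeque.front):

```ocaml
exception Empty

(* Wrap an option-returning accessor into one that raises on empty. *)
let front_exn front d =
  match front d with
  | Some (x, d') -> (x, d')
  | None -> raise Empty
```

With wrappers like this, callers that know the structure is non-empty get a direct result instead of an assert false in the None branch.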

gasche commented 12 years ago

That is truly excellent work, thank you very much. I felt ashamed to let the issue float around because I don't have time to work on it right now, and I'm delighted to see that you were cooking up such delicate work.

My first reaction is to take the Finger Tree and be happy with it. However, on second thought, we must look more carefully into what happens in the Q.length and n Q.snoc then n Q.front benchmarks: our queues are very often used for relatively small sizes, and if the performance degradation is 2x, that is something that must be accounted for.

Here is what I suggest:

Could you show the code for your Finger tree implementation? Did you reuse parts of Mathieu Sozeau's Coq implementation?

Remark: none of Batteries' current most frequent contributors is extremely interested in playing with fancy algorithms and data structures (or at least, none of us spends a non-negligible fraction of our time doing just that). If you're willing to contribute more frequently on these kinds of things, that would be very appreciated.

In the past we had the intention of integrating the JSP 2008 OCaml Reins project of persistent libraries, but we stopped thinking about it after an effort to downsize Batteries, and for lack of time. If you are interested in this kind of thing, you should come discuss it on the mailing-list (batteries-devel@lists.forge.ocamlcore.org).

v-gb commented 12 years ago

Some forgotten feedback: I find it very annoying to have snoc and cons with their arguments swapped, and front and rear with their return types swapped. Just like I don't like List.fold_left and List.fold_right.

My first reaction is to take the Finger Tree and be happy with it. However, on second thought, we must look more carefully into what happens in the Q.length and n Q.snoc then n Q.front benchmarks: our queues are very often used for relatively small sizes, and if the performance degradation is 2x, that is something that must be accounted for.

If the performance degradation is too much for queues, then the 2levelqueue structure is even closer to batdeque for small queues, while behaving better asymptotically and well in all use cases. I should probably show the data, not just the graphs; that would be more convenient: https://gist.github.com/1543810.

Could you show the code for your Finger tree implementation? Did you reuse parts of Mathieu Sozeau's Coq implementation?

No, I did not reuse anything. I assumed the OCaml extracted from his implementation would be inefficient because it would build proof terms. I started looking at the Coq file just in case, but that wasn't helping: I wanted to see where the laziness was, and I am not sure there was any. Anyway, the code is here: https://gist.github.com/1543810. This version of the finger tree is the initial one described in the paper; there are no annotations on the nodes.

Remark: none of Batteries' current most frequent contributors is extremely interested in playing with fancy algorithms and data structures (or at least, none of us spends a non-negligible fraction of our time doing just that). If you're willing to contribute more frequently on these kinds of things, that would be very appreciated.

Are you missing some structures in particular? I have a functional hash table (the same thing as the persistent map in Clojure, or the hash map in www.johantibell.com/files/hiw-2011.pdf), but I still don't know if it is of any use: if you want something fast, you might as well go for a Hashtbl, right?

In the past we had the intention of integrating the JSP 2008 OCaml Reins project of persistent libraries, but we stopped thinking about it after an effort to downsize Batteries, and for lack of time. If you are interested in this kind of thing, you should come discuss it on the mailing-list (batteries-devel@lists.forge.ocamlcore.org).

Huh, I had never heard of that library, but it looks interesting.

v-gb commented 12 years ago

Oops, I thought I had answered a few days ago, but I had probably only previewed my comment.

Some forgotten feedback: I find it very annoying to have snoc and cons with their arguments swapped, and front and rear with their return types swapped. Just like I don't like List.fold_left and List.fold_right.

My first reaction is to take the Finger Tree and be happy with it. However, on second thought, we must look more carefully into what happens in the Q.length and n Q.snoc then n Q.front benchmarks: our queues are very often used for relatively small sizes, and if the performance degradation is 2x, that is something that must be accounted for.

Note that if you only care about queue operations, it is more efficient to have separate queue and deque implementations, since a queue doesn't need to transfer only half of the lists.
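The point about one-ended queues can be illustrated with the classic "batched queue" (as in Okasaki). This sketch is for illustration, not the Batteries code: since elements only ever move from rear to front, reversing the whole rear list at once is safe, and each element is reversed at most once under ephemeral use:

```ocaml
(* A batched queue: [f] is the front in order, [r] the rear reversed. *)
type 'a queue = { f : 'a list; r : 'a list }

let empty = { f = []; r = [] }

let snoc q x = { q with r = x :: q.r }

let front q =
  match q.f with
  | x :: f -> Some (x, { q with f = f })
  | [] ->
    (* Front exhausted: reverse the whole rear list in one go. *)
    (match List.rev q.r with
     | [] -> None
     | x :: f -> Some (x, { f; r = [] }))
```

A deque cannot use this wholesale reversal, because the other end would then find its list empty again immediately; that is why the half-split is needed there.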

Could you show the code for your Finger tree implementation? Did you reuse parts of Mathieu Sozeau's Coq implementation?

No, I did not reuse anything. I assumed the OCaml extracted from his implementation would be inefficient because it would build proof terms. I started looking at the Coq file just in case, but that wasn't helping: I wanted to see where the laziness was, and I am not sure there was any. Anyway, the code is here: https://gist.github.com/1543810. This version of the finger tree is the initial one described in the paper; there are no annotations on the nodes. I added the annotations afterwards, and it slows the structure down noticeably (it becomes a little slower than the fingertreepartial).

Remark: none of Batteries' current most frequent contributors is extremely interested in playing with fancy algorithms and data structures (or at least, none of us spends a non-negligible fraction of our time doing just that). If you're willing to contribute more frequently on these kinds of things, that would be very appreciated.

Are you missing some structures in particular? I have a functional hash table (the same thing as the persistent map in Clojure, or the hash map in www.johantibell.com/files/hiw-2011.pdf), but I still don't know if it is of any use: if you want something fast, you might as well go for a Hashtbl, right?

In the past we had the intention of integrating the JSP 2008 OCaml Reins project of persistent libraries, but we stopped thinking about it after an effort to downsize Batteries, and for lack of time. If you are interested in this kind of thing, you should come discuss it on the mailing-list (batteries-devel@lists.forge.ocamlcore.org).

Huh, I had never heard of that library, but it looks interesting.

thelema commented 12 years ago

I'd like to blame an error on your part, but I have to admit that I read your message and didn't reply. I like the ideas you're proposing, and will accept patches along these lines into Batteries. If you have time to 1) fix BatDeque's performance, 2) add BatFingerTree, and 3) change the argument order on cons, snoc, etc. (most important to do soon, before the v2.0 release), please send pull requests. They will get faster processing, I promise.

gasche commented 12 years ago

I'm not totally convinced by the idea of changing the cons and snoc order. I wouldn't be pained to see them change, personally, but I find them quite reasonable -- and I dare to think the argument order of fold_left and fold_right is the right one¹; the Haskell people got them wrong. I'm not sure it's worth breaking backward compatibility over such aesthetic tastes. But maybe you have stronger arguments as to why it would be more consistent to change; that could be convincing.

¹: fold_right is a morphism from 'a -> 'b -> 'b to 'a t -> 'b -> 'b, because cons has type 'a -> 'a list -> 'a list; fold_left has its accumulator on the left, 'b -> 'a -> 'b, so it naturally transforms into 'b -> 'a t -> 'b.
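The footnote's typing argument can be checked on concrete lists: folding cons with fold_right rebuilds the list unchanged, while fold_left, with its accumulator on the left, reverses it when given a flipped cons:

```ocaml
let () =
  (* fold_right lifts cons : 'a -> 'a list -> 'a list pointwise,
     so folding cons over a list rebuilds it. *)
  assert (List.fold_right (fun x acc -> x :: acc) [1; 2; 3] [] = [1; 2; 3]);
  (* fold_left's accumulator is on the left ('b -> 'a -> 'b);
     folding a flipped cons reverses the list. *)
  assert (List.fold_left (fun acc x -> x :: acc) [] [1; 2; 3] = [3; 2; 1])
```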

v-gb commented 12 years ago

I didn't know how folding was done in Haskell, and foldr is definitely weird because it doesn't just lift an operation on 'a to an operation on 'a list, it also swaps the arguments.

What I don't like is:

  1. I find it annoying to have to swap the arguments and parameters when changing a fold_left into a fold_right, and vice versa. Plus, I use List.fold_left all the time, and pretty often I mix up the arguments when I fold over a Hashtbl or a Map.
  2. I don't think the argument that fold_right has the same shape as cons holds, because Hashtbl.fold doesn't have the same shape as Hashtbl.add.
  3. It prevents you from having List.fold_left, List.fold_right and a List.fold with an unspecified traversal order. This is very useful, for the same reason that it is very useful to have a function similar to List.append that doesn't guarantee the order: instead of making it look to people reading your code like the order of the operations matters, you explicitly say that you don't care.

I don't have arguments that say that swapping the cons or snoc arguments would be more consistent. I just think that the names of the functions already say what they do, so you don't need the argument order to also reflect what the functions do. Instead, you can aim for uniformity by making cons, snoc, front, rear, fold_left, fold_right, etc. have the same shape. Obviously, it is way too late to change the type of fold_left and fold_right, so perhaps it is actually better to keep the rest as is. (As a side note, I don't think a lot of compatibility would be broken here, since the deque is only a few weeks old, if I am not mistaken.)

For now, I guess I'll simply patch the deques and add finger trees. When is the release supposed to be?

thelema commented 12 years ago

@gasche Well, maybe the Jane Street Core people are right and fold_left should have its accumulator in the same place as fold_right. I appreciate @sliquister bringing up argument order, as it trips me up often enough. I'm tempted to have all non-stdlib folds default to ~f and ~x0 as names for the folding function and initial value, as well as a consistent order, with the structure being folded over as the first argument.