noprompt / meander

Tools for transparent data transformation
MIT License
921 stars 54 forks source link

`scan` and `vscan` compiled poorly #45

Closed noprompt closed 5 years ago

noprompt commented 5 years ago

The scan and vscan operators currently have no custom compilation. They are parsed as if one had written them as

(app seq (_ ... <pattern> . _ ...))

and

[_ ... <pattern> . _ ...]

respectively. Thus they have no AST node of their own and are compiled through usual mechanisms for compiling seq and vector patterns. In this case, it results in less than optimal code for something which gives us enough information to do better.

Because we know scan and vscan are looking for subsequences of a given length we can simply use lean on the Clojure standard library.

For subsequences of length 1 we can use the target collection itself as the search space using reduce.

;; match/find
(reduce
 (fn [_ x]
   (let [result (pattern-match x)]
     (if (identical? result FAIL)
       result
       (reduced result))))
 FAIL
 target-coll)

;; search
(mapcat
 (fn [_ x]
   (let [results (pattern-match x)]
     (if (fail? results)
       nil
       results)))
 target-coll)

For subsequences of length greater than one we could use clojure.core/partition for the search space.

This will require the following:

scan-cat should look for something which resembles

{:tag :prt,
 :left {:tag :drp},
 :right {:tag :prt,
         :left {:tag :cat,
                :elements (,,,)},
         :right {:tag :prt,
                 :left {:tag :drp},
                 :right {:tag :cat, :elements []}}}}

and return

{:tag :cat,
 :elements (,,,)}

and nil otherwise. If we get a result then we can compile it using op-search/op-find with the search spaces suggested. This should probably happen in the [true true] clause of :prt compilation.

What's nice about this approach is that we can improve the "handwritten" versions as well.

noprompt commented 5 years ago

@jimmyhmiller @xificurC I just applied a patch to make the performance of scan, vscan, and patterns which look similar e.g. [_ ... thing . _ ...] perform better.

(time
 (dotimes [_ 100000]
   (r.match/find [1 2 3 4 5]
     [_ ... 4 . _ ...]
     :four)))
;; => :four
;; Old
"Elapsed time: 389.033882 msecs"
;; New
"Elapsed time: 61.076688 msecs"

This is still going to be slower than filter, some, etc. by an order of magnitude for these kinds of searches and it is still possible to improve this further by doing a bit more analysis. More to come! 😅