noprompt / meander

Tools for transparent data transformation
MIT License
923 stars 55 forks source link

Gather operator #59

Closed jimmyhmiller closed 5 years ago

jimmyhmiller commented 5 years ago

m/scan is an incredibly useful operator when searching for things. It would be great to have something like it for match. More or less, this would act like filter. I made an operator and called it gather. Here is the preliminary definition.

(r.syntax/defsyntax gather
  ([pattern]
   (gather pattern '...))
  ([pattern repeat-pattern]
   (case (::r.syntax/phase &env)
     :meander/match
     `(m/seqable (m/or ~pattern ~'_) ~repeat-pattern)
     ;; else
     `[~pattern ...])))

This definition works in the simple case (there is in epsilon bug with or and maps being combined, but we will ignore that). Below are some simple cases.

(m/match [1 2 3 4 5 6]
  (gather (m/pred even? !xs) ...)
  !xs)
;; => [2 4 6

(m/match [[1 2 3] [1 2] [3 2] [56] 3]
  (gather [_ !xs] ...)
  !xs)
;; => [2 2]

The problem with this definition is that the repeat-pattern doesn't work the way you'd intuitively think it would. For example:


(m/match [1 1 1 1 1 1 1 1]
  (gather (m/pred even? !xs) ..1)
  !xs)

;; => []

Ideally, this pattern would fail. Because what we want to express is we found 1 or more even numbers, not just one or more item in the collection. So far, I have been unable to figure out a way to write this.

I was hoping maybe someone could help. I think this is an incredibly useful operator, but only if the repeat symantics work properly. This is especially important when capturing groups (ie ..!n).

noprompt commented 5 years ago

There are a few things that we should consider for this operator.

First, if the first argument pattern contains unbound logic variables we'll get an error when we check the syntax.

(match [1 2 3]
  (gather ?x)
  ?x)
;; =>
;; Every pattern of an or pattern must have references to the same
;; unbound logic variables.

Fortunately, there is information in the exception to help us figure out that gather is responsible.

{:env {:lvrs #{},
       :mvrs #{},
       :refs {},
       :path [(gather ?x)
              (meander.match.syntax.epsilon/apply clojure.core/seq ((meander.epsilon/or ?x _)))
              ((meander.epsilon/or ?x _))]},
 :problems [{:pattern _, :absent #{?x}}],
 :syntax-trace [(meander.epsilon/or ?x _)
                ((meander.epsilon/or ?x _))
                (meander.match.syntax.epsilon/apply clojure.core/seq ((meander.epsilon/or ?x _)))
                (gather ?x)]}

It is possible for the parsing step to collect this information and make it available to syntax definitions but we'll need to make some changes to how we operate the parser.

I think we should constrain the first argument to be free of logic variables.

The second consideration I have is with the second argument. I think it should be a nat-int? (explicit amount), a memory variable (account for how many were collected), or a logic variable (was the right amount collected?). This extension should construct the ellipsis.

The third consideration I have — and this should be true for all operators we expose in the primary namespace — is that we always default to &form when the phase does not match.

Regarding

(m/match [1 1 1 1 1 1 1 1]
  (gather (m/pred even? !xs) ..1)
  !xs)

;; => []

Ideally, this pattern would fail.

This pattern won't fail unless the target is empty. As gather is defined the behavior is correct. However, if you wish to say, instead, that the number of items you collected via the gather is some number n that requires a different approach.

Essentially what you need to do here is create a fresh memory variable for the gathering step and then, depending on the second argument, use a guard to verify the amount collected.

Here's a working example of this defined in the meander.epsilon namespace.

(r.syntax/defsyntax gather
  ([pattern]
   (case (::r.syntax/phase &env)
     :meander/match
     `(seqable (or ~pattern ~'_) ...)

     ;; else
     &form))
  ([pattern min]
   (case (::r.syntax/phase &env)
     :meander/match
     (clj/let [ellipsis (clj/symbol (str ".." min))
               collector (gensym "!")]
       `(and (seqable (or (and ~pattern ~collector) ~'_) ~ellipsis)
             (guard (<= ~min (count ~collector)))))

     ;; else
     &form)))

Note this implementation only assumes only the nat-int? case for min and does not require the constraints I recommended.

Here are some examples:

(match [1 2 1 1 1 4 1 1]
  (gather (pred even? !evens) 2)
  !evens)
;; => [2 4]

(match [1 1 1 1 1 4 1 1]
  (gather (pred even? !evens) 2)
  !evens)
;; => non exhaustive pattern match
jimmyhmiller commented 5 years ago

Thanks for the thorough dive. The idea of creating a fresh memory variables is interesting. Though it does seem a bit wasteful. It is a collection I am making that will just be thrown away. I tried something similar with mutatable variables. But they don't seem actually mutatable, just rebindable?

I will explore using that idea implementation wise though at least as a starting point, thanks!

As for the nat-int suggestion, I would rather not do that because I want to be able to write (gather (pred even? !xs) ..?n)). So I would like to keep it as a repeat operator.

noprompt commented 5 years ago

Though it does seem a bit wasteful. It is a collection I am making that will just be thrown away.

I hear you. We could use a volatile!.

(r.syntax/defsyntax gather
  ([pattern]
   (case (::r.syntax/phase &env)
     :meander/match
     `(seqable (or ~pattern ~'_) ...)

     ;; else
     &form))
  ([pattern min]
   (case (::r.syntax/phase &env)
     :meander/match
     (clj/let [ellipsis (clj/symbol (str ".." min))
               counter (gensym "?")]
       `(let [~counter (volatile! 0)]
          (and (seqable (or (and ~pattern (let [~'_ (vswap! ~counter inc)]))
                            ~'_)
                        ~ellipsis)
               (guard (<= ~min (deref ~counter))))))

     ;; else
     &form)))

In this case there would likely be less garbage. There is no apparent difference with respect to performance between the two implementations. I also tried transient with the same results. A volatile! would be the cheapest.

So I would like to keep it as a repeat operator.

I'm okay with accepting the ellipsis in addition to the other types I mentioned e.g. the second argument may be either nat-int?, memory variable, logic variable, or ellipsis.

(match [1 2 1 1 1 4 1 1]
  (let [?n 2]
    (gather (pred even? !evens) ?n))
  !evens)
;; <=>
(match [1 2 1 1 1 4 1 1]
  (let [?n 2]
    (gather (pred even? !evens) ..?n))
  !evens)
;; => [2 4]
noprompt commented 5 years ago

In the case of this operator, I'm not a fan of the allowance of the ellipsis because it is not being used as a modifier as usual i.e. as it can be used in seqable. It could be confusing but, hey, that what documentation is for, right? But I'm still open to allowing it because I see little harm and because it would add very little additional code. Once we've determined its an ellipsis and parsed out a nat-int?, logic variable, or memory variable, the subsequent code would be usable by those three other data types anyway.

jimmyhmiller commented 5 years ago

Here is a simplified version of what I was attempting to do that made me believe that the ellipsis should behave that way.

(m/rewrite [{:name "jimmy"
             :addresses [{:address "Stuff"
                          :business true}
                         {:address "Other Stuff"
                          :business true}
                         {:address "Stuff"
                          :business false}]}
            {:name nil
             :addresses [{:address "Stuff2"
                          :business true}
                         {:address "Stuff2"
                          :business false}]}
             {:name "falcon"
             :addresses [{:address "Falcons Address"
                          :business true}
                         {:address "Stuff2"
                          :business false}]}]

  (gather {:name (m/pred string? !names)
           :addresses (gather {:business true
                               :as !addresses} ..!n)})

  [!names [!addresses ..!n] ...])

To me this very succinctly says to find all the business addresses for people with names. Using the ..!n to capture how many business addresses there are seems completely reasonable. But if we made it behave the way it does in my initial definition above, we actually get strange behavior. Here is the output.

["jimmy"
 [{:address "Stuff", :business true}
  {:address "Other Stuff", :business true}
  {:address "Falcons Address", :business true}]]

No mention of "falcon" at all. but his address is here. I can of course explain to myself why this is happening, but when would I want that behavior? If I gather just a subset of things as a group, when I substitute, I want that group to only represent the number I captured, the intuitive output this.

["jimmy"
 [{:address "Stuff", :business true}
  {:address "Other Stuff", :business true}]
 "falcon"
 [{:address "Falcons Address", :business true}]]

If we didn't think about the use we'd have for captured groups in substitution then I could almost agree with you, but I think once they come into the picture things become more clear.

noprompt commented 5 years ago

This is all sorted out on the epsilon branch. If we need to revisit this discussion, we'll can reopen this ticket.