ring-clojure / ring-codec

Utility library for encoding and decoding data
MIT License

Improve performance of form-decode #34

Closed: bsless closed this 3 years ago

bsless commented 3 years ago

Improves performance by avoiding regular expressions and allocation of immutable collections where possible
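
A rough sketch of the idea (not the actual diff; split-pairs is just an illustrative name): gather the "&"-separated pairs with a java.util.StringTokenizer instead of (str/split encoded #"&"), so there is no regex machinery and no intermediate seq of substrings.

(import 'java.util.StringTokenizer)

;; Illustrative only: collect the "&"-separated pairs without a regex.
(defn- split-pairs [^String encoded]
  (let [st (StringTokenizer. encoded "&")]
    (loop [acc (transient [])]
      (if (.hasMoreTokens st)
        (recur (conj! acc (.nextToken st)))
        (persistent! acc)))))

(split-pairs "a=1&b=2") ;=> ["a=1" "b=2"]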

weavejester commented 3 years ago

How did this change benchmark in comparison to the existing code? How does it compare to splitting on a string? And what happens if a tokenizer is used only to split on & and not on =?

bsless commented 3 years ago
weavejester commented 3 years ago

I had it in my head that String.split had an overload for splitting on a char (and therefore avoiding the overhead of a regex), but I was thinking of String.replace. Sorry for the confusion.
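
For reference, a quick REPL check of the relevant String overloads (String.split only takes a regex supplied as a String; String.replace is the one with a (char, char) overload):

(seq (.split "a=b&c=d" "&"))  ;=> ("a=b" "c=d"), the "&" argument is compiled as a regex
(.replace "a+b+c" \+ \space)  ;=> "a b c", the (char, char) overload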

I'd appreciate the benchmark results, as it helps document the decisions made in the project. We can trace a commit back to a PR, and from the PR we can see how much impact a particular performance change made.

bsless commented 3 years ago

Apologies for the delay. Here are the benchmark results, presented nicely and reproducibly:

(ns user
  (:require
   [ring.util.codec :refer [form-decode]]
   [criterium.core :as cc]))

(def forms
  ["foo"
   "a=b"
   "a=b&c=d"
   "foo+bar"
   "a=b+c"
   "a=b%2Fc"
   "a=b&c"
   "a=&b=c"
   "a&b=c"
   "="
   "a="
   "=b"
   "a=1&b=2&c=3&d=4"
   "a=1&b=2&c=3&d=4&x=1&y=2&z=3&u=4"
   "aaaaaaaaaaaaa=111111111111111&bbbbbbbbbbbbbbb=2222222222222"])

(doseq [form forms]
  (let [res (first (:mean (cc/quick-benchmark (form-decode form) {})))
        [scale unit] (cc/scale-time res)
        mean (* scale res)]
    (println form mean unit)))

;;; Case                            Original                 PR
;;; foo                             16.165265103567577 ns -> 18.071835515056534 ns
;;; a=b                             281.7666082195549  ns -> 85.34618203356219  ns
;;; a=b&c=d                         520.3155533101125  ns -> 192.88535391228297 ns
;;; foo+bar                         28.89783207896534  ns -> 33.96275410375853  ns
;;; a=b+c                           288.6556301231446  ns -> 100.12228382735458 ns
;;; a=b%2Fc                         323.94066534536466 ns -> 129.9906610438482  ns
;;; a=b&c                           470.7898489852522  ns -> 209.38083310542123 ns
;;; a=&b=c                          561.700251137112   ns -> 204.26021604451992 ns
;;; a&b=c                           481.1326203353287  ns -> 198.07185181140767 ns
;;; =                               252.35371110995692 ns -> 72.2746499966358   ns
;;; a=                              263.9456664837952  ns -> 84.74253809189693  ns
;;; =b                              262.19896201858967 ns -> 85.57113803290105  ns
;;; a=1&b=2&c=3&d=4                 1.0210454463867895 µs -> 509.19558470486334 ns
;;; a=1&b=2&c=3&d=4&x=1&y=2&z=3&u=4 2.0973807618522025 µs -> 1.166763297769531  µs

;;; aaaaaaaaaaaaa=111111111111111&bbbbbbbbbbbbbbb=2222222222222
;;; 840.1896265647041  ns -> 471.0025293913937  ns
bsless commented 3 years ago

@weavejester this one is also ready for review

weavejester commented 3 years ago

Let's see if we can't factor out the tokenizer a little. What if we created a reducible:

(defn tokenized [s delim]
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (let [tokenizer (StringTokenizer. s delim)]
        (loop [result init]
          (if (.hasMoreTokens tokenizer)
            (recur (f result (.nextToken tokenizer)))
            result))))))

Then we can just replace (str/split encoded #"&") with (tokenized encoded "&") in the original code.
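
For example (illustrative REPL check), reducing over it yields the same tokens the regex split would:

(reduce conj [] (tokenized "a=b&c=d" "&")) ;=> ["a=b" "c=d"]
(into [] (tokenized "a=b&c=d" "&"))        ;=> ["a=b" "c=d"]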

bsless commented 3 years ago

This would only address the outer tokenizer loop, though.

I don't find tokenizers and interop too noisy, but if the goal is enhancing readability, there are a few alternatives:

  1. Reduce externally, keep the internal loop the same (see the note at the end):
(defn form-decode
  "Decode the supplied www-form-urlencoded string using the specified encoding,
  or UTF-8 by default. If the encoded value is a string, a string is returned.
  If the encoded value is a map of parameters, a map is returned."
  ([encoded]
   (form-decode encoded "UTF-8"))
  ([^String encoded encoding]
   (if-not (.contains encoded "=")
     (form-decode-str encoded encoding)
     (reduce
      (fn [m ^String param]
        (let [param-tokenizer (StringTokenizer. param "=")
              k (if (.startsWith param "=")
                  ""
                  (if (.hasMoreTokens param-tokenizer)
                    (.nextToken param-tokenizer)
                    ""))
              v (if (.hasMoreTokens param-tokenizer) (.nextToken param-tokenizer) "")
              k (form-decode-str k encoding)
              v (form-decode-str v encoding)]
          (if (and k v)
            (assoc-conj m k v)
            m)))
      {}
      (tokenized encoded "&")))))
  2. Try to improve readability around the most optimal code; this solution emits exactly the same bytecode as the original:
(definline tokenizer
  [^String s ^String token]
  `(StringTokenizer. ~s ~token))

(definline has-more?
  [^StringTokenizer st]
  `(.hasMoreTokens ~st))

(defn next-token
  {:inline
   (fn
     ([st] `(.nextToken ~st))
     ([st nf] `(if (has-more? ~st) (.nextToken ~st) ~nf)))
   :inline-arities #{1 2}}
  ([^StringTokenizer st]
   (.nextToken st))
  ([^StringTokenizer st nf]
   (if (has-more? st) (.nextToken st) nf)))

(defn form-decode
  "Decode the supplied www-form-urlencoded string using the specified encoding,
  or UTF-8 by default. If the encoded value is a string, a string is returned.
  If the encoded value is a map of parameters, a map is returned."
  ([encoded]
   (form-decode encoded "UTF-8"))
  ([^String encoded encoding]
   (if-not (.contains encoded "=")
     (form-decode-str encoded encoding)
     (let [form-tokenizer (tokenizer encoded "&")]
       (loop [m {}]
         (if (has-more? form-tokenizer)
           (let [^String param (next-token form-tokenizer)
                 ^StringTokenizer param-tokenizer (tokenizer param "=")
                 k (if (.startsWith param "=")
                     ""
                     (next-token param-tokenizer ""))
                 k (form-decode-str k encoding)
                 v (form-decode-str (next-token param-tokenizer "") encoding)]
             (recur (if (and k v) (assoc-conj m k v) m)))
           m))))))

;;; Alternative

(defn form-decode
  "Decode the supplied www-form-urlencoded string using the specified encoding,
  or UTF-8 by default. If the encoded value is a string, a string is returned.
  If the encoded value is a map of parameters, a map is returned."
  ([encoded]
   (form-decode encoded "UTF-8"))
  ([^String encoded encoding]
   (if-not (.contains encoded "=")
     (form-decode-str encoded encoding)
     (let [form-tokenizer (tokenizer encoded "&")]
       (loop [m {}]
         (if-let [^String param (next-token form-tokenizer nil)]
           (let [^StringTokenizer param-tokenizer (tokenizer param "=")
                 k (if (.startsWith param "=")
                     ""
                     (next-token param-tokenizer ""))
                 k (form-decode-str k encoding)
                 v (form-decode-str (next-token param-tokenizer "") encoding)]
             (recur (if (and k v) (assoc-conj m k v) m)))
           m))))))

If you can find a better name for the param-tokenizer binding, the k binding could be compressed to a line under 80 characters, too.

  3. Reduce externally, keep the old function body (defeats the purpose in my opinion, as the highest cost we pay here is the regular expressions):
(defn form-decode
  "Decode the supplied www-form-urlencoded string using the specified encoding,
  or UTF-8 by default. If the encoded value is a string, a string is returned.
  If the encoded value is a map of parameters, a map is returned."
  ([encoded]
   (form-decode encoded "UTF-8"))
  ([^String encoded encoding]
   (if-not (.contains encoded "=")
     (form-decode-str encoded encoding)
     (reduce
      (fn [m param]
        (let [[k v] (str/split param #"=" 2)
              k     (form-decode-str k encoding)
              v     (form-decode-str (or v "") encoding)]
          (if (and k v)
            (assoc-conj m k v)
            m)))
      {}
      (tokenized encoded "&")))))

NOTE:

Performance differences for the reified reducible tokenizer (solution 1):

;;; Case                            Original                 PR                       Reducible + tokenizer    Reducible + regex
;;; foo                             16.165265103567577 ns -> 18.071835515056534 ns -> 17.320719336705146 ns -> 17.604696821988536 ns
;;; a=b                             281.7666082195549  ns -> 85.34618203356219  ns -> 100.27655876859055 ns -> 225.0079659412358  ns
;;; a=b&c=d                         520.3155533101125  ns -> 192.88535391228297 ns -> 231.03698842087675 ns -> 492.6401809038521  ns
;;; foo+bar                         28.89783207896534  ns -> 33.96275410375853  ns -> 29.902639183635433 ns -> 32.411155123899185 ns
;;; a=b+c                           288.6556301231446  ns -> 100.12228382735458 ns -> 113.89209010375781 ns -> 223.6006611409361  ns
;;; a=b%2Fc                         323.94066534536466 ns -> 129.9906610438482  ns -> 136.2942337242038  ns -> 292.5329924798438  ns
;;; a=b&c                           470.7898489852522  ns -> 209.38083310542123 ns -> 210.61152438110452 ns -> 391.77510361071813 ns
;;; a=&b=c                          561.700251137112   ns -> 204.26021604451992 ns -> 224.44069615994098 ns -> 435.42927531090487 ns
;;; a&b=c                           481.1326203353287  ns -> 198.07185181140767 ns -> 220.36356595037435 ns -> 394.3551788952452  ns
;;; =                               252.35371110995692 ns -> 72.2746499966358   ns -> 83.88891707876553  ns -> 195.03739058045346 ns
;;; a=                              263.9456664837952  ns -> 84.74253809189693  ns -> 98.03556783118877  ns -> 212.27128534369646 ns
;;; =b                              262.19896201858967 ns -> 85.57113803290105  ns -> 98.35710802210014  ns -> 197.75605310148035 ns
;;; a=1&b=2&c=3&d=4                 1.0210454463867895 µs -> 509.19558470486334 ns -> 529.099610694873   ns -> 950.2181940488202  ns
;;; a=1&b=2&c=3&d=4&x=1&y=2&z=3&u=4 2.0973807618522025 µs -> 1.166763297769531  µs -> 1.1895777987654739 µs -> 2.1154829021936083 µs

;;; aaaaaaaaaaaaa=111111111111111&bbbbbbbbbbbbbbb=2222222222222
;;; 840.1896265647041  ns -> 471.0025293913937  ns -> 481.37886251787927 ns -> 726.6601651755639 ns

We can consistently see a slight overhead from the reducible, but in terms of performance the impact isn't great. I don't believe the alternative implementations improve readability, though.

What do you think?

Edit: added results for option 3 for completeness; as predicted, there is almost no improvement.

weavejester commented 3 years ago

Regarding the inner tokenizer, perhaps we just use indexOf instead:

(defn- split-key-value-pair [^String s delim]
  (let [idx (.indexOf s delim)]
    (if (pos? idx)
      (clojure.lang.MapEntry. (.substring s 0 idx) (.substring s (inc idx)))
      (clojure.lang.MapEntry. s s))))

Though I believe we need to force an integer argument:

(split-key-value-pair param #=(int \=))
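
(#= is read-eval, so #=(int \=) becomes the literal int 61 at read time and the compiler can pick the indexOf(int) overload; a quick check:)

#=(int \=)           ;=> 61, evaluated at read time
(.indexOf "a=b" 61)  ;=> 1, the indexOf(int) overload
(.indexOf "a=b" "=") ;=> 1, the indexOf(String) overload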
bsless commented 3 years ago

I think we have a winner. Besides the small overhead of tokenized, the split is actually faster, especially for longer strings:

(let [ch (int \=)
      _0 (int 0)]
  (defn- split-key-value-pair [^String s]
    (let [idx (.indexOf s ch)
          jdx (unchecked-inc-int idx)]
      (if (neg? idx)
        (MapEntry. s "")
        (if (zero? idx)
          (MapEntry. "" (.substring s jdx))
          (MapEntry. (.substring s _0 idx) (.substring s jdx)))))))

(defn form-decode
  "Decode the supplied www-form-urlencoded string using the specified encoding,
  or UTF-8 by default. If the encoded value is a string, a string is returned.
  If the encoded value is a map of parameters, a map is returned."
  ([encoded]
   (form-decode encoded "UTF-8"))
  ([^String encoded encoding]
   (if-not (.contains encoded "=")
     (form-decode-str encoded encoding)
     (reduce
      (fn [m param]
        (let [kv (split-key-value-pair param)
              k  (form-decode-str (key kv) encoding)
              v  (form-decode-str (val kv) encoding)]
          (if (and k v)
            (assoc-conj m k v)
            m)))
      {}
      (tokenized encoded "&")))))
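
Behaviour should be unchanged from the current implementation, e.g.:

(form-decode "a=1&b=2") ;=> {"a" "1", "b" "2"}
(form-decode "a=1&a=2") ;=> {"a" ["1" "2"]}
(form-decode "foo+bar") ;=> "foo bar"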
weavejester commented 3 years ago

That looks fine - though I'm confused by the use of _0 over 0, and ch over #=(int \=).

Also, how much performance do we actually gain using unchecked-inc-int?

bsless commented 3 years ago

I should have benchmarked it. I don't see any noticeable speedups, and if anything the simpler version is faster. I also see no effect from using unchecked-inc. Let's go with the more readable option:

(defn- split-key-value-pair [^String s]
  (let [i (.indexOf s #=(int \=))
        j (inc i)]
    (if (pos? i)
      (MapEntry. (.substring s 0 i) (.substring s j))
      (if (zero? i)
        (MapEntry. "" (.substring s j))
        (MapEntry. s "")))))
weavejester commented 3 years ago

What about using cond and inlining the j? I don't think factoring it out helps clarity:

(defn- split-key-value-pair [^String s]
  (let [i (.indexOf s #=(int \=))]
    (cond
      (pos? i)  (MapEntry. (.substring s 0 i) (.substring s (inc i)))
      (zero? i) (MapEntry. "" (.substring s (inc i)))
      :else     (MapEntry. s ""))))
bsless commented 3 years ago

I don't mind; I'm just biased against cond because it doesn't have a default clause the way case does. Pushed the change.

bsless commented 3 years ago

@weavejester anything else left to do? Are the PR and commit messages alright?

bsless commented 3 years ago

@weavejester I've made the requested changes here and in ring-codec.

weavejester commented 3 years ago

It looks like the commit message hasn't been altered yet?

bsless commented 3 years ago

@weavejester I apologize, I missed it. Done as well.

bsless commented 3 years ago

@weavejester Done with the requested change, and I've changed the PR title as well to align with the commit message.