Closed bsless closed 3 years ago
What did you benchmark this change to be in comparison to the existing code? How does it compare to splitting on a string? What happens if a tokenizer is used onto to split on &
and not =
?
a=1&b=2&c=3&d=4&e=5
. I don't mind rerunning the benchmarks and sharing the results here. If there are some pathological cases which you think might perform badly I would like to bench them as well. I also used clj-async-profiler to generate flame graphs.str/split
also allocates a vector which is later traversed by reduce. Iterating over an array with indexed access would have been faster. Thus, the implementation avoids regular expressions, dispatching via protocols, destructuring and some allocations.I had it in my head that String.split
had an overload for splitting on a char
(and therefore avoiding the overhead of a regex), but I was thinking of String.replace
. Sorry for the confusion.
I'd appreciate the benchmark results, as it helps document the decisions made in the project. We can trace a commit back to a PR, and from the PR we can see how much impact a particular performance change made.
Apologies for the delay. Here are the changes, presented nicely and reproducible:
(ns user
(:require
[ring.util.codec :refer [form-decode]]
[criterium.core :as cc]))
(def forms
["foo"
"a=b"
"a=b&c=d"
"foo+bar"
"a=b+c"
"a=b%2Fc"
"a=b&c"
"a=&b=c"
"a&b=c"
"="
"a="
"=b"
"a=1&b=2&c=3&d=4"
"a=1&b=2&c=3&d=4&x=1&y=2&z=3&u=4"
"aaaaaaaaaaaaa=111111111111111&bbbbbbbbbbbbbbb=2222222222222"])
(doseq [form forms]
(let [res (first (:mean (cc/quick-benchmark (form-decode form) {})))
[scale unit] (cc/scale-time res)
mean (* scale res)]
(println form mean unit)))
;;; foo 16.165265103567577 ns -> 18.071835515056534 ns
;;; a=b 281.7666082195549 ns -> 85.34618203356219 ns
;;; a=b&c=d 520.3155533101125 ns -> 192.88535391228297 ns
;;; foo+bar 28.89783207896534 ns -> 33.96275410375853 ns
;;; a=b+c 288.6556301231446 ns -> 100.12228382735458 ns
;;; a=b%2Fc 323.94066534536466 ns -> 129.9906610438482 ns
;;; a=b&c 470.7898489852522 ns -> 209.38083310542123 ns
;;; a=&b=c 561.700251137112 ns -> 204.26021604451992 ns
;;; a&b=c 481.1326203353287 ns -> 198.07185181140767 ns
;;; = 252.35371110995692 ns -> 72.2746499966358 ns
;;; a= 263.9456664837952 ns -> 84.74253809189693 ns
;;; =b 262.19896201858967 ns -> 85.57113803290105 ns
;;; a=1&b=2&c=3&d=4 1.0210454463867895 µs -> 509.19558470486334 ns
;;; a=1&b=2&c=3&d=4&x=1&y=2&z=3&u=4 2.0973807618522025 µs -> 1.166763297769531 µs
;;; aaaaaaaaaaaaa=111111111111111&bbbbbbbbbbbbbbb=2222222222222
;;; 840.1896265647041 ns -> 471.0025293913937 ns
@weavejester this one is also ready for review
Let's see if we can't factor out the tokenizer a little. What if we created a reducible:
(defn tokenized [s delim]
(reify clojure.lang.IReduceInit
(reduce [_ f init]
(let [tokenizer (StringTokenizer. s delim)]
(loop [result init]
(if (.hasMoreTokens tokenizer)
(recur (f result (.nextToken tokenizer)))
result))))))
Then we can just replace (str/split encoded #"&")
with (tokenized encoded "&")
in the original code.
This would only address the outer tokenizer loop, though. Pros:
reduce
reduced?
every iterationI don't find tokenizers and interop too noisy today. If the goal is enhancing readability, there are a few alternatives:
(defn form-decode
"Decode the supplied www-form-urlencoded string using the specified encoding,
or UTF-8 by default. If the encoded value is a string, a string is returned.
If the encoded value is a map of parameters, a map is returned."
([encoded]
(form-decode encoded "UTF-8"))
([^String encoded encoding]
(if-not (.contains encoded "=")
(form-decode-str encoded encoding)
(reduce
(fn [m param]
(let [param-tokenizer (StringTokenizer. param "=")
k (if (.startsWith param "=")
""
(if (.hasMoreTokens param-tokenizer)
(.nextToken param-tokenizer)
""))
v (if (.hasMoreTokens param-tokenizer) (.nextToken param-tokenizer) "")
k (form-decode-str k encoding)
v (form-decode-str v encoding)]
(if (and k v)
(assoc-conj m k v)
m)))
{}
(tokenized encoded #"&")))))
(definline tokenizer
[^String s ^String token]
`(StringTokenizer. ~s ~token))
(definline has-more?
[^StringTokenizer st]
`(.hasMoreTokens ~st))
(defn next-token
{:inline
(fn
([st] `(.nextToken ~st))
([st nf] `(if (has-more? ~st) (.nextToken ~st) ~nf)))
:inline-arities #{1 2}}
([^StringTokenizer st]
(.nextToken st))
([^StringTokenizer st nf]
(if (has-more? st) (.nextToken st) nf)))
(defn form-decode
"Decode the supplied www-form-urlencoded string using the specified encoding,
or UTF-8 by default. If the encoded value is a string, a string is returned.
If the encoded value is a map of parameters, a map is returned."
([encoded]
(form-decode encoded "UTF-8"))
([^String encoded encoding]
(if-not (.contains encoded "=")
(form-decode-str encoded encoding)
(let [tokenizer (tokenizer encoded "&")]
(loop [m {}]
(if (has-more? tokenizer)
(let [^String param (next-token tokenizer)
^StringTokenizer param-tokenizer (tokenizer param "=")
k (if (.startsWith param "=")
""
(next-token param-tokenizer ""))
k (form-decode-str k encoding)
v (form-decode-str (next-token param-tokenizer "") encoding)]
(recur (if (and k v) (assoc-conj m k v) m)))
m))))))
;;; Alternative
(defn form-decode
"Decode the supplied www-form-urlencoded string using the specified encoding,
or UTF-8 by default. If the encoded value is a string, a string is returned.
If the encoded value is a map of parameters, a map is returned."
([encoded]
(form-decode encoded "UTF-8"))
([^String encoded encoding]
(if-not (.contains encoded "=")
(form-decode-str encoded encoding)
(let [tokenizer (tokenizer encoded "&")]
(loop [m {}]
(if-let [^String param (next-token tokenizer nil)]
(let [^StringTokenizer param-tokenizer (tokenizer param "=")
k (if (.startsWith param "=")
""
(next-token param-tokenizer ""))
k (form-decode-str k encoding)
v (form-decode-str (next-token param-tokenizer "") encoding)]
(recur (if (and k v) (assoc-conj m k v) m)))
m))))))
If you find a better name for param-tokenizer
binding the key could be compressed to a line under 80 characters, too.
(defn form-decode
"Decode the supplied www-form-urlencoded string using the specified encoding,
or UTF-8 by default. If the encoded value is a string, a string is returned.
If the encoded value is a map of parameters, a map is returned."
([encoded]
(form-decode encoded "UTF-8"))
([^String encoded encoding]
(if-not (.contains encoded "=")
(form-decode-str encoded encoding)
(reduce
(fn [m param]
(let [[k v] (str/split param #"=" 2)
k (form-decode-str k encoding)
v (form-decode-str (or v "") encoding)]
(if (and k v)
(assoc-conj m k v)
m)))
{}
(tokenized encoded "&")))))
NOTE:
Performance differences for reified reducible tokenizer: (solution 1)
;;; Case Original PR Reducible + tokenizer Reducible + regex
;;; foo 16.165265103567577 ns -> 18.071835515056534 ns -> 17.320719336705146 ns -> 17.604696821988536 ns
;;; a=b 281.7666082195549 ns -> 85.34618203356219 ns -> 100.27655876859055 ns -> 225.0079659412358 ns
;;; a=b&c=d 520.3155533101125 ns -> 192.88535391228297 ns -> 231.03698842087675 ns -> 492.6401809038521 ns
;;; foo+bar 28.89783207896534 ns -> 33.96275410375853 ns -> 29.902639183635433 ns -> 32.411155123899185 ns
;;; a=b+c 288.6556301231446 ns -> 100.12228382735458 ns -> 113.89209010375781 ns -> 223.6006611409361 ns
;;; a=b%2Fc 323.94066534536466 ns -> 129.9906610438482 ns -> 136.2942337242038 ns -> 292.5329924798438 ns
;;; a=b&c 470.7898489852522 ns -> 209.38083310542123 ns -> 210.61152438110452 ns -> 391.77510361071813 ns
;;; a=&b=c 561.700251137112 ns -> 204.26021604451992 ns -> 224.44069615994098 ns -> 435.42927531090487 ns
;;; a&b=c 481.1326203353287 ns -> 198.07185181140767 ns -> 220.36356595037435 ns -> 394.3551788952452 ns
;;; = 252.35371110995692 ns -> 72.2746499966358 ns -> 83.88891707876553 ns -> 195.03739058045346 ns
;;; a= 263.9456664837952 ns -> 84.74253809189693 ns -> 98.03556783118877 ns -> 212.27128534369646 ns
;;; =b 262.19896201858967 ns -> 85.57113803290105 ns -> 98.35710802210014 ns -> 197.75605310148035 ns
;;; a=1&b=2&c=3&d=4 1.0210454463867895 µs -> 509.19558470486334 ns -> 529.099610694873 ns -> 950.2181940488202 ns
;;; a=1&b=2&c=3&d=4&x=1&y=2&z=3&u=4 2.0973807618522025 µs -> 1.166763297769531 µs -> 1.1895777987654739 µs -> 2.1154829021936083 µs
;;; aaaaaaaaaaaaa=111111111111111&bbbbbbbbbbbbbbb=2222222222222
;;; 840.1896265647041 ns -> 471.0025293913937 ns -> 481.37886251787927 ns -> 726.6601651755639 ns
We can consistently see a slight overhead, so in terms of performance there's no great impact. I don't believe the alternative implementation improves readability, though.
What do you think?
Edit: Added results for 3 for completeness, as predicted, almost no improvement
Regarding the inner tokenizer, perhaps we just use indexOf
instead:
(defn- split-key-value-pair [^String s delim]
(let [idx (.indexOf s delim)]
(if (pos? idx)
(clojure.lang.MapEntry. (.substring s 0 idx) (.substring s (inc idx))
(clojure.lang.MapEntry. s s))))
Though I believe we need to force an integer argument:
(split-key-value-pair param #=(int \=))
I think we have a winner. Besides the small overhead of tokenized, the split it actually faster, especially for longer strings:
(let [ch (int \=)
_0 (int 0)]
(defn- split-key-value-pair [^String s]
(let [idx (.indexOf s ch)
jdx (unchecked-inc-int idx)]
(if (neg? idx)
(MapEntry. s "")
(if (zero? idx)
(MapEntry. "" (.substring s jdx))
(MapEntry. (.substring s _0 idx) (.substring s jdx)))))))
(defn form-decode
"Decode the supplied www-form-urlencoded string using the specified encoding,
or UTF-8 by default. If the encoded value is a string, a string is returned.
If the encoded value is a map of parameters, a map is returned."
([encoded]
(form-decode encoded "UTF-8"))
([^String encoded encoding]
(if-not (.contains encoded "=")
(form-decode-str encoded encoding)
(reduce
(fn [m param]
(let [kv (split-key-value-pair param)
k (form-decode-str (key kv) encoding)
v (form-decode-str (val kv) encoding)]
(if (and k v)
(assoc-conj m k v)
m)))
{}
(tokenized encoded "&")))))
That looks fine - though I'm confused by the use of _0
over 0
, and ch
over #=(int \=)
.
Also, how much performance do we actually gain using unchecked-inc-int
?
I should have benchmarked it. I don't see any noticeable speedups, and if anything the simpler version is faster. Also, I see no effect to using unchecked-inc. Let's go for the more readable option
(defn- split-key-value-pair [^String s]
(let [i (.indexOf s #=(int \=))
j (inc i)]
(if (pos? i)
(MapEntry. (.substring s 0 i) (.substring s j))
(if (zero? i)
(MapEntry. "" (.substring s j))
(MapEntry. s "")))))
What about using cond
and inline the j
, as I don't think it factoring it out helps clarity:
(defn- split-key-value-pair [^String s]
(let [i (.indexOf s #=(int \=))]
(cond
(pos? i) (MapEntry. (.substring s 0 i) (.substring s (inc i)))
(zero? i) (MapEntry. "" (.substring s (inc i)))
:else (MapEntry. s "")))))
I don't mind, just biased against cond
for not having a default condition like case
does. Pushed the change
@weavejester anything else left to do? Are the PR and commit messages alright?
@weavejester done requested changes here and in ring-codec
It looks like the commit message hasn't been altered yet?
@weavejester I apologize, I missed it. Done as well.
@weavejester done requested change and changed PR title as well to align with the commit message
Improves performance by avoiding regular expressions and allocation of immutable collections where possible