metosin / malli

High-performance data-driven data specification library for Clojure/Script.
Eclipse Public License 2.0
1.5k stars 212 forks source link

malli.providers/provide takes "forever" for large samples #191

Closed luposlip closed 2 years ago

luposlip commented 4 years ago

Hi there,

I'm using Malli 0.0.1-SNAPSHOT.

I have 15 samples of approximately 200KB each (when saved as EDN to disk). Malli is unable to provide a schema for those - it seems to run forever. It would probably have stopped at some point, but I lost my patience.

I tried to run the provider for just one of those, it actually finished, but it took several minutes on my i7 w/6 physical cores (don't know if you have any parallelization yet).

Would it be possible in some way to speed this up, or at least provide some peek into the progress (via logging, async or similar)?

Looks promising though, will use Malli for smaller stuff for now.

Best regards, Henrik

ikitommi commented 4 years ago

The current impl is naive, running brute force for all registered schemas, throwing lot of exceptions on invalid types. It should be easy to make it many orders of magnitude faster with few simple tricks, including:

1) type based checking, e.g. a avalue of type Long (in JVM) should only be checked against numeric schemas 2) not throwing exceptions

There are other things that should be added to Providers: https://github.com/metosin/malli/issues/74

Most likely, won't have time to fix this any time soon, so contributions would be most welcome.

ikitommi commented 4 years ago

Did a quick test what happens. Here's a quick code that runs order of magntude faster (but is still dead-slow):

(require '[malli.core :as m])
(require '[malli.provider :as mp])
(require '[clojure.set :as set])
(require '[criterium.core :as cc])

(doseq [registry [;; 2700 µs
                  m/default-registry
                  ;;  450 µs
                  m/predicate-registry
                  ;;  230 µs
                  (-> m/predicate-registry
                      set/map-invert
                      set/map-invert
                      (dissoc 'empty?))]]
  (cc/quick-bench (mp/provide [1] {:registry registry})))

e.g. don't test the values against all registry entries, most of them fail with exception ([:map 1]), just test the predicates for values and only once (both int? and 'int? are registered), also, empty? throws on all non-seqs.

ikitommi commented 3 years ago
bsless commented 3 years ago

I think this idea was probably brought up in the past, but generation could be significantly sped up by failing faster. -fail uses ex-info which gets the stack trace and filters it. Much faster to (3-4x) to just throw a raw clojure.lang.ExceptionInfo, like so:

(defn fast-ex-info
  ([msg map]
   (clojure.lang.ExceptionInfo. msg map))
  ([msg map cause]
   (clojure.lang.ExceptionInfo. msg map cause)))

Even faster is to fail without a stack trace:

(ns malli.impl.Exception)

(gen-class
 :name malli.impl.Exception
 :extends java.lang.Exception)

(defn -fillInStackTrace
  ([_])
  ([_ _]))

Then

(require 'malli.impl.Exception)
(clojure.core/compile 'malli.impl.Exception)

(defn very-fast-ex-info
  ([msg map]
   (malli.impl.Exception. msg map)))

Another ~3x speedup These gains only benefit the JVM, but they're very significant. glaring issue: Overriding -fail and losing stack traces. Maybe add some configuration (global?) which selects fail-fast?

ikitommi commented 3 years ago

Thanks for giving a thought on this. I think adding a value class-based dispatch would enable a solution, that is 100-1000x faster, would not throw and would be extendable. In short:

  1. each IntoSchema can describe which values classes it accepts and a function to extract the actual schema from a value without throwing
  2. each registered Schema instance or AST is checked using m/validator
  3. mp/inferrer would be a function of registry -> values -> schema and would calculate 1 & 2 ahead of time.
  4. mp/infer would be a repl-friendly version of that
;; with defaults
(def infer (mp/inferrer))

(infer [1 2 3])
: => :int
bsless commented 3 years ago

That's a way better solution than my idea. My implementation just gives you a relatively easy speed up, while your suggestion requires some design and careful thought. You can still take it as an intermediary solution until a more robust and correct implementation is ready

ikitommi commented 3 years ago

Could you run a sample test with and without the change in exceptions? e.g. using the :perf profile:

(def samples
  [{:id "Lillan"
    :tags #{:artesan :coffee :hotel}
    :address {:street "Ahlmanintie 29"
              :city "Tampere"
              :zip 33100
              :lonlat [61.4858322, 23.7854658]}}
   {:id "Huber",
    :description "Beefy place"
    :tags #{:beef :wine :beer}
    :address {:street "Aleksis Kiven katu 13"
              :city "Tampere"
              :zip 33200
              :lonlat [61.4963599 23.7604916]}}])

(cc/quick-bench
  (mp/provide samples))

reorganized the code a bit, so now there is a mp/provider which caches most of the internals and is 5x faster with that test set. See #439

ikitommi commented 3 years ago

did a draft of the full (100x) fix too: https://github.com/metosin/malli/pull/440

ikitommi commented 3 years ago

Current status, 330x faster with a sample map than the original. Still one 10x to go :)

(def samples
  [{:id "Lillan"
    :tags #{:artesan :coffee :hotel}
    :address {:street "Ahlmanintie 29"
              :city "Tampere"
              :zip 33100
              :lonlat [61.4858322, 23.7854658]}}
   {:id "Huber",
    :description "Beefy place"
    :tags #{:beef :wine :beer}
    :address {:street "Aleksis Kiven katu 13"
              :city "Tampere"
              :zip 33200
              :lonlat [61.4963599 23.7604916]}}])

;; OLD: 126ms
(cc/quick-bench
  (mp/provide samples))

;; 0.6.2: 26ms (5x)
(let [provide (mp/provider)]
  (cc/quick-bench
    (provide samples)))

;; 0.7.0 (unreleased): 380µs (330x)
(let [provide (mp/provider)]
  (cc/quick-bench
    (provide samples)))
ikitommi commented 3 years ago

There is still a check for the inst? in clojure.core that uses satifies?, which is dead slow. Without it, here's the current upcoming perf. Still not good, but a 460x faster than the original:

(defn provider-test2 []
  (let [samples [{:id "Lillan"
                  :tags #{:artesan :coffee :hotel}
                  :address {:street "Ahlmanintie 29"
                            :city "Tampere"
                            :zip 33100
                            :lonlat [61.4858322, 23.7854658]}}
                 {:id "Huber",
                  :description "Beefy place"
                  :tags #{:beef :wine :beer}
                  :address {:street "Aleksis Kiven katu 13"
                            :city "Tampere"
                            :zip 33200
                            :lonlat [61.4963599 23.7604916]}}]]

    ;; 126ms -> 2.5ms
    (p/bench (mp/provide samples))

    ;; 26ms
    ;; 380µs, no exceptions, (330x)
    (let [provide (mp/provider)]
      (p/bench (provide samples)))

    ;; 270µs, no exceptions, no inst? (460x)
    (let [registry (dissoc (m/default-schemas) inst? 'inst?)
          provide (mp/provider {:registry registry})]
      (p/bench (provide samples)))))
ikitommi commented 2 years ago

[metosin/malli "0.7.0-20211031.202317-3"] has the perf fixes in.