tonsky / datascript

Immutable database and Datalog query engine for Clojure, ClojureScript and JS
Eclipse Public License 1.0

Allow non-numbers as entity ids #200

Open · EugenDueck opened this issue 7 years ago

EugenDueck commented 7 years ago

This has come up in a couple of issues before, with developers arguing both for and against it. To wit: https://github.com/tonsky/datascript/pull/8#issue-32457706 https://github.com/tonsky/datascript/issues/56#issuecomment-73641241

Here's my case for a dynamic entity id type, rather than enforcing number?, as is done in the entid function.

I tried to integrate a high-churn real-time data stream that my process receives and uses to update an in-memory map. This map has something like the following, pretty common structure:

{ "key1" { "attrA" "foo", "attrB" "bar", ... }
  "key2" { "attrB" "foo", "attrX" "bar", ... } }

It was easy enough to hook this map up to Datascript (spasibo Tonsky, you rock! Datomic, in contrast, doesn't have published protocols...) by creating a MapToDatascript adapter that implements (partially, for the time being) IDB, ISearch, and IIndexAccess, and then querying it like so, joining it with a real datascript DB:

(ds/q '[:find (pull $ ?e [*]) (pull $m ?attr-x [*])
        :in $ $m
        :where [$ ?e :attr-x ?attr-x]]
      @ds-db my-map-adapter)

Except, I had to tweak datascript's entid function - in fact simplifying it - in order to allow string eids, or just any type, for that matter.

Now I could of course conjure up a number out of thin air and attach it to each entity to serve as the entity id, but that is another layer of mapping that has to run on every change to my data, of which there are thousands per second. What's worse, I'd then have another id to keep consistent in the face of a machine going down and a failover machine taking over. That complicates things quite a bit.
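To make that cost concrete, here is a minimal sketch of the extra mapping layer a synthetic numeric id would force (all names hypothetical; not thread-safe, illustration only):

```clojure
;; Hypothetical sketch: every external string key needs a synthetic
;; numeric eid, allocated and kept in sync on every update.
(def key->eid (atom {}))
(def next-eid (atom 0))

(defn eid-for
  "Return the synthetic numeric eid for an external key, allocating
  a fresh one the first time the key is seen."
  [k]
  (or (@key->eid k)
      (let [eid (swap! next-eid inc)]
        (swap! key->eid assoc k eid)
        eid)))
```

And this mapping would additionally have to survive failover, which is the part that really complicates things.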

So why not keep entity ids as flexible - at least for non-datascript data sources - as the Datalog queries themselves, which don't care whether your patterns are shaped [?e ?a ?v] or something completely different?

That way, integrating DataScript's Datalog engine with arbitrary data sources will be a lot easier and, depending on the case, more performant.

Is this a route you'd like to go with Datascript? Or is the datascript data model something you want other sources to adhere to? In the former case, I'd take a closer look at the sources and wrap this up in a proper pull request, after checking and testing all the ramifications.

tonsky commented 7 years ago

Integer ids were needed because they’re much more performant than strings. In fact, it’s not so important that they are integers; what matters is that every id is comparable to every other id in a database.

For non-datascript data sources, there should be no limitation on what an id can look like. E.g. in queries, if you add something different from a DataScript DB, there should be no entity lookups, no asserts for integers, etc. If you’re seeing otherwise, please report it, and I’ll probably fix it.
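For reference, the query engine does already handle plain collections of tuples as data sources, ids of any type included; a minimal sketch:

```clojure
(require '[datascript.core :as ds])

;; The query engine itself doesn't care what an entity id looks
;; like: a plain collection of tuples can be queried, string ids
;; and all.
(ds/q '[:find ?e ?v
        :where [?e :attr ?v]]
      [["a" :attr 1] ["b" :attr 2]])
;; => #{["a" 1] ["b" 2]}
```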

EugenDueck commented 7 years ago

Here's one such case, using version 0.15.5.

First a query that works:

datascript-test.core=> (ds/q '[:find ?e ?a ?v :where [?e ?a ?v]] ds-source)
#{["a" "x" 1234]}

And now a similar query that does not work, involving a pull - where the problem seems to arise:

datascript-test.core=> (ds/q '[:find (pull ?e [*]) :where [?e]] ds-source)

ExceptionInfo Expected number or lookup ref for entity id, got "a"  clojure.core/ex-info (core.clj:4617)
datascript-test.core=> (pprint *e)
#error {
 :cause "Expected number or lookup ref for entity id, got \"a\""
 :data {:error :entity-id/syntax, :entity-id "a"}
 :via
 [{:type clojure.lang.ExceptionInfo
   :message "Expected number or lookup ref for entity id, got \"a\""
   :data {:error :entity-id/syntax, :entity-id "a"}
   :at [clojure.core$ex_info invokeStatic "core.clj" 4617]}]
 :trace
 [[clojure.core$ex_info invokeStatic "core.clj" 4617]
  [clojure.core$ex_info invoke "core.clj" 4617]
  [datascript.db$entid invokeStatic "db.cljc" 748]
  [datascript.db$entid invoke "db.cljc" 727]
  [datascript.db$entid_strict invokeStatic "db.cljc" 754]
  [datascript.db$entid_strict invoke "db.cljc" 753]
  [datascript.pull_api$pull_spec$fn__1307 invoke "pull_api.cljc" 268]
  [clojure.core$map$fn__4781$fn__4782 invoke "core.clj" 2633]
  [clojure.lang.PersistentVector reduce "PersistentVector.java" 341]
  [clojure.core$transduce invokeStatic "core.clj" 6600]
  [clojure.core$into invokeStatic "core.clj" 6614]
  [clojure.core$into invoke "core.clj" 6604]
  [datascript.pull_api$pull_spec invokeStatic "pull_api.cljc" 268]
  [datascript.pull_api$pull_spec invoke "pull_api.cljc" 266]
  [datascript.query$pull$iter__3075__3079$fn__3080$fn__3086 invoke "query.cljc" 740]
  [clojure.core$map$fn__4789 invoke "core.clj" 2651]
  [clojure.lang.LazySeq sval "LazySeq.java" 40]
  [clojure.lang.LazySeq seq "LazySeq.java" 49]
  [clojure.lang.RT seq "RT.java" 521]
  [clojure.core$seq__4357 invokeStatic "core.clj" 137]
  [clojure.core.protocols$seq_reduce invokeStatic "protocols.clj" 24]
  [clojure.core.protocols$fn__6738 invokeStatic "protocols.clj" 75]
  [clojure.core.protocols$fn__6738 invoke "protocols.clj" 75]
  [clojure.core.protocols$fn__6684$G__6679__6697 invoke "protocols.clj" 13]
  [clojure.core$reduce invokeStatic "core.clj" 6545]
  [clojure.core$into invokeStatic "core.clj" 6610]
  [clojure.core$mapv invokeStatic "core.clj" 6618]
  [clojure.core$mapv invoke "core.clj" 6618]
  [datascript.query$pull$iter__3075__3079$fn__3080 invoke "query.cljc" 737]
  [clojure.lang.LazySeq sval "LazySeq.java" 40]
  [clojure.lang.LazySeq seq "LazySeq.java" 49]
  [clojure.lang.RT seq "RT.java" 521]
  ...

Here's the code preceding the queries above. I removed everything not involved in this particular case, including things actually necessary for most other queries, like indexed access.

(require '[datascript.core :as ds])
(require '[datascript.db :as dsdb])

(deftype AnyOldMapToDatascriptSource [data]
        dsdb/IDB

        dsdb/ISearch
        (-search [db [p-e p-a p-v :as pattern]]
          (->> data
               ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
               ;; filter entities
               ((fn e-filter [data]
                  (if (nil? p-e)
                    data
                    (list [p-e (data p-e)]))))
               ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
               ;; filter attributes
               ((fn [data] (if (nil? p-a)
                             data
                             (remove nil?
                                     (map (fn attr-filter-mapper [[k a-v-map]]
                                            (if-let [v (a-v-map p-a)]
                                              (if (or (nil? p-v) (= p-v v))
                                                [k {p-a v}])))
                                          data)))))
               ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
               ;; format for datascript
               (mapcat (fn mapcatter [[e m]] (map (fn datomizer [[a v]] (dsdb/datom e a v 1 true)) m)))))

        dsdb/IIndexAccess
      )

(def data { "a" { "x" 1234 }})
(def ds-source (AnyOldMapToDatascriptSource. data))

tonsky commented 7 years ago

So pull isn’t working? Heh, I’m not sure whether it should or not... Entity, pull and similar APIs were not meant for pluggable storages. The query engine was; it should be able to filter arbitrary collections.

EugenDueck commented 7 years ago

I haven't looked at everything involved here yet, but from an outsider's perspective I'd say (repeating myself): why arbitrarily restrict the type? I haven't checked whether it breaks anything, but simply not raising an exception in the entid function fixes this case for me. If my change made pull and entity slower, I'd of course understand entid wanting to enforce number?.
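The relaxation being described could look roughly like this (a hypothetical sketch, not datascript's actual entid; lookup-ref resolution is elided):

```clojure
;; Hypothetical relaxed entid: numbers behave as before, vectors are
;; still reserved for [attr value] lookup refs (resolution elided),
;; and any other value passes through as an opaque entity id instead
;; of triggering "Expected number or lookup ref for entity id".
(defn entid-relaxed [db eid]
  (if (sequential? eid)
    (throw (ex-info "lookup-ref resolution elided in this sketch"
                    {:eid eid}))
    eid))
```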

wbrown-lg commented 7 years ago

I can attest to the query engine working well when the IDB and ISearch protocols are implemented. However, in that case it does one pattern at a time, which can be extremely slow and inefficient. I ended up writing my own query engine using some of the datascript.query and datascript.parser functions. But maybe the right way would be to have another layer in between. Hmm.

Allowing entity ids other than numbers would make my use case much easier; as things stand, I wrote a binary identity-to-long translator.

(defn insert-identity
  "Derive a content-based :db/id by hashing the map (minus :db/id)
  into a signed long, then flipping negative values so the result
  stays in the non-negative 63-bit range."
  [m]
  (let [signed-identity (hash->long (hash-object (murmur3-128 1234) nippy-funnel
                                                 (dissoc m :db/id)))
        unsigned-identity (if (bit-test signed-identity 63)
                            (bit-not signed-identity)
                            signed-identity)]
    (assoc m :db/id unsigned-identity)))

This works, but I have nightmares about collisions with my 63-bit unsigned long EIDs, and I haven't worked out the CLJ-to-CLJS story yet. When I transit-encode Longs, I end up getting goog.math.Longs. Any possibility that eids can be allowed to be goog.math.Longs? That may be a worthy project.

wbrown commented 7 years ago

I was thinking about this some more while figuring out how to write a general-purpose layer where we provide a schema of attributes to e-a-v-tx mappings. Eids being integer-only is a specific optimization for BTSets. Once a pattern returns tuples, anything above this shouldn't care what type the eids are. I know that the hash-join and collect code don't care.

So, if we are creating Datoms, or persisting to BTSets, then certainly, we should enforce the numeric type restriction. But for other data sources, I don't think we should.

If we relax the eid constraint this way, it becomes much easier to write IDB and ISearch adapters against various backends. I would really like the ability to use pull syntax against these adapters.

filipesilva commented 3 years ago

@tonsky you mentioned:

Integer ids were needed because they’re much more performant than strings. In fact, it’s not so important that they are integers; what matters is that every id is comparable to every other id in a database.

But on DataScript 1.0.0+ it is possible to compare attr values across types. Do you think, now that https://github.com/tonsky/datascript/pull/388 is in, we could use it for the db id comparator as well, allowing other types?

tonsky commented 3 years ago

There might be problems with API compatibility. Right now integers, strings and keywords all have special meaning during a transaction. If we allowed them as ids, this would need to change, which is hard to do in a backward-compatible way.

Honestly, I don’t see much of a problem with integer ids: treat them as internal db ids, like rowid in SQLite, and use a custom :db.unique/identity attribute for the domain id. Would that work?
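The suggested pattern, sketched with DataScript's public API (attribute names illustrative):

```clojure
(require '[datascript.core :as ds])

;; Internal integer eids stay hidden; a unique domain-id attribute
;; carries the external key, and lookup refs address entities by it.
(def schema {:domain/id {:db/unique :db.unique/identity}})
(def conn (ds/create-conn schema))

(ds/transact! conn [{:domain/id "key1" :attrA "foo"}])

;; Address the entity by its domain id via a lookup ref:
(:attrA (ds/entity @conn [:domain/id "key1"]))
;; => "foo"
```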

filipesilva commented 3 years ago

That's what we (Roam Research) are doing right now in our codebase, and it works well. I just came across this issue again while thinking about performance improvements and was wondering about the comparability.

tonsky commented 3 years ago

I think any extra conditionals in such a hot place (the index does LOTS of comparisons) would be harmful. But I see the connection you drew with #388; ideally, everything could go through a similar procedure, though performance-wise it might not be optimal.