lilactown / pyramid

A library for storing and querying graph data in Clojure
Eclipse Public License 2.0
229 stars · 10 forks

Prevent normalization of records #18

Closed · eneroth closed this 2 years ago

eneroth commented 2 years ago

Records behave both like maps and collections, but should probably be treated as opaque by Pyramid.

This prevents,

(pyramid/db [{:my/id  "hello"
              :a-uri  (lambdaisland.uri/uri "http://www.example.com")}])

from turning into,

{:my/id {"hello" {:my/id "hello",
                  :a-uri ([:fragment nil]
                          [:query nil]
                          [:path nil]
                          [:port nil]
                          [:host "www.example.com"]
                          [:password nil]
                          [:user nil]
                          [:scheme "http"])}}}

and instead returns,

{:my/id {"hello" {:my/id "hello"
                  :a-uri #lambdaisland/uri "http://www.example.com"}}}
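The reason the tuples appear in the first place: records implement Clojure's map and seq interfaces, so a generic normalization walk sees a URI record as just another collection of map entries. A minimal sketch, using a hypothetical Uri record standing in for lambdaisland.uri's real type:

```clojure
;; a stand-in for lambdaisland.uri's URI type (hypothetical, for illustration)
(defrecord Uri [scheme host])

;; records answer true to map? and pour out map entries, which is
;; exactly what a generic walk over the data will traverse
(map? (->Uri "http" "www.example.com"))
;; => true
(into {} (->Uri "http" "www.example.com"))
;; => {:scheme "http", :host "www.example.com"}
```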

This is in reference to https://github.com/lilactown/pyramid/issues/17.

phronmophobic commented 2 years ago

From a purely conceptual perspective, treating records as anything other than maps seems unidiomatic. If there’s an interface that describes the data, that might be reasonable, but an interface that is very specific to pyramid that is just making data less open and extensible seems like the wrong direction.

Maybe listing some specific use cases that might want this would be helpful.

eneroth commented 2 years ago

From a purely conceptual perspective, treating records as anything other than maps seems unidiomatic.

That would depend on what you mean by "unidiomatic". From Pyramid's perspective, @lilactown would ultimately have to decide what's idiomatic or not.

From Clojure's perspective, I'd argue that it's idiomatic. Records support the interface of maps because we don't want to write a new API every time we create a record. This doesn't mean that they are semantically interchangeable with maps in every circumstance.

Can we safely assume that no record we encounter is modelling some type or unit, with assumptions about how its internals look?

Here's an example. Let's say I'm making a CRDT, and I'm modelling a range:

(defrecord Range [type begin end])

(Range. :bold {:id 0 :target/id 100} {:id 1 :target/id 102})

When normalizing this, Pyramid would effectively convert the record into a new record that looks like this:

(Range. :bold [:id 0] [:id 1])

Is this a good idea? I'm not sure; it depends on the expectations. One thing is for sure: the Range no longer complies with the conventions I had in mind for it.

Regardless, the two scenarios are:

  1. Pyramid always normalizes records. I have no way of controlling this. If there's a problem, I'll have to fork Pyramid or avoid using it.
  2. Pyramid doesn't normalize the record. Now I have full control, and there's no need to fork/avoid Pyramid. Specifically, if I want Pyramid to normalize the record, I can just convert it to a map. If I don't want it normalized, I just leave it as it is.

In scenario 2, I have both options available to me. In scenario 1, I have no power over what happens.

There's potentially option 3 as well, as described by @lilactown here.

lilactown commented 2 years ago

@eneroth just to be clear, there is already a way to avoid normalization of any data: you can change the "identify" function that the db is constructed with.

Example:

(require '[pyramid.core :as p])
(require '[pyramid.ident :as ident])

(def db (p/db [] (ident/by-keys :person/id :product/id)))

(defrecord Range [type begin end])

(def db1 (p/add db {:person/id 1 :person/friend {:person/id 2}}))
db1
;; => #:person{:id {2 #:person{:id 2}, 1 #:person{:id 1, :friend [:person/id 2]}}}

(def db2 (p/add db1 {:current-selection (->Range :select {:id 0 :target/id 100} {:id 1 :target/id 102})}))
db2
;; => {:person/id {2 #:person{:id 2}, 1 #:person{:id 1, :friend [:person/id 2]}},
;;     :current-selection
;;     {:type :select, :begin {:id 0, :target/id 100}, :end {:id 1, :target/id 102}}}

You can see here how this will only normalize information about :person/id and :product/id, and ignore :id and :target/id. That may help in your use case, I'm not sure.

With the latest change in master, I would expect that records should have their type preserved (sort of by accident) due to the refactor I did to use cascade.hike/walk for normalization.

I'll think more about opting out on a per-type or per-object basis; checking satisfies? on every node in a tree may be expensive...

phronmophobic commented 2 years ago

When normalizing this, Pyramid would effectively convert the record into a new record that looks like this:

Nothing is being "converted". Whether you receive (Range. :bold {:id 0 :target/id 100} {:id 1 :target/id 102}) or (Range. :bold [:id 0] [:id 1]) depends on how the data is queried, not on whether the entity was stored in a normalized form.

Is this a good idea? I'm not sure, it depends on the expectations. One thing is for sure: the Range is no longer complying to the conventions I had in mind for them.

This exact same argument applies to maps. {:type :bold :begin {:id 0 :target/id 100} :end {:id 1 :target/id 102}} and {:type :bold :begin [:id 0] :end [:id 1]} are also not interchangeable.

The main difference between normalizing and not normalizing is that if you don't normalize, then 1) updating the entity with :id 0 would not update all instances of entities with :id 0, and 2) you can't run queries on the entities referenced by :begin and :end without knowing which Range they belong to.
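The first benefit can be sketched with plain maps, no Pyramid required (the db shape here is hypothetical, just to show the mechanics):

```clojure
;; a hand-normalized db: each entity stored once, referenced by ident tuples
(def db
  {:id     {0 {:id 0 :target/id 100}
            1 {:id 1 :target/id 102}}
   :ranges [{:type :bold   :begin [:id 0] :end [:id 1]}
            {:type :italic :begin [:id 0] :end [:id 0]}]})

;; one update to the entity...
(def db' (assoc-in db [:id 0 :target/id] 999))

;; ...is seen through every reference when a query resolves the ident
(defn resolve-ref [db [k v]] (get-in db [k v]))

(resolve-ref db' (-> db' :ranges first :begin))
;; => {:id 0, :target/id 999}
(resolve-ref db' (-> db' :ranges second :end))
;; => {:id 0, :target/id 999}
```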

I agree that figuring out how to have queries that allow shallow denormalization or deep denormalization is a tough question. I don't see how normalization depends on records when all the same arguments also apply to maps.

Regardless, the two scenarios are:

Pyramid always normalizes records. I have no way of controlling this. If there's a problem, I'll have to fork Pyramid or avoid using it.
Pyramid doesn't normalize the record. Now I have full control, and there's no need to fork/avoid Pyramid. Specifically, if I want Pyramid to normalize the record, I can just convert it to a map. If I don't want it normalized, I just leave it as it is.

There are other options in which whether a fully denormalized result is returned has nothing to do with whether a record was added to the db.

Just to show how other dbs address this problem, asami has a nested? flag when querying entities and replacement annotations for transactions. Datomic uses a schema to make this decision.

eneroth commented 2 years ago

Nothing is being "converted". Whether or not you receive (Range. :bold {:id 0 :target/id 100} {:id 1 :target/id 102}) or (Range. :bold [:id 0] [:id 1]) doesn't depend on how the data is stored, but how the data is queried. Which result you get depends on your query and not on whether the entity was stored in a normalized form.

I don't think that is correct: normalization happens when you're constructing or adding to the DB, and denormalization when you query it.

This exact same argument applies to maps.

True, but only relevant if the basic assumption is that maps and records are equal in all aspects and use cases. I understand that your perspective is that they are, but please understand that my perspective is that they are not.

(def entity
  (Range. :bold {:id 0 :target/id 100} {:id 1 :target/id 102}))

(= entity (into {} entity))
;; => false

99.99997% of the time we use maps (for good reason), and the rest of the time there's some use case where defrecord or deftype happens to make sense. My observation is that in practice, this happens when we want to make something "object-like."

I think the main point of view here boils down to a question of record duck-typing: if it has the interface of a map, should it in all cases be treated like a map?

Consider the original example where this started: I switched from using goog.URI and java.net.URI to https://github.com/lambdaisland/uri. To me it feels like an implementation detail that one uses records, and the others use coincidentally more opaque data types. The intention is the same: model what's effectively a value. That there should be a (potential) difference in the query I make to get it out of the DB seems odd from this perspective.

And yes, not being normalized, I couldn't expect the power that comes from normalization to apply here. Do I want all URIs to change database-wide because I updated one of them? To me that's equivalent to asking whether I want all instances of the number 9 to update when I update one instance. It comes down to what you consider to have boundaries and integrity. My contention is that,

  1. You can't know which interpretation is correct with records (is it intended to be structure-like or object-like?), but
  2. Records are rare and used in special circumstances, such as modelling the URI above; most of the time we're going to deal with maps, and therefore
  3. Defaulting to treating them as maps in all circumstances is not a good idea.

phronmophobic commented 2 years ago

I think the main point of view here boils down to a question of record duck-typing: if it has the interface of a map, should it in all cases be treated like a map?

Records aren't just map-like. Records are maps.

(defrecord Foo [])
(map? (->Foo)) ;; true

This isn't even duck-typing. Duck typing means that you can implicitly or incidentally conform to an interface if you implement methods with the same name in a shared, global namespace. Records explicitly and intentionally implement the map interface.

True, but only relevant if the basic assumption is that maps and records are equal in all aspects and use cases.

I do not assume this, because it is not true. The different map implementations (e.g. hash, array, record) do differ. Some differences include performance and, in some cases, behavior. However, they all adhere to the map interface.

My observation is that in practice, this happens when we want to make something "object-like."

Records are "object-like" in the sense that they directly support polymorphism, but they are different from objects in most OO languages since they are immutable. It's true that a Record can't always be replaced with a map of the same keys and values, but that's also true for maps! Since maps can implement protocols via metadata, a map also can't be substituted for another map with different metadata in all cases.
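The metadata point can be made concrete. Since Clojure 1.10, protocols declared with :extend-via-metadata can be satisfied by a plain map's metadata, so two maps that compare equal can still behave differently (a small sketch; the Describable protocol is made up for illustration):

```clojure
;; a protocol that can be implemented via metadata (Clojure 1.10+)
(defprotocol Describable
  :extend-via-metadata true
  (describe [this]))

;; two maps with equal keys and values, but only one carries the protocol
(def plain  {:a 1})
(def tagged (with-meta {:a 1} {`describe (fn [_] "I also satisfy Describable")}))

(= plain tagged)   ;; => true (metadata does not affect equality)
(describe tagged)  ;; => "I also satisfy Describable"
;; (describe plain) would throw: no implementation of method
```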

Generally, Clojure is written in terms of abstractions.

  1. Do you think Pyramid should program against abstractions?
  2. If not, which concretions should it use and why?
  3. If you do think Pyramid should program against abstractions, do you think it should program against the map interface? I think it should.
  4. If not against the map interface, then which interface?

Do I want all URIs to change database-wide, because I updated one of them?

Why would that happen? If they refer to the same URI entity, then yes, they should change. If they refer to different URI entities, then only references to the updated entity should change.


You can't know which interpretation is correct with records (is it intended to be structure-like or object-like?), but

Why not? There are multiple options for achieving this.

Pyramid already has a convention for determining identity: by default, it assumes your entities are identified by a keyword whose name is "id", e.g. :id, :person/id, :my.corp.product/id, etc. For any record, you could simply assoc a keyword whose name is "id".
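This works because assoc on a record preserves its type (keys outside the declared fields go into the record's extension map), so a record can carry an id key without ceasing to be a record. A small sketch with the Range record from earlier:

```clojure
(defrecord Range [type begin end])

;; assoc'ing an undeclared key keeps the record's type intact
(def r (assoc (->Range :bold 0 1) :range/id "r1"))

(instance? Range r)  ;; => true (still a Range, not a plain map)
(:range/id r)        ;; => "r1"
```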


Having said that, I think there might be good reasons for not treating records the same as maps. Since records are relatively rare and one of the goals for Pyramid is read performance, it's totally reasonable to treat maps and records differently to meet those goals. It also might make sense to treat records opaquely just to simplify the codebase. However, treating maps differently from records because "records aren't maps" doesn't make sense because records are, in fact, maps. I think it's also totally plausible that it might actually simplify the code base and provide more leverage to utilize the fact that records are maps.

eneroth commented 2 years ago

Hello!

Sorry for the long bout of silence; I finally got that covid I've heard so much about and have been out of commission for a bit.

In the meantime, there seem to have been some changes on main that have unblocked the main problem we had: lambdaisland/uri being butchered on normalization.

I think we've reached a point where the pros and cons have been clearly spelled out, and I don't think we need to delve any further into this subject.

Thanks for the discussion, and feel free to reopen if I'm wrong.