open-reaction-database / ord-schema

Schema for the Open Reaction Database
https://open-reaction-database.org
Apache License 2.0
92 stars 26 forks source link

ORM for proto <-> PostgreSQL #643

Closed skearnes closed 2 years ago

skearnes commented 2 years ago
codecov[bot] commented 2 years ago

Codecov Report

Merging #643 (3b1d5e0) into main (5972a56) will increase coverage by 8.50%. The diff coverage is 99.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #643      +/-   ##
==========================================
+ Coverage   64.60%   73.11%   +8.50%     
==========================================
  Files          16       17       +1     
  Lines        1859     2462     +603     
  Branches      497      655     +158     
==========================================
+ Hits         1201     1800     +599     
- Misses        556      557       +1     
- Partials      102      105       +3     
Impacted Files Coverage Δ
ord_schema/orm/mappers.py 99.33% <99.33%> (ø)
ipendlet commented 2 years ago

Schema is shown in detail here:

ORD_PostgresDB

skearnes commented 2 years ago

Reviewed the parts of this commit including the ORM. Was able to inspect the DB using:

name = "testdb"
port = 5678
path = "/tmp/my_test_db"
postgres = Postgresql(name=name, port=port, base_dir=path)
engine = create_engine(postgres.url())
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(from_proto(testprotodb))

This was accessible through a database manage using the username postgres.

The schema constructed from protobuffers seems over normalized in the current version. The mapping between different entities is not organized in a way that is intuitive for searching the database. Pressure for instance is broken into 7 different tables referencing back to reactionID. This is common for many of the instances of measurements.

I think that this could be improved through careful construction of relationships within the ORM mappers.

None of these comments affect the functionality which works seemingly in its entirety.

There are probably better words for this, but the issue is that the proto is organized from top to bottom and the DB is organized from bottom to top. The proto definition allows for abstractions that are reused in multiple message types (like Amount), whereas the relational layout requires different types for each context in order to get the relationships to work (thus the joint table inheritance for some measure of polymorphism).

I'm open to suggestions about how to make better use of abstractions in the ORM, but this seemed like the best way to be faithful to the proto abstractions at the cost of extra complexity in the ORM. If I can find a good way to use e.g. with_polymorphic, then even if the underlying DB is more complex we won't see any of that complexity at the query construction level.