One of the first things we need to decide is whether we want to go Graph, Relational, Hybrid (AZLib-style), or NoSQL. Here is a scenario for handling references under a hybrid model, with pros and cons.
Pros:
Represent each citation not as a set of flattened columns, but as a single jsonb entry within the references table (see below)
Very easy to do AZLib-style lineage/version tracking and rollback
Cons:
We lose the feature from PBot where we linked Persons to References. However, this may actually be a pro (see below)
A hybrid model, as we've seen in AZLib, can be tricky: if it is not done correctly, the JSON and the tables can end up duplicating each other or drifting out of sync. I think we now know enough about this to prevent it, but it is still an additional complexity, so I list it as a con.
Unknowns:
I am not sure how performant the JSON search will be at scale. PBDB has far more entries than AZLib.
Example
Okay, so here is how I envision this scenario operating. The current references table looks like the following.
Notice that this table is very flattened. Instead, I think the table should look like the following.
CREATE TABLE refs (
    references_id integer GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    reference_no integer NOT NULL, -- This is a permanent id that will stay constant regardless of version
    reference_type_no integer REFERENCES dictionaries.reference_types("reference_type_no"),
    authorizer_no integer REFERENCES person("person_no") NOT NULL, -- Whoever was the authorizer of the enterer
    enterer_no integer REFERENCES person("person_no") NOT NULL, -- Whoever made the entry or edit
    citation jsonb, -- actual bibjson of the citation https://github.com/rdmpage/bibliographic-metadata-json
    preceded_by integer REFERENCES refs("references_id"),
    succeeded_by integer REFERENCES refs("references_id"),
    removed boolean
);
As can be seen here, it becomes super easy for us to track who made which changes, and in what order, and to roll back if necessary. This is much better than the current database, which tracks a modifier_no for the latest change but doesn't track what the changes were. Using bibjson to handle the citations is also dramatically cleaner than trying to flatten the citations as was done in the past, and it also allows us to track more information, like the abstract. Bibliographic information is inherently hierarchical in a way that is a pain in the ass to normalize.
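To make the versioning idea concrete, here is a minimal in-memory sketch of how the preceded_by / succeeded_by chain could behave. It is not the AZLib implementation and it does not touch a database: the RefRow and RefStore names, and the choice to roll back by appending a copy of the earlier citation as a new version, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RefRow:
    """In-memory stand-in for one row of the proposed refs table (hypothetical)."""
    references_id: int                 # unique per version
    reference_no: int                  # permanent id, constant across versions
    enterer_no: int                    # whoever made this entry or edit
    citation: dict                     # the bibjson document
    preceded_by: Optional[int] = None  # references_id of the older version
    succeeded_by: Optional[int] = None # references_id of the newer version
    removed: bool = False


class RefStore:
    """Hypothetical helper that mimics inserts into refs with a dict."""

    def __init__(self):
        self.rows = {}       # references_id -> RefRow
        self._next_id = 1    # stands in for the identity column

    def _insert(self, reference_no, enterer_no, citation, preceded_by=None):
        row = RefRow(self._next_id, reference_no, enterer_no,
                     dict(citation), preceded_by)
        self.rows[row.references_id] = row
        self._next_id += 1
        return row

    def create(self, reference_no, enterer_no, citation):
        return self._insert(reference_no, enterer_no, citation)

    def current(self, reference_no):
        """The live version is the one that nothing has succeeded."""
        for row in self.rows.values():
            if row.reference_no == reference_no and row.succeeded_by is None:
                return row
        raise KeyError(reference_no)

    def edit(self, reference_no, enterer_no, citation):
        """Never update in place: add a new row and link it to the old one."""
        old = self.current(reference_no)
        new = self._insert(reference_no, enterer_no, citation,
                           preceded_by=old.references_id)
        old.succeeded_by = new.references_id
        return new

    def rollback(self, reference_no, enterer_no):
        """Restore the previous citation by appending it as a new version."""
        head = self.current(reference_no)
        if head.preceded_by is None:
            raise ValueError("no earlier version to roll back to")
        previous = self.rows[head.preceded_by]
        return self.edit(reference_no, enterer_no, previous.citation)

    def history(self, reference_no):
        """Walk the chain from newest to oldest."""
        row = self.current(reference_no)
        chain = [row]
        while row.preceded_by is not None:
            row = self.rows[row.preceded_by]
            chain.append(row)
        return chain
```

In this sketch a rollback is itself just another edit, so the history also records who rolled back. Whether we want that, or would rather re-point the chain at the older row, is a design choice we would still need to make.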
A "downside" is that we would lose the PBot concept of having every reference author in the Person table (i.e., relations between Person nodes and Reference nodes). This was a very cool and (I think) popular idea that was meant to make it quick to see each person's body of work, but I think we've run into quite a few problems in practice:
It slows down the process of adding references in PBot compared to PBDB because people have to create or search for all the authors as person nodes first instead of just putting in the name.
The PBDB references don't have a similar linking concept, which means we would have to manually go back and create thousands (or at least hundreds) of new person entries and also create the authored_by relationships.
Additional Con (or unknown):
We don't expose the AZLib upload API publicly, but our intention is to do so for PBDB.
Would it be confusing for users to have to use JSON as part of their upload instead of a table?
How difficult will it be to validate that the JSON they pass us is not just valid JSON, but valid bibjson (or whatever spec we end up using)?
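As a rough feel for the validation question, here is a minimal sketch of a two-stage check: first that the upload parses as JSON at all, then that it has the shape we expect. The required fields used here (a string title, and an author list whose entries each carry a name) are placeholders I made up for illustration, not taken from the bibjson spec; the real rules would come from whichever spec we adopt, and a JSON Schema validator may well be a better fit than hand-written checks.

```python
import json

# Hypothetical rules for illustration only; the real list would come from
# whichever bibjson spec is adopted.
REQUIRED_FIELDS = {"title": str, "author": list}


def validate_citation(raw):
    """Return a list of human-readable problems; an empty list means it passed."""
    # Stage 1: is it valid JSON at all?
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]

    if not isinstance(doc, dict):
        return ["citation must be a JSON object"]

    # Stage 2: does it have the (assumed) required fields with the right types?
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing required field '{field}'")
        elif not isinstance(doc[field], expected):
            problems.append(f"field '{field}' must be of type {expected.__name__}")

    # Each author entry is assumed to be an object with a string 'name'.
    authors = doc.get("author")
    if isinstance(authors, list):
        for i, author in enumerate(authors):
            if not isinstance(author, dict) or not isinstance(author.get("name"), str):
                problems.append(f"author[{i}] must be an object with a string 'name'")
    return problems
```

Returning every problem at once, rather than failing on the first, is probably friendlier for people using a public upload API, since they can fix a citation in one pass.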