paleobot / pbdb2-dev

Initial design documents for pbdb 2.0
MIT License

An alternative model for references and implications #2

Open aazaff opened 2 weeks ago

aazaff commented 2 weeks ago

Overview

One of the first things we need to decide is whether we want to go Graph, Relational, Hybrid (azlib), or NoSQL. Here is a scenario for handling references under a hybrid model, with pros and cons.

Pros:

  1. Represent the references not as a table, but as a jsonb entry within the table (see below)
  2. Very easy to do azlib style lineage/version tracking and rollback

Cons:

  1. We lose the feature from PBot where we linked Persons to References. However, this may actually be a pro (see below)
  2. A hybrid model, as we've seen in AZLib, can be tricky to get right: if not done correctly, the JSON and the tables can become duplicated or de-synched. I think we now know enough about this to prevent it, but it is still an additional complexity, so I list it as a con.

Unknowns:

  1. I am not sure how performant the JSON search will be at scale. PBDB has far more entries than AZLib.
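One way to hedge against this unknown: Postgres supports GIN indexes over `jsonb`, which keep containment and key-existence queries fast even on large tables. A sketch, assuming the proposed `refs` table below with its `citation` column (the author name and bibjson field layout here are just placeholders):

```sql
-- Sketch only: assumes the proposed refs table with a jsonb "citation" column.
-- A GIN index supports fast containment (@>) and key-existence (?) queries.
CREATE INDEX refs_citation_gin ON refs USING GIN (citation);

-- Example containment query: find references by a specific author name,
-- assuming bibjson stores authors as an array of {"name": "..."} objects.
SELECT references_id
FROM refs
WHERE citation @> '{"author": [{"name": "Doe, J."}]}';

-- For fuzzy search over titles, an expression index may work better
-- (requires the pg_trgm extension):
CREATE INDEX refs_title_trgm ON refs
  USING GIN ((citation ->> 'title') gin_trgm_ops);
```

Whether this holds up at PBDB scale (~90k references and growing, plus the much larger occurrence tables) is something we could benchmark with a dump of the current `refs` table before committing.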

Example

Okay, so here is how I envision this scenario operating. The current references table looks like the following.

CREATE TABLE `refs` (
  `authorizer` varchar(64) NOT NULL DEFAULT '',
  `enterer` varchar(64) NOT NULL DEFAULT '',
  `modifier` varchar(64) DEFAULT NULL,
  `authorizer_no` int(10) unsigned NOT NULL DEFAULT 0,
  `enterer_no` int(10) unsigned NOT NULL DEFAULT 0,
  `modifier_no` int(10) unsigned NOT NULL DEFAULT 0,
  `reference_no` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `author1init` varchar(10) DEFAULT NULL,
  `author1last` varchar(255) NOT NULL DEFAULT '',
  `author2init` varchar(10) DEFAULT NULL,
  `author2last` varchar(255) DEFAULT NULL,
  `otherauthors` varchar(255) DEFAULT NULL,
  `pubyr` varchar(4) NOT NULL DEFAULT '',
  `reftitle` mediumtext DEFAULT NULL,
  `pubtitle` mediumtext DEFAULT NULL,
  `editors` varchar(255) DEFAULT NULL,
  `publisher` varchar(255) DEFAULT NULL,
  `pubcity` varchar(80) DEFAULT NULL,
  `pubvol` varchar(10) DEFAULT NULL,
  `pubno` varchar(10) DEFAULT NULL,
  `firstpage` varchar(10) DEFAULT NULL,
  `lastpage` varchar(10) DEFAULT NULL,
  `upload` enum('','YES') DEFAULT NULL,
  `created` datetime NOT NULL DEFAULT current_timestamp(),
  `created_backup` datetime DEFAULT NULL,
  `modified` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
  `publication_type` enum('journal article','book','book chapter','book/book chapter','serial monograph','compendium','Ph.D. thesis','M.S. thesis','abstract','guidebook','news article','unpublished') DEFAULT NULL,
  `classification_quality` enum('authoritative','standard','compendium') NOT NULL DEFAULT 'standard',
  `basis` enum('','stated with evidence','stated without evidence','second hand','none discussed','not entered') DEFAULT NULL,
  `language` enum('Chinese','English','French','German','Italian','Japanese','Portugese','Russian','Spanish','other','unknown') DEFAULT NULL,
  `doi` varchar(80) DEFAULT NULL,
  `comments` mediumtext DEFAULT NULL,
  `project_name` set('decapod','ETE','5%','1%','PACED','PGAP','fossil record') DEFAULT NULL,
  `project_name_backup` varchar(32) NOT NULL DEFAULT '',
  `project_ref_no` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`reference_no`),
  KEY `author1last` (`author1last`),
  KEY `pubyr` (`pubyr`),
  KEY `authorizer_no` (`authorizer_no`),
  KEY `enterer_no` (`enterer_no`),
  KEY `modifier_no` (`modifier_no`),
  KEY `modified` (`modified`),
  FULLTEXT KEY `reftitle` (`reftitle`),
  FULLTEXT KEY `pubtitle` (`pubtitle`)
) ENGINE=InnoDB AUTO_INCREMENT=89282 DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_general_ci;

Notice that this table is very flattened. Instead, I think the table should look like the following.

CREATE TABLE refs (
    references_id integer GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    reference_no integer NOT NULL, -- permanent id that stays constant regardless of version
    reference_type_no integer REFERENCES dictionaries.reference_types("reference_type_no"),
    authorizer_no integer NOT NULL REFERENCES person("person_no"), -- whoever was the authorizer of the enterer
    enterer_no integer NOT NULL REFERENCES person("person_no"), -- whoever made the entry or edit
    citation jsonb, -- actual bibjson of the citation: https://github.com/rdmpage/bibliographic-metadata-json
    preceded_by integer REFERENCES refs("references_id"),
    succeeded_by integer REFERENCES refs("references_id"),
    removed boolean
);

As can be seen here, it becomes super easy for us to track who made what changes when, and to roll back if necessary. This is much better than the current database, which tracks a modifier_no for the latest change but doesn't track what the changes were. Using bibjson to handle the citations is also dramatically cleaner than trying to flatten them as has been done in the past, and it allows us to track more information, like abstracts. Bibliographic information is inherently hierarchical in a way that is a pain in the ass to normalize.
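To make the lineage mechanics concrete, here is a sketch of what an edit and a "current version" lookup might look like under this schema. The citation contents and person ids are placeholders, and the exact edit workflow (who sets `succeeded_by`, and when) is an open design question:

```sql
-- Version 1 of reference_no 12345.
INSERT INTO refs (reference_no, authorizer_no, enterer_no, citation)
VALUES (12345, 7, 42,
  '{"type": "article", "title": "Exemple title",
    "author": [{"name": "Doe, J."}], "year": "1998"}');

-- An edit inserts a NEW row rather than mutating the old one, then links
-- the two rows through preceded_by / succeeded_by.
WITH old AS (
  SELECT references_id FROM refs
  WHERE reference_no = 12345 AND succeeded_by IS NULL
), new_row AS (
  INSERT INTO refs (reference_no, authorizer_no, enterer_no, citation, preceded_by)
  SELECT 12345, 7, 42,
    '{"type": "article", "title": "Example title",
      "author": [{"name": "Doe, J."}], "year": "1998"}'::jsonb,
    old.references_id
  FROM old
  RETURNING references_id, preceded_by
)
UPDATE refs SET succeeded_by = new_row.references_id
FROM new_row
WHERE refs.references_id = new_row.preceded_by;

-- The current version of any reference is simply the row with no successor:
SELECT * FROM refs
WHERE reference_no = 12345
  AND succeeded_by IS NULL
  AND NOT coalesce(removed, false);
```

Rollback is then just another insert whose citation is copied from an earlier version, so the full history stays intact.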

A "downside" is that we would lose the PBot concept of having every reference author in the Person table (i.e., relations between Person nodes and Reference nodes). This was a very cool and (I think) popular idea that was meant to make it easy to quickly see each person's body of work, but I think we've run into quite a few problems in practice.

  1. We've seen a lot of annoying user account/person collisions in PBot (see, https://github.com/paleobot/pbot-dev/issues/297, https://github.com/paleobot/pbot-dev/issues/233)
  2. It slows down the process of adding references in PBot compared to PBDB because people have to create or search for all the authors as person nodes first instead of just putting in the name.
  3. The PBDB references don't have a similar linking concept, which means we would have to manually go back and create thousands (or at least hundreds) of new person entries AND also create the authored_by relationships.
aazaff commented 2 weeks ago

Additional Con (or unknown): We don't expose the AZLib upload API publicly, but our intention is to do so for PBDB.

  1. Would it be confusing for users to have to use json as part of their upload instead of a table?
  2. How difficult will it be to provide validation and checking that the JSON they pass us is not just valid JSON, but valid bibjson (or whatever spec we end up using)?
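On point 2: full spec validation would probably live in the upload API (e.g. via a JSON Schema for bibjson), but Postgres can at least enforce a minimal structural backstop at the database level. A sketch, assuming bibjson-style field names; the exact required set is up for debate:

```sql
-- Minimal structural check: the citation must be a JSON object with at
-- least a title and a type. This is NOT full bibjson validation, just a
-- last line of defense behind whatever the API layer checks.
ALTER TABLE refs ADD CONSTRAINT citation_is_minimal_bibjson CHECK (
  jsonb_typeof(citation) = 'object'
  AND citation ? 'title'
  AND citation ? 'type'
);
```

A constraint like this also partially answers point 1: users could still upload JSON directly, but malformed records would be rejected with a clear error rather than silently stored.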