RFC: data structure - GitHub Issues

mvesper commented 9 years ago

The circulation module is going to be rewritten from scratch in Invenio 2, making it more flexible and covering more library use cases. The purpose of this RFC is to discuss the current state of the data structure design.

The functionality requested so far seems to be covered by the following entities:

    +-----------------+                                                   
+-->|Record           |                                                   
|   +-----------------+                                                   
|   |Won't change this|                                                   
|   +-----------------+                                                   
|                                                                         
|                                                                         
|                                                                         
|   +--------------+          +--------------+            +--------------+
|   |Item          |   +----->|User/Librarian|<-+    +--->|Library       |
|   +--------------+   |      +--------------+  |    |    +--------------+
+---+record        |   |      |id            |  |    |    |id            |
    |id            |   |      |name          |  |    |    |name          |
    |barcode       |   |      |rights        |  |    |    |location      |
    |location      |   |      |loan_condition|  |    |    |loan_condition|
    |loan_condition|   |      +--------------+  |    |    +--------------+
    |status        |   |                        |    |                    
    |description   |   |                        |    |                    
    |notes         |   |                        |    |                    
    +--------------+   |                        |    |                    
               ^ ^     |                        |    |                    
               | |     |                        |    |                    
               | |     |                        |    |                    
    +-------+  | |     |      +-------+         |    |                    
    |Request|  | |     |      |Event  |         |    |                    
    +-------+  | |     |      +-------+         |    |                    
    |item   +--+ +------------+item   |         |    |                    
    |user   +----------+      |user   +---------+    |                    
    |status |<----------------+request|              |                    
    +-------+                 |library+--------------+                    
                              |date   |                                  
                              |action |                                  
                              +-------+

The class members are dummies, there only to give a general idea of what kind of information each entity should carry. The precise database table definitions will evolve over time.

Reasoning:

Item objects represent the physical (well, e-books...) copies of a record, so they only carry information about those copies. As a consequence, Items don't know about their requests or their current holder.
User/Librarian objects provide little functionality of their own; they carry basic information about the person (maybe), their access rights and their loan_condition.
Library objects manage specific loan_conditions (maybe/probably more).
Request objects handle the lifecycle of a loan; they therefore know about the Item, the User/Librarian and the current status of the loan.
Events keep track of every action, which most of the time means a status change in a Request or Item. Thanks to the Event objects, no other entity needs to keep track of their history (good idea?).
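To make the division of responsibilities above a bit more concrete, here is a minimal Python sketch. It is only an illustration, not the planned implementation: the status values ("on_shelf", "requested", "on_loan"), the EventLog class and the loan_item helper are all assumptions made up for this example, and the fields are a subset of the dummy members in the diagram.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Item:
    id: int
    record: int                 # id of the (unchanged) Record entity
    barcode: str
    status: str = "on_shelf"    # hypothetical status value


@dataclass
class Request:
    id: int
    item: int                   # Item.id
    user: int                   # User/Librarian id
    status: str = "requested"   # hypothetical status value


@dataclass
class Event:
    item: int
    user: int
    request: Optional[int]
    library: int
    date: datetime
    action: str


class EventLog:
    """Append-only history (hypothetical helper).

    Items and Requests only hold their current state; their past
    is reconstructed from the events recorded here.
    """

    def __init__(self) -> None:
        self._events: List[Event] = []

    def record(self, event: Event) -> None:
        self._events.append(event)

    def history_of_item(self, item_id: int) -> List[str]:
        return [e.action for e in self._events if e.item == item_id]


def loan_item(item: Item, request: Request, librarian_id: int,
              library_id: int, log: EventLog) -> None:
    """Change the current state and leave the history to the event log."""
    item.status = "on_loan"
    request.status = "on_loan"
    log.record(Event(item.id, librarian_id, request.id, library_id,
                     datetime.now(timezone.utc), "loan"))
```

Note that Item has no reference to its Request or holder, and neither Item nor Request stores past states; anything historical has to be asked of the event log.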
UPDATE 20.05.15

The general approach has not changed, but there is now a proposal for how to store the data, which came out of a discussion with @jalavik:

Instead of defining a database schema or class definition that tries to handle all the different requirements (for example different locations, item details, barcodes) and naming conventions, the models basically carry two attributes: id and data (naming is open to discussion). The data attribute contains a string version of the object's attributes (something like the __dict__ attribute). To make the individual values searchable, Elasticsearch will be used for indexing.
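A rough sketch of the id/data idea, under loudly stated assumptions: SQLite from the Python standard library stands in for the real database, JSON is assumed as the string format, and a naive linear scan stands in for the search index. The function names (make_store, save_item, load_item, find_items) are invented for this example.

```python
import json
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory table with just the two proposed attributes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item (id INTEGER PRIMARY KEY, data TEXT NOT NULL)"
    )
    return conn


def save_item(conn: sqlite3.Connection, attributes: dict) -> int:
    """Serialise all attributes into the single data column."""
    cur = conn.execute(
        "INSERT INTO item (data) VALUES (?)", (json.dumps(attributes),)
    )
    return cur.lastrowid


def load_item(conn: sqlite3.Connection, item_id: int):
    row = conn.execute(
        "SELECT data FROM item WHERE id = ?", (item_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def find_items(conn: sqlite3.Connection, **criteria) -> list:
    """Naive stand-in for the search index: decode and scan every row.

    The database cannot filter on values inside the blob by itself,
    which is why the proposal needs a separate index.
    """
    matches = []
    for item_id, raw in conn.execute("SELECT id, data FROM item ORDER BY id"):
        data = json.loads(raw)
        if all(data.get(key) == value for key, value in criteria.items()):
            matches.append(item_id)
    return matches
```

Each installation can store whatever attributes it likes without a schema change, at the cost that every query on an attribute goes through the index (here, the scan) rather than through SQL.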

This basically means that both MySQL and Elasticsearch are required to make this approach work. However, since the upcoming modular approach of Invenio would allow the circulation module simply not to be installed where it is not needed, this should be OK.

Criticism and ideas are highly encouraged :)

tbasaglia commented 9 years ago

Sorry, I know that the class members are dummies, however:

There should be an additional aspect of the item, i.e. the support (ebook or paper).
Concerning 'User/Librarian': We just need users_categories instead of 'loan_condition' and 'rights'.

"no other entity needs to keep track of their history": I think so. However, what about records of paper books, for which we add the link to the ebook? In a sense, we add an item, so this should also be recorded as a status change in the history, even if it is basically a modification of a MARC field. So, if we decide that MARC should generate item information and not vice versa (decision to be taken!), modifications of the 876 and 852 fields should also trigger an event and status change (='item added').

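Purely as an illustration of the MARC-driven variant (which, as said, is not decided), a change to the 852 or 876 fields could be turned into events roughly like this. The record representation (a dict from MARC tag to a list of field values), the event dict layout and the function name are all assumptions for this sketch, and it only detects added values, not removals or edits.

```python
ITEM_FIELDS = ("852", "876")  # the MARC tags named in the comment above


def events_for_marc_change(old: dict, new: dict) -> list:
    """Return one 'item added' event per new value in an item-related field.

    old and new map a MARC tag to a list of field values (a simplified,
    hypothetical representation of a record).
    """
    events = []
    for tag in ITEM_FIELDS:
        before = old.get(tag, [])
        for value in new.get(tag, []):
            if value not in before:
                events.append(
                    {"action": "item added", "field": tag, "value": value}
                )
    return events
```

Changes to any other field produce no circulation event in this sketch.
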
tiborsimko commented 9 years ago

> Instead of defining a database schema or class definition that tries to handle all different requirements (for example different locations, item details, barcodes) and naming conventions, the models basically carry two attributes: id and data (naming is open to discussion).

Architecturally, I'd say there are two extreme approaches: (1) introduce a separate table column for each new property; (2) introduce only one "data" column and store every attribute there, e.g. as serialised JSON. The former seems close to the original proposal, the latter seems close to the updated proposal.

As often happens in life, what about finding a middle road? By studying all the various use cases for circulation, we could extract common attributes (e.g. status) and create columns for them, all the while maintaining additional attributes (e.g. front page colour, or whatever an installation may want to store) in a free blob. Advantage: one could easily profit from SQL relational constraints (and SQL queries) for non-ID attributes, too. (Example: an item's status column value would always be valid, ensured by a foreign key constraint.)
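A small sketch of this middle road, again only as an illustration: SQLite from the Python standard library is used so the example is self-contained, and the table names, the status values and the helper functions are assumptions invented here. The status lives in a real column guarded by a foreign key, while installation-specific attributes go into a free JSON blob.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE item_status (name TEXT PRIMARY KEY);
        INSERT INTO item_status (name)
            VALUES ('on_shelf'), ('on_loan'), ('missing');
        CREATE TABLE item (
            id     INTEGER PRIMARY KEY,
            status TEXT NOT NULL REFERENCES item_status(name),
            extra  TEXT NOT NULL DEFAULT '{}'   -- free JSON blob
        );
    """)
    return conn


def add_item(conn: sqlite3.Connection, status: str, **extra) -> int:
    """Common attribute goes into a column, everything else into the blob."""
    cur = conn.execute(
        "INSERT INTO item (status, extra) VALUES (?, ?)",
        (status, json.dumps(extra)),
    )
    return cur.lastrowid


def items_with_status(conn: sqlite3.Connection, status: str) -> list:
    """Plain SQL query on the relational column; blob decoded afterwards."""
    rows = conn.execute(
        "SELECT id, extra FROM item WHERE status = ? ORDER BY id", (status,)
    )
    return [(item_id, json.loads(extra)) for item_id, extra in rows]
```

With this layout, add_item(conn, "on_shelf", front_page_colour="red") succeeds, whereas inserting an item with a status that is not in item_status raises sqlite3.IntegrityError, so the status column can never hold an unknown value.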

> This basically means that MySQL and Elasticsearch are required in order to make this approach work

Not necessarily; PostgreSQL allows relational and JSON data to be combined very efficiently. In this way we could profit from both worlds, having (structured) RDBMS data for some attributes, and additional (free) JSON data for other attributes, all in one system.

inveniosoftware-attic / invenio-circulation-legacy

RFC: data structure #4
