ontodev / valve.rs

A lightweight validation engine written in rust.
BSD 3-Clause "New" or "Revised" License
4 stars 0 forks source link

ontodev_valve/valve.rs

Valve - A lightweight validation engine written in rust.

Table of contents

ToC Generated using markdown-toc

Design and concepts

Valve reads in the contents of user-defined data tables and represents them using a spreadsheet-like structure incorporating annotated cells that may subsequently be edited and saved. The annotations are of two kinds:

  1. Data validation messages indicating that the value of a particular cell violates one or more of a number of user-defined data validation rules.
  2. Data update messages indicating that the value of a particular cell has changed as well as the details of the change.

After reading and optionally validating the initial data, the validated information is loaded into a database that can then be queried directly using your favourite SQL client (e.g., sqlite3 or psql) as in the examples below. The currently supported databases are SQLite and PostgreSQL. Valve further provides an Application-programmer interface (API) that can be used to incorporate Valve's data validation and manipulation functionality and to visualize the validated data within your own rust project. For an example of a rust project that uses Valve, see ontodev/nanobot.rs. Finally, Valve also provides a Command line interface (CLI) for some of its more important functions like load, save, and guess.

The Valve database

Data tables and views

Below is an example of one of the views Valve provides into a data table in which three of the five rows contain invalid data, and one row has been edited since it was initially loaded. The output is from a psql session connected to a PostgreSQL database called 'valve_postgres'.

valve_postgres=> select * from table11_text_view
row_number row_order message history rank name employee_id
1 1000 Eastern division head Denise Laviolette denise_l
2 2000 Southwest division head Eddie Cahill eddie_c
3 3000 [{"column":"employee_id","value":"denise_l","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] [[{"column":"rank","level":"update","message":"Value changed from 'intern' to 'Chief executive officer'","old_value":"intern","value":"Chief executive officer"}]] Chief executive officer Denise Laurier denise_l c f
4 4000 [{"column":"employee_id","value":"eddie_c","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] Development team lead Edward Coulombe eddie_c
5 5000 [{"column":"employee_id","value":"eddie_c","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] Test team lead Ed Crantivera eddie_c
(5 rows)

Here, row_number is a fixed identifier assigned to each row at the time of creation, while row_order, which in principle might change multiple times throughout the lifetime of a given row, is used to manipulate the logical order of rows in the table. The message column, in this example, conveys that the value 'denise_l' of the column "employee_id" in row 3, and the value 'eddie_c' of the same column in rows 4 and 5, are duplicates of already existing values of that column, which happens to be a primary key, and are therefore invalid. We can also see, from the history column, that there has been one change, in which one of the values in row 3, namely the value of the column "rank", has been changed from 'intern' to 'Chief executive officer'. Finally, the names of the columns to the right of the history column correspond to the column names of the source table and will therefore vary from table to table. Normally these column names are specified in the header of a '.tsv' file from which the source data is read, though see below for alternate input data formats and associated table options. In any case the data contained in the columns to the right of the history column will exactly match the contents of the source table unless the data has been edited since it was initially loaded.

For the example below we will assume that a file named artists.tsv exists on your hard disk in your current working directory with the following contents:

name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix
The Jimi Hendrix Experience band 3 Pittsfield Medical 9834564422
Cream band 3 Blue Cross 9388883311 XX54
Jennifer Lopez solo Medi-Assist MA67886666881
Janice Joplin solo Pittsfield Medical FFFHYZDFJ432
Chrissie Hynde solo Blue Cross 4422393877
Van Halen band 5 Blue Cross 9476587117 BBDC
Van Morrison solo Medi-Assist MA67920004571
Paul McCartney solo Medi-Assisr MA60768763987
The Band band five Blue Cross 0831133887
Bob Dylan solo Pittsfield Medical FFF GYU ZKJ 954
Van Halen band Blue Cr. 9476587117

We will also assume that a file named providers.tsv exists on your hard disk in your current working directory with the following contents:

name address
Blue Cross 123 Fake Street, Fake Town, USA, 55123
Medi-Assist 933 Phoney Boulevard, Accra, Ghana, GA008
Pittsfield Medical 510 North Street, Pittsfield, MA, 01201

In order for Valve to read this table it must first be configured to do so. This is done using a number of special data tables called configuration tables, usually represented using further '.tsv' files, that contain information about:

  1. Where to find the data tables and other general properties of each managed data table (i.e., the table table).
  2. The datatypes represented in the various data tables (i.e., the datatype table).
  3. Information about the columns of each data table: their associated datatypes, data dependencies between the values in one table column and some other one, and any other constraints or restrictions on individual columns (i.e., the column table).
  4. Rules constraining the joint values of two different cells in a single given row of data (i.e., the rule table).

For our example we will assume that Valve's configuration tables contain the following entries:

table path description type options
artists artists.tsv
providers providers.tsv
table column label nulltype default datatype structure description
artists name trimmed_line primary
artists type trimmed_line
artists number_of_members empty integer
artists health_insurance_provider trimmed_line from(providers.name)
artists health_insurance_id nonspace
artists health_insurance_id_suffix empty word
providers name trimmed_line primary
providers address text
datatype parent condition description sql_type
integer nonspace match(/-?\d+/) INTEGER
nonspace trimmed_line exclude(/\s/)
trimmed_line line match(/\S([^\n]\S)/)
line text exclude(/\n/)
word nonspace exclude(/\W/)
empty text equals('')
text TEXT
table when_column when_condition then_column then_condition level description
artists type equals(band) number_of_members not null error a band must specify the number of members
artists health_insurance_provider equals('Blue Cross') health_insurance_id_suffix not null error a health insurance id suffix is required for Blue Cross members
artists health_insurance_provider equals('Pittsfield Medical') health_insurance_id word error a Pittsfield Medical ID must consist in a single word

For the meanings of all of the columns in the configuration tables above, see the section on configuration. In the rest of this section we'll refrain from explaining the meaning of a particular configuration table column unless and until it becomes relevant to our example. What is relevant at this point is only that each configuration table is also a data table whose contents are themselves subject to validation by Valve. In other words Valve will not necessarily fail to run if there are errors in its configuration (as long as those errors aren't critical) and it can moreover help to identify what those errors are. The upshot is that almost everything I mention below regarding artists also applies to the special configuration tables table, column, datatype and rule unless otherwise noted.

Once it has been read in by Valve from its source file, a given logical table will be represented, in the database, by between one and two database tables and by as many as two database views. In our example, the source data, contained in the file 'artists.tsv', represents (according to the table table configuration) a normal data table with the default options set, which means that all four database tables and views will be created. These are:

  1. artists: The database table, having the same name as the logical table, that normally contains the bulk of the table's data.
  2. artists_conflict (only when the conflict option is set): Valve aims to represent user data whether or not that data is valid. When it is not valid, however, this presents an obstacle when it comes to representing it in a relational database like SQLite or PostgreSQL. For instance, if a table contains a primary key column, then data rows containing values for the column that already exist in the table should be marked as invalid, but still somehow represented by Valve. However the implied database constraint on the database table will prevent us from inserting duplicate values to a primary key column. A similar issue arises for unique and foreign key constraints. To get around the database limitations that are due to database constraints such as primary, unique, or foreign keys, Valve constructs a _conflict table that is identical to the normal version of the table, but that does not include those keys. In our example, any rows containing duplicate values of the primary key column will be inserted to the database table called artists_conflict, while the other rows will be inserted to artists.
  3. artists_view (only when the conflict option is set): When Valve validates a row from the logical table, artists, the validation messages it generates are not stored in the database table called artists but instead in a special internal database table called message. Similarly, when Valve adds, updates, or deletes a row from artists, a record of the change is not stored in artists, but in a special internal database table called history. Because it can be convenient for information about the validity of a particular cell to be presented side by side with the value of the cell itself, Valve constructs a table called artists_view which combines information from artists and artists_conflict with the message and history tables.
  4. artists_text_view (only when the conflict option is set): In addition to the restrictions associated with primary, unique, and foreign key constraints, the database will also not allow us to represent values that are invalid due to an incorrect SQL type, in particular when one attempts to insert a non-numeric string value to a numeric column of a database table. To get around this limitation, Valve constructs a database view called artists_text_view. Unlike artists_view, which is defined such that the datatype of each column in artists_view exactly matches the datatype of the corresponding datatype in artists, in artists_text_view all of the columns are cast to TEXT so that no SQL datatype errors can occur when representing the data.

After loading the data from artists.tsv into the database, these four tables and views will be found to have the following contents:

valve_postgres=> select * from artists;
row_number row_order name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix
1 1000 The Jimi Hendrix Experience band 3 Pittsfield Medical 9834564422
2 2000 Cream band 3 Blue Cross 9388883311 XX54
3 3000 Jennifer Lopez solo Medi-Assist MA67886666881
4 4000 Janice Joplin solo Pittsfield Medical FFFHYZDFJ432
6 6000 Van Halen band 5 Blue Cross 9476587117 BBDC
7 7000 Van Morrison solo Medi-Assist MA67920004571
(6 rows)

valve_postgres=> select * from artists_conflict
row_number row_order name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix
5 5000 Chrissie Hynde solo Blue Cross 4422393877
8 8000 Paul McCartney solo Medi-Assisr MA60768763987
9 9000 The Band band Blue Cross 0831133887
10 10000 Bob Dylan solo Pittsfield Medical FFF GYU ZKJ 954
11 11000 Van Halen band 5 Pittsfield Med. 9476587117
(5 rows)

valve_postgres=> select * from artists_view order by row_number;
row_number row_order name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix message history
1 1000 The Jimi Hendrix Experience band 3 Pittsfield Medical 9834564422
2 2000 Cream band 3 Blue Cross 9388883311 XX54
3 3000 Jennifer Lopez solo Medi-Assist MA67886666881
4 4000 Janice Joplin solo Pittsfield Medical FFFHYZDFJ432
5 5000 Chrissie Hynde solo Blue Cross 4422393877 [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"}]
6 6000 Van Halen band 5 Blue Cross 9476587117 BBDC
7 7000 Van Morrison solo Medi-Assist MA67920004571
8 8000 Paul McCartney solo Medi-Assisr MA60768763987 [{"column":"health_insurance_provider","value":"Medi-Assisr","level":"error","rule":"key:foreign","message":"Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name"}]
9 9000 The Band band Blue Cross 0831133887 [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"},{"column":"number_of_members","value":"five","level":"error","rule":"datatype:integer","message":"number_of_members should be a positive or negative integer"}]
10 10000 Bob Dylan solo Pittsfield Medical FFF GYU ZKJ 954 [{"column":"health_insurance_id","value":"FFF GYU ZKJ 954 ","level":"error","rule":"datatype:nonspace","message":"health_insurance_id should be text without whitespace"},{"column":"health_insurance_provider","value":"Pittsfield Medical","level":"error","rule":"rule:health_insurance_provider-2","message":"a Pittsfield Medical health insurance id must be a single word"}]
11 11000 Van Halen band 5 Pittsfield Med. 9476587117 [{"column":"health_insurance_provider","value":"Pittsfield Med.","level":"error","rule":"key:foreign","message":"Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name"},{"column":"name","value":"Van Halen","level":"error","rule":"key:primary","message":"Values of name must be unique"}]
(11 rows)

valve_postgres=> select * from artists_text_view order by row_number;
row_number row_order name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix message history
1 1000 The Jimi Hendrix Experience band 3 Pittsfield Medical 9834564422
2 2000 Cream band 3 Blue Cross 9388883311 XX54
3 3000 Jennifer Lopez solo Medi-Assist MA67886666881
4 4000 Janice Joplin solo Pittsfield Medical FFFHYZDFJ432
5 5000 Chrissie Hynde solo Blue Cross 4422393877 [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"}]
6 6000 Van Halen band 5 Blue Cross 9476587117 BBDC
7 7000 Van Morrison solo Medi-Assist MA67920004571
8 8000 Paul McCartney solo Medi-Assisr MA60768763987 [{"column":"health_insurance_provider","value":"Medi-Assisr","level":"error","rule":"key:foreign","message":"Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name"}]
9 9000 The Band band five Blue Cross 0831133887 [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"},{"column":"number_of_members","value":"five","level":"error","rule":"datatype:integer","message":"number_of_members should be a positive or negative integer"}]
10 10000 Bob Dylan solo Pittsfield Medical FFF GYU ZKJ 954 [{"column":"health_insurance_id","value":"FFF GYU ZKJ 954 ","level":"error","rule":"datatype:nonspace","message":"health_insurance_id should be text without whitespace"},{"column":"health_insurance_provider","value":"Pittsfield Medical","level":"error","rule":"rule:health_insurance_provider-2","message":"a Pittsfield Medical health insurance id must be a single word"}]
11 11000 Van Halen band 5 Pittsfield Med. 9476587117 [{"column":"health_insurance_provider","value":"Pittsfield Med.","level":"error","rule":"key:foreign","message":"Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name"},{"column":"name","value":"Van Halen","level":"error","rule":"key:primary","message":"Values of name must be unique"}]
(11 rows)

The message and history tables

In the previous section we mentioned that artists_view and artists_text_view are defined in terms of a query that draws on data from artists and artists_conflict as well as from the special internal tables called message and history. The information in these tables can also be queried directly. After loading the data from artists.tsv into the database, one will find that the message table contains the following contents:

valve_postgres=> select * from message where "table" = 'artists';
message_id table row column value level rule message
62 artists 5 health_insurance_provider Blue Cross error rule:health_insurance_provider-1 a health insurance id suffix must be specified for Blue Cross members
63 artists 8 health_insurance_provider Medi-Assisr error key:foreign Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name
64 artists 9 number_of_members five error datatype:integer number_of_members should be a positive or negative integer
65 artists 9 health_insurance_provider Blue Cross error rule:health_insurance_provider-1 a health insurance id suffix must be specified for Blue Cross members
66 artists 10 health_insurance_provider Pittsfield Medical error rule:health_insurance_provider-2 a Pittsfield Medical health insurance id must be a single word
67 artists 10 health_insurance_id FFF GYU ZKJ 954 error datatype:nonspace health_insurance_id should be text without whitespace
68 artists 11 name Van Halen error key:primary Values of name must be unique
69 artists 11 health_insurance_provider Pittsfield Med. error key:foreign Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name
(8 rows)

Here,

The history table will be empty immediately after the initial loading of the database.

valve_postgres=> select * from history where "table" = 'artists';
history_id table row from to summary user undone_by timestamp
(0 rows)

Here,

Note that the summary column of the history table is where the information in the view columns artists_view.history and artists_text_view.history is taken from, which is why these view columns are always in the form of an array of arrays, i.e., an array of summary records for the given row.

Data validation and editing

Representing validated data

For the purposes of this section we will define a cell as a single value, that can either be valid or invalid, of one of the columns of a given data table. We will further define a row of data, from a given data table, to be an array of correlated cells, such that for every column in the table there is one and only one cell in the row corresponding to it. Once a given row of data has been inserted into the database and (optionally) validated, Valve represents it using a ValveRow struct (see the API reference) containing the following information:

  1. The row's fixed, unique, identifier, or row_number.
  2. A map from the names of the columns contained in the row, to the ValveCell struct associated with each, where the latter represents the results of running Valve's validation engine (see the validation process) on the cell associated with that column.

In particular, each ValveCell contains the following information:

  1. The value of the cell.
  2. Whether the value should be interpreted as a null value of the given column, and if so, the column's nulltype.
  3. Whether or not the value is valid given the column's column configuration.
  4. A list of ValveCellMessage structs representing the validation messages associated with this cell.

Each ValveCellMessage, in turn, contains the following information:

  1. The level, or severity, of the message (e.g., "error", "warn", "info").
  2. An alphanumeric identifier for the rule violation described by the message (see the section on rule violation identifiers).
  3. The text of the message.

The following is a textual representation of an example of a ValveRow:

ValveRow {
    row_number: Some(
        11,
    ),
    contents: {
        "id": ValveCell {
            nulltype: None,
            value: String("BFO:0000027"),
            valid: true,
            messages: [],
        },
        "name": ValveCell {
            nulltype: None,
            value: String("Mike"),
            valid: true,
            messages: [],
        },
        "location": ValveCell {
            nulltype: None,
            value: String("baree"),
            valid: false,
            messages: [
                ValveCellMessage {
                    level: "error",
                    rule: "key:foreign",
                    message: "Value 'baree' of column location is not in cities.city_name",
                },
            ],
        },
        "preferred_seafood": ValveCell {
            nulltype: Some("empty"),
            value: String(""),
            valid: true,
            messages: [],
        },
    },
}
Rule violation identifiers

Valve uses the following to identify the rule that has been violated by a given cell value in the ValveCellMessage associated with the violation:

The validation process

Validating a row of data

For a given row of data in a given data table, Valve's validation process consists in a series of checks that are schematically represented in the flowchart below and subsequently explained in more detail.

flowchart TD
    node1["1. Determine the nulltype (if any) of each cell in the row"]
    node1 -- "Then, for each cell:" --> node2
    node2["2. Validate rules"]
    node2 --> modal1
    modal1{"Does the cell have a nulltype?"}
    modal1 -- No --> node3
    modal1 -- Yes, skip further validation for this cell --> modal3
    node3["3. Validate datatype"]
    node3 --> modal2
    modal2{"Does the cell value contain a SQL type error?"}
    modal2 -- No --> node4
    modal2 -- Yes --> modal3
    node4["4. Validate foreign constraints"]
    node4 --> node5
    node5["5. Validate primary and unique Constraints"]
    node5 --> modal3
    modal3{"Have we iterated over all of the cells in the row?"}
    modal3 -- Yes, then over the table as a whole: --> node6
    modal3 -- No, go on to the next cell --> node2
    node6["6. Validate tree-foreign keys"]
Determining the nulltype of a cell

The validation process begins by determining, for each cell in the row, whether the value of that cell matches the nulltype (if any) of its associated column, as defined in the column table. In particular, if the value of the cell matches the nulltype of its associated column, then the nulltype field of the ValveCell struct used to represent the cell will be set to indicate that the value is a null value of that type. Otherwise the nulltype field will remain unset, indicating that the value is not a null value. For instance, suppose that the cell value is '' (i.e., the empty string), and that the nulltype for its associated column, as defined in the column table, is empty. Since empty's associated condition, as defined in the datatype table is equals(''), applying it to the cell value will result in a match, and Valve will set the nulltype field for the ValveCell representing this particular cell to empty. In the case where the value of the cell does not match the condition associated with the datatype, empty, (i.e., when the cell value is something other than an empty string), the validation process will leave the nulltype field of the ValveCell unset.

Validating rules

This step of the validation process determines whether any of the rules in the rule table that are applicable to a cell have been violated. A rule in the rule table is applicable to a cell when the cell's associated column is the same as the when_column associated with the rule. Note that since the rules in the rule table correspond to if-then conditionals, such that the antecedent and consequent of a given conditional refer (in general) to two distinct columns, a rule violation may indicate that there is a problem with the value of either or both. Whenever a rule violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell, identifying the particular violation that occurred (see the section on rule violation IDs) and its associated level and description as found in the rule table.

Validating datatypes

This step of the validation process determines whether a cell's value violates the datatype condition, as defined in the datatype table, for the datatype associated with the cell's column in the column table. When a datatype violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell, identifying the particular datatype violation that occurred (see the section on rule violation IDs). The text of the message is taken from the description of the datatype that has been violated (see the datatype table).

Validating foreign constraints

This step in the validation process verifies, for a given cell, that if the cell's associated column, C, has been configured with a structure of the form from(T, F) (see the column-table), where T is a foreign table and F is a column in T, then the cell's value (or values if C's datatype is a list datatype) is (are) among the values of F. Note that if the foreign table has a _conflict version (see the table table), then this function will distinguish between (a) the case in which a given value is not found in either the foreign table or its associated conflict table, and (b) the case in which a given value is found only in the conflict table. When a foreign constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the identifier key:foreign (see also the section on rule violation IDs). The text of the message will be of the form: Value 'V' of column C is not in T.F, whenever V is neither found in the normal version of the table nor (if applicable) its conflict version, and it will be of the form Value 'V' of column C exists only in T_conflict.F whenever V exists in T_conflict but not in T.

Validating primary and unique constraints

This step in the validation process verifies, for a given cell, that if the cell's associated column has been configured with either a primary or a unique constraint (see the column-table), then the cell's value is not among the values of the column that have already been inserted into the table, neither in the normal version of the table nor in the conflict version (in the case where the conflict option has been set). When a primary or unique constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the appropriate rule violation ID. The text of the message is of the form Values of COLUMN must be unique.

Validating tree-foreign keys

When a column, column2 has a structure tree(column1) defined on it in the column table, then all non-null values of column2 must exist in column1. This step in the validation process function verifies, for a given table, whether any of the values of any of the cells in any of the rows violate any of the table's tree() conditions. When a primary or unique constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the appropriate rule violation ID. The text of the message is of the form Value 'VAL' of column COLUMN2 is not in COLUMN1.

Batch validation

The algorithm described in the previous section is applicable to a single row of data. When initially loading the many rows of a data table from its source into the database, however, a number of optimizations may be used to speed up the validation process. In particular, some of the steps for validating a row's cells from the previous section do not need to refer to any data external to a given row, and may therefore be performed in parallel. These are steps 1–3, i.e., the nulltype, rules, and datatype validation steps. Only the remaining steps: foreign and unique/primary constraint validation, need be performed row by row.

Editing the data

Once a data table has been loaded into the database and validated, it may be desirable to further edit the data, for instance to resolve any outstanding validation messages, or for some other reason. The possible operations that can be performed on the data are insert_row(), update_row(), delete_row(), and move_row(). In addition the function validate_row() is used to generate validation information about the row during the editing process.

Note that insert_row(), update_row(), and validate_row() require that a row be specified in the following "simple" form:

{
    "column_1": value1,
    "column_2": value2,
    ...
},

The output of these three functions is a ValveRow (see representing validated data).

After editing the data, a record of the change will be inserted to the history table. The ValveRow representing the row as it was before the change (in the case of an update, delete, or move operation), and the ValveRow representing it as it is after the change (in the case of an update, move, or insert), will be used as the values of the from and to columns, respectively. Note that to represent a ValveRow in the database, "rich" JSON format is used:

{
    "column_1": {
        "valid": <true|false>,
        "messages": [{"level": level, "rule": rule, "message": message}, ...],
        "value": value1
    },
    "column_2": {
        "valid": <true|false>,
        "messages": [{"level": level, "rule": rule, "message": message}, ...],
        "value": value2
    },
    ...
},

Valve also provides the undo() and redo() functions, to undo the last insert, update, or delete operation, and to redo the last undo, respectively. For more information on the data manipulation and browsing operations provided by Valve see the section on the Valve API.

Saving data tables

To save a data table or tables to a '.tsv' file, one uses the API functions, save_all_tables(), save_tables(), or save_table(). Note that unless the save option has been set, it is not possible to overwrite the source file from which the table was originally loaded. One must save the table to an alternate location. For more information on saving data tables Valve see the section on the Valve API.

Installation and configuration

Prerequisites

Installation

Binary installation

Option 1: Build and install the binary for the latest release using crates.io by running:

cargo install ontodev_valve

Option 2: Download the appropriate binary for your system from the release page and copy it to your system's executable path.

Source installation

It is also possible to build the binary yourself from the source code. This is required if you would like to work with a version of Valve that is newer than the latest release. Begin by using git clone to clone the source code repository into a local folder:

git clone git@github.com:ontodev/valve.rs.git

Next, use cargo to build the ontodev_valve binary (this may take awhile to complete and will produce a lot of output):

cargo build --release

After the build has completed successfully, the ontodev_valve binary may be found (relative to the repository's root directory) in the target/release/ subdirectory. You should copy or move it from here to your user or system-wide bin/ directory, or to some other convenient location. For example,

cp target/release/ontodev_valve ~/bin/

Finally, run

ontodev_valve --help

to verify that Valve was installed correctly.

Configuration

Valve is configured primarily using a number of special configuration tables that can be represented as '.tsv' files. The most important of these is the table called 'table', also known as the table table. A table table configuration is required to use Valve. A column table and datatype table configuration are required as well. Optionally, the user may also specify a rule table configuration.

The table table is alone among the configuration tables in that it cannot be given an arbitrary name but must always be given the name 'table'. This is not the case for the column, datatype and rules tables. Although it is recommended to use the names 'column, 'datatype', and 'rule', respectively, alternate names may be chosen for these tables as explained below.

The table table

One normally configures the table table using a '.tsv' file called 'table.tsv', although any filename could be used in principle, and in fact it is also possible to read the table table directly from the database, as long as the database already contains a table called 'table' with the right setup.

Below is an example of a table table:

table path description type options
table schema/table.tsv table
column schema/column.tsv column
datatype schema/datatype.tsv datatype
rule schema/rule.tsv rule
user_table1 schema/user/user_table1.tsv
user_readonly_table1 schema/user/user_readonly_table1.tsv no-edit no-save no-conflict
user_view1 schema/user/user_view1.sql db_view
user_view2 schema/user/user_view2.sh db_view
user_view3 db_view

Note that in the first row above the table being described is the table table itself. In general the columns of the table table have the following significance:

Further information on path

The path column indicates where the data for a given table may be found. It can be (a) a '.tsv' file, (b) a '.sql' file, (c) some other executable file, or (d) it may be empty. In each case it will have the following consequences for the possible values of type and and for the possible options that may be used with the table.

  1. If path ends in '.tsv':

    • Its associated type may be any one of table, column, datatype, rule, or it may be empty.
    • The db_view option is not allowed. If set to true, Valve will fail with a "'.tsv' files are not supported for views" error.
  2. If path does not end in '.tsv':

    • Its associated type must be empty. Note that this implies that Valve's configuration tables must be specified using '.tsv' files.
    • The edit option is not allowed. If set to true, Valve will fail with an "Editable tables require a path that ends in '.tsv'" error.
  3. If path either ends in '.sql', or represents a generic executable:

    • If the db_view option is set, Valve assumes that the '.sql' file or generic executable indicated contains the instructions necessary to create the view and that the view displays the appropriate data.
    • If the db_view option is not set, Valve will take care of creating the table but it will expect that the '.sql' file or generic executable contains the statements necessary to load the data.
    • If path ends in '.sql', Valve reads in the statements contained in the '.sql' file and then executes them against the database.
    • If path represents a generic executable, then it is executed as a shell process. Note that the executable script or binary must accept two arguments indicating the location of the database and the name of the table that the path corresponds to.
  4. If path is empty:

    • If the db_view option is set, Valve assumes that the view already exists in the database.
    • If the db_view option is not set, Valve will take care of creating the table but it will not attempt to load it.

The above is conveniently summarized in the following table:

path possible types possible options created by loaded by†
Ends in '.tsv' any valid type, or empty db_view not allowed Valve Valve
Ends in '.sql' must be empty edit not allowed db_view: the '.sql' file, db_table: Valve Valve
Generic executable must be empty edit not allowed db_view: the executable, db_table: Valve Valve
empty must be empty edit not allowed No one (assumed to already exist) No one (assumed to be already loaded)

† Note that loading is only applicable when the db_view option has not been set.

Further information on options

If no options are specified, the options db_table, truncate, load, save, edit, validate_on_load, and conflict will all be set to true by default. The complete list of allowable options, and their meanings, are given below:

Commonly used path and option combinations

Here are some examples of commonly used table table configurations:

table path description type options
plain src/data/plain.tsv A database table with default options set, backed by a '.tsv' file
appendonly src/data/appendonly.tsv A database table which is appended to but never truncated no-truncate
readonly1 src/data/readonly1.tsv A read-only table that is backed by a '.tsv' file no-edit no-save
readonly2 src/data/readonly2.sh A read-only table whose contents are assumed to be valid, backed by a generic executable no-edit no-conflict no-validate-on-load
view1 src/data/view1.sql A view backed by a '.sql' file db_view
table src/table.tsv the table table table
column src/column.tsv the column table column
datatype src/datatype.tsv the datatype table datatype
rule src/rule.tsv the rule table rule

The column table

In addition to a table table, Valve also requires a column table. The column table configuration is normally stored in a file called 'column.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'column' in the table table. The column table contains one row for every column of every configured table. This includes both special configuration tables (such as the column table itself) and user-defined tables.

Below is an example column table, with the special configuration tables omitted:

table column label nulltype default datatype structure description
table1 column1 Column 1 value1
table1 column2 Column 2 empty integer from(table2.column2)
table2 column1 Column 1 trimmed_line primary
table2 column2 Column 2 integer unique
table3 column1 Column 1 word primary
table3 column2 Column 2 empty word tree(column1)

The columns of the column table have the following significance:

The datatype table

In addition to the table table and the column table, Valve also requires the user to configure a datatype table. The datatype table configuration is normally stored in a file called 'datatype.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'datatype' in the table table. The datatype table stores the definitions of the datatypes referred to in the datatype column of the column table.

Below is a subset of the rows of an example datatype table:

datatype parent condition description sql_type HTML type format
text any text TEXT textarea
integer text match(/-?\d+/) a positive or negative decimal digit, or 0 INTEGER %i
word text exclude(/\W/) a single word: letters, numbers, underscore %s
empty text equals('') the empty string NULL
custom1 text match(/\S+:\S+/) two nonspace character sequences separated by ':'
custom2 text in(alice, bob, cindy) either 'alice', 'bob', or 'cindy'
custom3 text list(word, ' ') a list of words separated by spaces
custom4 text search(/\d+/) a string containing a sequence of digits

The columns of the datatype table have the following significance:

Condition types
Required datatypes

Valve requires that the following datatypes be defined:

The recommended datatype configurations for these four datatypes are the following (note that the HTML type and format columns are optional):

datatype parent condition description sql_type HTML type format
text TEXT textarea
empty text equals('') NULL
line text exclude(/\n/) input
trimmed_line line match(/\S([^\n]\S)/)
nonspace trimmed_line exclude(/\s/)
word nonspace exclude(/\W/) %s

The rule table

In addition to the table table, the column table, and the datatype table, it is also possible (but optional) to configure a table of type 'rule', or a rule table. When it is configured, the rule table configuration is normally stored in a file called 'rule.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'rule' in the table table.

The rule table is used to define a number of rules of the following form:

        when CONDITION_ON_COLUMN_1 is satisfied then CONDITION_ON_COLUMN_2 must also be satisfied         (1)

where CONDITION_ON_COLUMN_1 and CONDITION_ON_COLUMN_2 are defined with respect to the same table.

Below is an example rule table:

table when column when condition then column then condition level description
table1 foo null bar not null error
table2 foo negative_int bar positive_int error

The columns of the rule table have the following significance:

Using guess

In some cases it is useful to be able to try and guess what the table table and column table configuration should be, using information about the current state of the Valve instance, for a given data table not currently managed by Valve. To do this one may use Valve's command line interface to run the guess subcommand as follows:

ontodev_valve guess [OPTIONS] SOURCE DATABASE TABLE_TSV

where:

For the list of possible options, and for general information on Valve's command line interface, see command line usage.

Below is an example of using guess:

$ ontodev_valve guess test/guess_test_data/table.tsv build/valve_guess.db test/guess_test_data/ontology/table2.tsv 

The following row will be inserted to "table":
table                                     path                                      type                                      description                               
table2                                    test/guess_test_data/ontology/table2.tsv                                                                                      

The following row will be inserted to "column":
table               column              label               nulltype            datatype            structure           description         
table2              zork                                                        integer             primary                                 
table2              zindy                                                       IRI                 unique                                  
table2              xyzzy                                                       suffix              unique                                  
table2              foo                                                         prefix                                                      
table2              bar                                                         IRI                 unique                                  

Do you want to write this updated configuration to the database? [y/N] y

Command line usage

The basic syntax when calling Valve on the command line is:

ontodev_valve [OPTIONS] <SUBCOMMAND <SUBCOMMAND POSITIONAL PARAMETERS>>

To view the list of possible subcommands and global options, run:

ontodev_valve --help

To get help on a particular subcommand, and on the positional parameters and options that are specific to it, run:

ontodev_valve SUBCOMMAND --help

Logging

By default Valve only logs error messages. To also enable warning and information messages, set the environment variable RUST_LOG to the minimum logging level desired for ontodev_valve: debug, info, warn, or error. For instance:

export RUST_LOG="ontodev_valve=info"

For further information see the Rust Cookbook.

API

The API (application-programmer interface) reference documentation for the latest release of Valve can normally be found on docs.rs. In case you would like to generate the API documentation for a newer version of Valve, run:

cargo doc

in the the root folder of your local copy of the valve.rs source code repository (see Installation and configuration), and then open the file

<valve.rs root folder>/target/doc/ontodev_valve/index.html

in your favourite browser. For a higher level See also the Design and concepts section.

Troubleshooting

TODO.

Python bindings

See valve.py

License: BSD-3-Clause