ontodev_valve/valve.rs

Valve - A lightweight validation engine written in rust.

ToC Generated using markdown-toc

Design and concepts
Installation and configuration
Command line usage
Logging
API
Troubleshooting
Python bindings

Design and concepts

Valve reads in the contents of user-defined data tables and represents them using a spreadsheet-like structure incorporating annotated cells that may subsequently be edited and saved. The annotations are of two kinds:

Data validation messages indicating that the value of a particular cell violates one or more of a number of user-defined data validation rules.
Data update messages indicating that the value of a particular cell has changed as well as the details of the change.

After reading and optionally validating the initial data, the validated information is loaded into a database that can then be queried directly using your favourite SQL client (e.g., sqlite3 or psql) as in the examples below. The currently supported databases are SQLite and PostgreSQL. Valve further provides an Application-programmer interface (API) that can be used to incorporate Valve's data validation and manipulation functionality and to visualize the validated data within your own rust project. For an example of a rust project that uses Valve, see ontodev/nanobot.rs. Finally, Valve also provides a Command line interface (CLI) for some of its more important functions like load, save, and guess.

The Valve database

Data tables and views

Below is an example of one of the views Valve provides into a data table in which three of the five rows contain invalid data, and one row has been edited since it was initially loaded. The output is from a psql session connected to a PostgreSQL database called 'valve_postgres'.

valve_postgres=> select * from table11_text_view

row_number	row_order	message	history	rank	name	employee_id
1	1000			Eastern division head	Denise Laviolette	denise_l
2	2000			Southwest division head	Eddie Cahill	eddie_c
3	3000	`[{"column":"employee_id","value":"denise_l","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}]`	`[[{"column":"rank","level":"update","message":"Value changed from 'intern' to 'Chief executive officer'","old_value":"intern","value":"Chief executive officer"}]]`	Chief executive officer	Denise Laurier	denise_l	c	f
4	4000	`[{"column":"employee_id","value":"eddie_c","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}]`		Development team lead	Edward Coulombe	eddie_c
5	5000	`[{"column":"employee_id","value":"eddie_c","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}]`		Test team lead	Ed Crantivera	eddie_c

(5 rows)

Here, row_number is a fixed identifier assigned to each row at the time of creation, while row_order, which in principle might change multiple times throughout the lifetime of a given row, is used to manipulate the logical order of rows in the table. The message column, in this example, conveys that the value 'denise_l' of the column "employee_id" in row 3, and the value 'eddie_c' of the same column in rows 4 and 5, are duplicates of already existing values of that column, which happens to be a primary key, and are therefore invalid. We can also see, from the history column, that there has been one change, in which one of the values in row 3, namely the value of the column "rank", has been changed from 'intern' to 'Chief executive officer'. Finally, the names of the columns to the right of the history column correspond to the column names of the source table and will therefore vary from table to table. Normally these column names are specified in the header of a '.tsv' file from which the source data is read, though see below for alternate input data formats and associated table options. In any case the data contained in the columns to the right of the history column will exactly match the contents of the source table unless the data has been edited since it was initially loaded.

For the example below we will assume that a file named artists.tsv exists on your hard disk in your current working directory with the following contents:

name	type	number_of_members	health_insurance_provider	health_insurance_id	health_insurance_id_suffix
The Jimi Hendrix Experience	band	3	Pittsfield Medical	9834564422
Cream	band	3	Blue Cross	9388883311	XX54
Jennifer Lopez	solo		Medi-Assist	MA67886666881
Janice Joplin	solo		Pittsfield Medical	FFFHYZDFJ432
Chrissie Hynde	solo		Blue Cross	4422393877
Van Halen	band	5	Blue Cross	9476587117	BBDC
Van Morrison	solo		Medi-Assist	MA67920004571
Paul McCartney	solo		Medi-Assisr	MA60768763987
The Band	band	five	Blue Cross	0831133887
Bob Dylan	solo		Pittsfield Medical	FFF GYU ZKJ 954
Van Halen	band		Blue Cr.	9476587117

We will also assume that a file named providers.tsv exists on your hard disk in your current working directory with the following contents:

name	address
Blue Cross	123 Fake Street, Fake Town, USA, 55123
Medi-Assist	933 Phoney Boulevard, Accra, Ghana, GA008
Pittsfield Medical	510 North Street, Pittsfield, MA, 01201

In order for Valve to read this table it must first be configured to do so. This is done using a number of special data tables called configuration tables, usually represented using further '.tsv' files, that contain information about:

Where to find the data tables and other general properties of each managed data table (i.e., the table table).
The datatypes represented in the various data tables (i.e., the datatype table).
Information about the columns of each data table: their associated datatypes, data dependencies between the values in one table column and some other one, and any other constraints or restrictions on individual columns (i.e., the column table).
Rules constraining the joint values of two different cells in a single given row of data (i.e., the rule table).

For our example we will assume that Valve's configuration tables contain the following entries:

Table table

table	path	description	type	options
artists	artists.tsv
providers	providers.tsv

Column table

table	column	nulltype	datatype	structure
artists	name		trimmed_line	primary
artists	type		trimmed_line
artists	number_of_members	empty	integer
artists	health_insurance_provider		trimmed_line	from(providers.name)
artists	health_insurance_id		nonspace
artists	health_insurance_id_suffix	empty	word
providers	name		trimmed_line	primary
providers	address		text

Datatype table

datatype	parent	condition	sql_type
integer	nonspace	match(/-?\d+/)	INTEGER
nonspace	trimmed_line	exclude(/\s/)
trimmed_line	line	match(/\S([^\n]\S)/)
line	text	exclude(/\n/)
word	nonspace	exclude(/\W/)
empty	text	equals('')
text			TEXT

Rule table

table	when_column	when_condition	then_column	then_condition	level	description
artists	type	equals(band)	number_of_members	not null	error	a band must specify the number of members
artists	health_insurance_provider	equals('Blue Cross')	health_insurance_id_suffix	not null	error	a health insurance id suffix is required for Blue Cross members
artists	health_insurance_provider	equals('Pittsfield Medical')	health_insurance_id	word	error	a Pittsfield Medical ID must consist in a single word

For the meanings of all of the columns in the configuration tables above, see the section on configuration. In the rest of this section we'll refrain from explaining the meaning of a particular configuration table column unless and until it becomes relevant to our example. What is relevant at this point is only that each configuration table is also a data table whose contents are themselves subject to validation by Valve. In other words Valve will not necessarily fail to run if there are errors in its configuration (as long as those errors aren't critical) and it can moreover help to identify what those errors are. The upshot is that almost everything I mention below regarding artists also applies to the special configuration tables table, column, datatype and rule unless otherwise noted.

Once it has been read in by Valve from its source file, a given logical table will be represented, in the database, by between one and two database tables and by as many as two database views. In our example, the source data, contained in the file 'artists.tsv', represents (according to the table table configuration) a normal data table with the default options set, which means that all four database tables and views will be created. These are:

artists: The database table, having the same name as the logical table, that normally contains the bulk of the table's data.
artists_conflict (only when the conflict option is set): Valve aims to represent user data whether or not that data is valid. When it is not valid, however, this presents an obstacle when it comes to representing it in a relational database like SQLite or PostgreSQL. For instance, if a table contains a primary key column, then data rows containing values for the column that already exist in the table should be marked as invalid, but still somehow represented by Valve. However the implied database constraint on the database table will prevent us from inserting duplicate values to a primary key column. A similar issue arises for unique and foreign key constraints. To get around the database limitations that are due to database constraints such as primary, unique, or foreign keys, Valve constructs a _conflict table that is identical to the normal version of the table, but that does not include those keys. In our example, any rows containing duplicate values of the primary key column will be inserted to the database table called artists_conflict, while the other rows will be inserted to artists.
artists_view (only when the conflict option is set): When Valve validates a row from the logical table, artists, the validation messages it generates are not stored in the database table called artists but instead in a special internal database table called message. Similarly, when Valve adds, updates, or deletes a row from artists, a record of the change is not stored in artists, but in a special internal database table called history. Because it can be convenient for information about the validity of a particular cell to be presented side by side with the value of the cell itself, Valve constructs a table called artists_view which combines information from artists and artists_conflict with the message and history tables.
artists_text_view (only when the conflict option is set): In addition to the restrictions associated with primary, unique, and foreign key constraints, the database will also not allow us to represent values that are invalid due to an incorrect SQL type, in particular when one attempts to insert a non-numeric string value to a numeric column of a database table. To get around this limitation, Valve constructs a database view called artists_text_view. Unlike artists_view, which is defined such that the datatype of each column in artists_view exactly matches the datatype of the corresponding datatype in artists, in artists_text_view all of the columns are cast to TEXT so that no SQL datatype errors can occur when representing the data.

After loading the data from artists.tsv into the database, these four tables and views will be found to have the following contents:

valve_postgres=> select * from artists;

row_number	row_order	name	type	number_of_members	health_insurance_provider	health_insurance_id	health_insurance_id_suffix
1	1000	The Jimi Hendrix Experience	band	3	Pittsfield Medical	9834564422
2	2000	Cream	band	3	Blue Cross	9388883311	XX54
3	3000	Jennifer Lopez	solo		Medi-Assist	MA67886666881
4	4000	Janice Joplin	solo		Pittsfield Medical	FFFHYZDFJ432
6	6000	Van Halen	band	5	Blue Cross	9476587117	BBDC
7	7000	Van Morrison	solo		Medi-Assist	MA67920004571

(6 rows)

valve_postgres=> select * from artists_conflict

row_number	row_order	name	type	number_of_members	health_insurance_provider	health_insurance_id
5	5000	Chrissie Hynde	solo		Blue Cross	4422393877
8	8000	Paul McCartney	solo		Medi-Assisr	MA60768763987
9	9000	The Band	band		Blue Cross	0831133887
10	10000	Bob Dylan	solo		Pittsfield Medical	FFF GYU ZKJ 954
11	11000	Van Halen	band	5	Pittsfield Med.	9476587117

(5 rows)

valve_postgres=> select * from artists_view order by row_number;

row_number	row_order	name	type	number_of_members	health_insurance_provider	health_insurance_id	health_insurance_id_suffix	message
1	1000	The Jimi Hendrix Experience	band	3	Pittsfield Medical	9834564422
2	2000	Cream	band	3	Blue Cross	9388883311	XX54
3	3000	Jennifer Lopez	solo		Medi-Assist	MA67886666881
4	4000	Janice Joplin	solo		Pittsfield Medical	FFFHYZDFJ432
5	5000	Chrissie Hynde	solo		Blue Cross	4422393877		[{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"}]
6	6000	Van Halen	band	5	Blue Cross	9476587117	BBDC
7	7000	Van Morrison	solo		Medi-Assist	MA67920004571
8	8000	Paul McCartney	solo		Medi-Assisr	MA60768763987		[{"column":"health_insurance_provider","value":"Medi-Assisr","level":"error","rule":"key:foreign","message":"Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name"}]
9	9000	The Band	band		Blue Cross	0831133887		[{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"},{"column":"number_of_members","value":"five","level":"error","rule":"datatype:integer","message":"number_of_members should be a positive or negative integer"}]
10	10000	Bob Dylan	solo		Pittsfield Medical	FFF GYU ZKJ 954		[{"column":"health_insurance_id","value":"FFF GYU ZKJ 954 ","level":"error","rule":"datatype:nonspace","message":"health_insurance_id should be text without whitespace"},{"column":"health_insurance_provider","value":"Pittsfield Medical","level":"error","rule":"rule:health_insurance_provider-2","message":"a Pittsfield Medical health insurance id must be a single word"}]
11	11000	Van Halen	band	5	Pittsfield Med.	9476587117		[{"column":"health_insurance_provider","value":"Pittsfield Med.","level":"error","rule":"key:foreign","message":"Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name"},{"column":"name","value":"Van Halen","level":"error","rule":"key:primary","message":"Values of name must be unique"}]

(11 rows)

valve_postgres=> select * from artists_text_view order by row_number;

row_number	row_order	name	type	number_of_members	health_insurance_provider	health_insurance_id	health_insurance_id_suffix	message
1	1000	The Jimi Hendrix Experience	band	3	Pittsfield Medical	9834564422
2	2000	Cream	band	3	Blue Cross	9388883311	XX54
3	3000	Jennifer Lopez	solo		Medi-Assist	MA67886666881
4	4000	Janice Joplin	solo		Pittsfield Medical	FFFHYZDFJ432
5	5000	Chrissie Hynde	solo		Blue Cross	4422393877		[{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"}]
6	6000	Van Halen	band	5	Blue Cross	9476587117	BBDC
7	7000	Van Morrison	solo		Medi-Assist	MA67920004571
8	8000	Paul McCartney	solo		Medi-Assisr	MA60768763987		[{"column":"health_insurance_provider","value":"Medi-Assisr","level":"error","rule":"key:foreign","message":"Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name"}]
9	9000	The Band	band	five	Blue Cross	0831133887		[{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"},{"column":"number_of_members","value":"five","level":"error","rule":"datatype:integer","message":"number_of_members should be a positive or negative integer"}]
10	10000	Bob Dylan	solo		Pittsfield Medical	FFF GYU ZKJ 954		[{"column":"health_insurance_id","value":"FFF GYU ZKJ 954 ","level":"error","rule":"datatype:nonspace","message":"health_insurance_id should be text without whitespace"},{"column":"health_insurance_provider","value":"Pittsfield Medical","level":"error","rule":"rule:health_insurance_provider-2","message":"a Pittsfield Medical health insurance id must be a single word"}]
11	11000	Van Halen	band	5	Pittsfield Med.	9476587117		[{"column":"health_insurance_provider","value":"Pittsfield Med.","level":"error","rule":"key:foreign","message":"Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name"},{"column":"name","value":"Van Halen","level":"error","rule":"key:primary","message":"Values of name must be unique"}]

(11 rows)

The message and history tables

In the previous section we mentioned that artists_view and artists_text_view are defined in terms of a query that draws on data from artists and artists_conflict as well as from the special internal tables called message and history. The information in these tables can also be queried directly. After loading the data from artists.tsv into the database, one will find that the message table contains the following contents:

valve_postgres=> select * from message where "table" = 'artists';

message_id	table	row	column	value	level	rule	message
62	artists	5	health_insurance_provider	Blue Cross	error	rule:health_insurance_provider-1	a health insurance id suffix must be specified for Blue Cross members
63	artists	8	health_insurance_provider	Medi-Assisr	error	key:foreign	Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name
64	artists	9	number_of_members	five	error	datatype:integer	number_of_members should be a positive or negative integer
65	artists	9	health_insurance_provider	Blue Cross	error	rule:health_insurance_provider-1	a health insurance id suffix must be specified for Blue Cross members
66	artists	10	health_insurance_provider	Pittsfield Medical	error	rule:health_insurance_provider-2	a Pittsfield Medical health insurance id must be a single word
67	artists	10	health_insurance_id	FFF GYU ZKJ 954	error	datatype:nonspace	health_insurance_id should be text without whitespace
68	artists	11	name	Van Halen	error	key:primary	Values of name must be unique
69	artists	11	health_insurance_provider	Pittsfield Med.	error	key:foreign	Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name

(8 rows)

Here,

message_id is a unique identifier for this message assigned when it is created.
table is the table that this message is associated with.
row is the row_number of the table that this message is associated with.
column is the column in the row of the table that this message is associated with.
value is what the message is about.
level is the severity of the message.
rule identifies the rule that, when applied to the given value, resulted in the message.
message is the text of the message.

The history table will be empty immediately after the initial loading of the database.

valve_postgres=> select * from history where "table" = 'artists';

history_id	table	row	from	to	summary	user	undone_by	timestamp

(0 rows)

Here,

history_id is a unique identifier for this history record that is assigned when it is created.
table is the table that this record is associated with.
row is the row number of the table that this record is associated with.
from is a rich JSON representation of the row as it was before the change (see the section on data validation).
to is a rich JSON representation of the row after the change (see the section on data validation).
user is the name of the user that initiated the change, or who undid or redid the change (if applicable).
undone_by is the name of the user that undid the change (if applicable).
timestamp records the time of the change, or the time that it was undone or redone (if applicable).

summary is a JSON representation of a summary of the change in the form of an array of records corresponding to each column that has changed. e.g.,

    [
        {
            "column":"name",
            "level":"update",
            "message":"Value changed from 'Prince' to 'The artist formerly known as Prince'",
            "old_value":"Prince",
            "value":"The artist formerly known as Prince"
        },
        {
            "column":"health_insurance_provider",
            "level":"update",
            "message":"Value changed from 'Blue Cross' to 'Medi-Assist'",
            "old_value":"Blue Cross",
            "value":"Medi-Assist"
        },
        {
            "column":"health_insurance_id",
            "level":"update",
            "message":"Value changed from '9933991133' to 'MA0001122344'",
            "old_value":"9933991133",
            "value":"MA0001122344"
        },
        {
            "column":"health_insurance_id_suffix",
            "level":"update",
            "message":"Value changed from 'GG77' to ''",
            "old_value":"GG77",
            "value":""
        }
    ]

Note that the summary column of the history table is where the information in the view columns artists_view.history and artists_text_view.history is taken from, which is why these view columns are always in the form of an array of arrays, i.e., an array of summary records for the given row.

Data validation and editing

Representing validated data

For the purposes of this section we will define a cell as a single value, that can either be valid or invalid, of one of the columns of a given data table. We will further define a row of data, from a given data table, to be an array of correlated cells, such that for every column in the table there is one and only one cell in the row corresponding to it. Once a given row of data has been inserted into the database and (optionally) validated, Valve represents it using a ValveRow struct (see the API reference) containing the following information:

The row's fixed, unique, identifier, or row_number.
A map from the names of the columns contained in the row, to the ValveCell struct associated with each, where the latter represents the results of running Valve's validation engine (see the validation process) on the cell associated with that column.

In particular, each ValveCell contains the following information:

The value of the cell.
Whether the value should be interpreted as a null value of the given column, and if so, the column's nulltype.
Whether or not the value is valid given the column's column configuration.
A list of ValveCellMessage structs representing the validation messages associated with this cell.

Each ValveCellMessage, in turn, contains the following information:

The level, or severity, of the message (e.g., "error", "warn", "info").
An alphanumeric identifier for the rule violation described by the message (see the section on rule violation identifiers).
The text of the message.

The following is a textual representation of an example of a ValveRow:

ValveRow {
    row_number: Some(
        11,
    ),
    contents: {
        "id": ValveCell {
            nulltype: None,
            value: String("BFO:0000027"),
            valid: true,
            messages: [],
        },
        "name": ValveCell {
            nulltype: None,
            value: String("Mike"),
            valid: true,
            messages: [],
        },
        "location": ValveCell {
            nulltype: None,
            value: String("baree"),
            valid: false,
            messages: [
                ValveCellMessage {
                    level: "error",
                    rule: "key:foreign",
                    message: "Value 'baree' of column location is not in cities.city_name",
                },
            ],
        },
        "preferred_seafood": ValveCell {
            nulltype: Some("empty"),
            value: String(""),
            valid: true,
            messages: [],
        },
    },
}

Rule violation identifiers

Valve uses the following to identify the rule that has been violated by a given cell value in the ValveCellMessage associated with the violation:

key:foreign: The column that the given value belongs to has a from() structure (see the column table) that references some column, F, in another table, but the given value is not in F.
key:primary: The column that the given value belongs to has a primary structure, and the given value already exists in the column.
key:unique: The column that the given value belongs to has a unique structure, and the given value already exists in the column.
option:unrecognized (table table only): The list of options specified in the options column of the table table contains an unrecognized option.
option:redundant: (table table only): The list of options specified in the options column of the table table contains an option that is already implied by one of the other options.
option:reserved: (table table only): The list of options specified in the options column of the table table contains an option keyword that is reserved for internal use.
option:overrides: (table table only): The list of options specified in the options column of the table table contains an option that overrides one of the other options.
tree:foreign: The column that the given value belongs to has a tree() structure that references some other column, T, of the same table; but the given value is not in T.
datatype:DATATYPE: The column that the given value belongs to has the datatype, DATATYPE, but applying DATATYPE's associated condition to the given value results in a failure.
rule:COLUMN-N: The given value of COLUMN causes the Nth rule in the rule table whose when_column is COLUMN to be violated.

The validation process

Validating a row of data

For a given row of data in a given data table, Valve's validation process consists in a series of checks that are schematically represented in the flowchart below and subsequently explained in more detail.

flowchart TD
    node1["1. Determine the nulltype (if any) of each cell in the row"]
    node1 -- "Then, for each cell:" --> node2
    node2["2. Validate rules"]
    node2 --> modal1
    modal1{"Does the cell have a nulltype?"}
    modal1 -- No --> node3
    modal1 -- Yes, skip further validation for this cell --> modal3
    node3["3. Validate datatype"]
    node3 --> modal2
    modal2{"Does the cell value contain a SQL type error?"}
    modal2 -- No --> node4
    modal2 -- Yes --> modal3
    node4["4. Validate foreign constraints"]
    node4 --> node5
    node5["5. Validate primary and unique Constraints"]
    node5 --> modal3
    modal3{"Have we iterated over all of the cells in the row?"}
    modal3 -- Yes, then over the table as a whole: --> node6
    modal3 -- No, go on to the next cell --> node2
    node6["6. Validate tree-foreign keys"]

Determining the nulltype of a cell

The validation process begins by determining, for each cell in the row, whether the value of that cell matches the nulltype (if any) of its associated column, as defined in the column table. In particular, if the value of the cell matches the nulltype of its associated column, then the nulltype field of the ValveCell struct used to represent the cell will be set to indicate that the value is a null value of that type. Otherwise the nulltype field will remain unset, indicating that the value is not a null value. For instance, suppose that the cell value is '' (i.e., the empty string), and that the nulltype for its associated column, as defined in the column table, is empty. Since empty's associated condition, as defined in the datatype table is equals(''), applying it to the cell value will result in a match, and Valve will set the nulltype field for the ValveCell representing this particular cell to empty. In the case where the value of the cell does not match the condition associated with the datatype, empty, (i.e., when the cell value is something other than an empty string), the validation process will leave the nulltype field of the ValveCell unset.

Validating rules

This step of the validation process determines whether any of the rules in the rule table that are applicable to a cell have been violated. A rule in the rule table is applicable to a cell when the cell's associated column is the same as the when_column associated with the rule. Note that since the rules in the rule table correspond to if-then conditionals, such that the antecedent and consequent of a given conditional refer (in general) to two distinct columns, a rule violation may indicate that there is a problem with the value of either or both. Whenever a rule violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell, identifying the particular violation that occurred (see the section on rule violation IDs) and its associated level and description as found in the rule table.

Validating datatypes

This step of the validation process determines whether a cell's value violates the datatype condition, as defined in the datatype table, for the datatype associated with the cell's column in the column table. When a datatype violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell, identifying the particular datatype violation that occurred (see the section on rule violation IDs). The text of the message is taken from the description of the datatype that has been violated (see the datatype table).

Validating foreign constraints

This step in the validation process verifies, for a given cell, that if the cell's associated column, C, has been configured with a structure of the form from(T, F) (see the column-table), where T is a foreign table and F is a column in T, then the cell's value (or values if C's datatype is a list datatype) is (are) among the values of F. Note that if the foreign table has a _conflict version (see the table table), then this function will distinguish between (a) the case in which a given value is not found in either the foreign table or its associated conflict table, and (b) the case in which a given value is found only in the conflict table. When a foreign constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the identifier key:foreign (see also the section on rule violation IDs). The text of the message will be of the form: Value 'V' of column C is not in T.F, whenever V is neither found in the normal version of the table nor (if applicable) its conflict version, and it will be of the form Value 'V' of column C exists only in T_conflict.F whenever V exists in T_conflict but not in T.

Validating primary and unique constraints

This step in the validation process verifies, for a given cell, that if the cell's associated column has been configured with either a primary or a unique constraint (see the column-table), then the cell's value is not among the values of the column that have already been inserted into the table, neither in the normal version of the table nor in the conflict version (in the case where the conflict option has been set). When a primary or unique constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the appropriate rule violation ID. The text of the message is of the form Values of COLUMN must be unique.

Validating tree-foreign keys

When a column, column2 has a structure tree(column1) defined on it in the column table, then all non-null values of column2 must exist in column1. This step in the validation process function verifies, for a given table, whether any of the values of any of the cells in any of the rows violate any of the table's tree() conditions. When a primary or unique constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the appropriate rule violation ID. The text of the message is of the form Value 'VAL' of column COLUMN2 is not in COLUMN1.

Batch validation

The algorithm described in the previous section is applicable to a single row of data. When initially loading the many rows of a data table from its source into the database, however, a number of optimizations may be used to speed up the validation process. In particular, some of the steps for validating a row's cells from the previous section do not need to refer to any data external to a given row, and may therefore be performed in parallel. These are steps 1–3, i.e., the nulltype, rules, and datatype validation steps. Only the remaining steps: foreign and unique/primary constraint validation, need be performed row by row.

Editing the data

Once a data table has been loaded into the database and validated, it may be desirable to further edit the data, for instance to resolve any outstanding validation messages, or for some other reason. The possible operations that can be performed on the data are insert_row(), update_row(), delete_row(), and move_row(). In addition the function validate_row() is used to generate validation information about the row during the editing process.

Note that insert_row(), update_row(), and validate_row() require that a row be specified in the following "simple" form:

{
    "column_1": value1,
    "column_2": value2,
    ...
},

The output of these three functions is a ValveRow (see representing validated data).

After editing the data, a record of the change will be inserted to the history table. The ValveRow representing the row as it was before the change (in the case of an update, delete, or move operation), and the ValveRow representing it as it is after the change (in the case of an update, move, or insert), will be used as the values of the from and to columns, respectively. Note that to represent a ValveRow in the database, "rich" JSON format is used:

{
    "column_1": {
        "valid": <true|false>,
        "messages": [{"level": level, "rule": rule, "message": message}, ...],
        "value": value1
    },
    "column_2": {
        "valid": <true|false>,
        "messages": [{"level": level, "rule": rule, "message": message}, ...],
        "value": value2
    },
    ...
},

Valve also provides the undo() and redo() functions, to undo the last insert, update, or delete operation, and to redo the last undo, respectively. For more information on the data manipulation and browsing operations provided by Valve see the section on the Valve API.

Saving data tables

To save a data table or tables to a '.tsv' file, one uses the API functions, save_all_tables(), save_tables(), or save_table(). Note that unless the save option has been set, it is not possible to overwrite the source file from which the table was originally loaded. One must save the table to an alternate location. For more information on saving data tables Valve see the section on the Valve API.

Installation and configuration

Prerequisites

Building Valve from source requires installing a rust development environment.
Valve also requires database management software to be installed. The currently supported databases are SQLite and PostgreSQL. In the latter case the database server must be up and running and ready to accept connections. Note that although the database may be queried directly by using your favourite SQL client (e.g., sqlite3 or psql), one advantage to viewing and manipulating the data through Valve is that syntax and other differences between the two database management systems become transparent to the user.

Installation

Binary installation

Option 1: Build and install the binary for the latest release using crates.io by running:

cargo install ontodev_valve

Option 2: Download the appropriate binary for your system from the release page and copy it to your system's executable path.

Source installation

It is also possible to build the binary yourself from the source code. This is required if you would like to work with a version of Valve that is newer than the latest release. Begin by using git clone to clone the source code repository into a local folder:

git clone git@github.com:ontodev/valve.rs.git

Next, use cargo to build the ontodev_valve binary (this may take awhile to complete and will produce a lot of output):

cargo build --release

After the build has completed successfully, the ontodev_valve binary may be found (relative to the repository's root directory) in the target/release/ subdirectory. You should copy or move it from here to your user or system-wide bin/ directory, or to some other convenient location. For example,

cp target/release/ontodev_valve ~/bin/

Finally, run

ontodev_valve --help

to verify that Valve was installed correctly.

Configuration

Valve is configured primarily using a number of special configuration tables that can be represented as '.tsv' files. The most important of these is the table called 'table', also known as the table table. A table table configuration is required to use Valve. A column table and datatype table configuration are required as well. Optionally, the user may also specify a rule table configuration.

The table table is alone among the configuration tables in that it cannot be given an arbitrary name but must always be given the name 'table'. This is not the case for the column, datatype and rules tables. Although it is recommended to use the names 'column, 'datatype', and 'rule', respectively, alternate names may be chosen for these tables as explained below.

The table table

One normally configures the table table using a '.tsv' file called 'table.tsv', although any filename could be used in principle, and in fact it is also possible to read the table table directly from the database, as long as the database already contains a table called 'table' with the right setup.

Below is an example of a table table:

table	path	type	options
table	schema/table.tsv	table
column	schema/column.tsv	column
datatype	schema/datatype.tsv	datatype
rule	schema/rule.tsv	rule
user_table1	schema/user/user_table1.tsv
user_readonly_table1	schema/user/user_readonly_table1.tsv		no-edit no-save no-conflict
user_view1	schema/user/user_view1.sql		db_view
user_view2	schema/user/user_view2.sh		db_view
user_view3			db_view

Note that in the first row above the table being described is the table table itself. In general the columns of the table table have the following significance:

table: the name of the table.
path: where to find information about the contents of the table (see below). Note that the path of a special configuration tables must be a '.tsv' file.
description: An optional description of the contents and/or the purpose of the table.
type: Valve recognizes four special configuration table types that can be specified using the type column of the table table. These are the table, column, datatype, and rule table types. Data tables (e.g., the 'user_*' tables in the above example) should not explicitly specify a type, and in general if a type other than the ones just mentioned is specified, Valve will exit with an "Unrecognized table type" error.
options (optional column): Allows the user to specify a number of further options for the table (see below).

Further information on path

The path column indicates where the data for a given table may be found. It can be (a) a '.tsv' file, (b) a '.sql' file, (c) some other executable file, or (d) it may be empty. In each case it will have the following consequences for the possible values of type and and for the possible options that may be used with the table.

If path ends in '.tsv':
- Its associated type may be any one of table, column, datatype, rule, or it may be empty.
- The db_view option is not allowed. If set to true, Valve will fail with a "'.tsv' files are not supported for views" error.
If path does not end in '.tsv':
- Its associated type must be empty. Note that this implies that Valve's configuration tables must be specified using '.tsv' files.
- The edit option is not allowed. If set to true, Valve will fail with an "Editable tables require a path that ends in '.tsv'" error.
If path either ends in '.sql', or represents a generic executable:
- If the db_view option is set, Valve assumes that the '.sql' file or generic executable indicated contains the instructions necessary to create the view and that the view displays the appropriate data.
- If the db_view option is not set, Valve will take care of creating the table but it will expect that the '.sql' file or generic executable contains the statements necessary to load the data.
- If path ends in '.sql', Valve reads in the statements contained in the '.sql' file and then executes them against the database.
- If path represents a generic executable, then it is executed as a shell process. Note that the executable script or binary must accept two arguments indicating the location of the database and the name of the table that the path corresponds to.
If path is empty:
- If the db_view option is set, Valve assumes that the view already exists in the database.
- If the db_view option is not set, Valve will take care of creating the table but it will not attempt to load it.

The above is conveniently summarized in the following table:

path	possible types	possible options	created by	loaded by†
Ends in '.tsv'	any valid type, or empty	db_view not allowed	Valve	Valve
Ends in '.sql'	must be empty	edit not allowed	db_view: the '.sql' file, db_table: Valve	Valve
Generic executable	must be empty	edit not allowed	db_view: the executable, db_table: Valve	Valve
empty	must be empty	edit not allowed	No one (assumed to already exist)	No one (assumed to be already loaded)

† Note that loading is only applicable when the db_view option has not been set.

Further information on options

If no options are specified, the options db_table, truncate, load, save, edit, validate_on_load, and conflict will all be set to true by default. The complete list of allowable options, and their meanings, are given below:

db_table: The table will be represented in the database by a regular table. Note that if the edit option has also been set, then path must be non-empty and it can be one of (a) a file ending (case insensitively) with '.tsv' explicitly specifying the contents of the table, (b) a file ending (case insensitively) with '.sql', or (c) an executable file. In the case of (a), Valve will take care of validating and loading the data from the '.tsv' file. In cases (b) and (c), Valve uses the '.sql' file or the executable to load the data, and performs no validation other than to verify that this has been done without error. If path is empty, then Valve expects that the table has already been created and loaded by the user and will fail otherwise. Note that the db_table option is not compatible with the db_view option.
db_view: The table will be represented in the database by a view. A view is allowed to have an empty path, in which case valve simply verifies that the view exists but does not create it. When path is non-empty then it must either end (case-insensitively) in '.sql' or be an executable file. Note that the path for a view must never end (case-insensitively) in '.tsv'. When path is not-empty, valve relies on the '.sql' file or the executable to create the view, otherwise it expects the view to have already been created by the user. Regardless of the value of path, Valve performs no validation of the data other than to verify that the view exists in the database. Note that the db_view option is not compatible with any of db_table, truncate, load, conflict, save, edit, or validate_on_load.
load: The table may be loaded with data. Note that this option is not compatible with db_view.
truncate: The table should be truncated before it is loaded. Note that this option is not compatible with db_view.
conflict: When reading the configuration for a table called 'T', then if 'T' has the conflict option set, Valve should create, in addition to 'T', a database table called 'T_conflict'. The purpose of this table is to store invalid data that, due to the violation of a database constraint, cannot be stored in 'T' (see Design and concepts). Note that the conflict option is not compatible with db_view.
internal: This option is reserved for internal use and is not allowed to be specified by the user.
validate_on_load: When set, Valve will validate a table's rows before loading them to the database. Note that this option is not compatible with db_view.
edit: When set, this option indicates that it is allowed to edit the table after it has been initially loaded. Note that this option is not compatible with db_view.
save: When set, this indicates that overwriting the table's source file, as indicated by path (which must be a '.tsv' file) is allowed. If the save option is not set, the table may nevertheless be saved as long as it is to a different path. Note that the save option is not compatible with db_view.
no-conflict: Sets the conflict option (which is set to true by default unless db_view is true) to false.
no-validate_on_load: Sets the validate_on_load option (which is set to true by default unless db_view is true) to false.
no-edit: Sets the edit option (which is set to true by default unless db_view is true) to false.
no-save: Sets the save option (which is set to true by default unless db_view is true) to false.

Commonly used path and option combinations

Here are some examples of commonly used table table configurations:

table	path	description	type	options
plain	src/data/plain.tsv	A database table with default options set, backed by a '.tsv' file
appendonly	src/data/appendonly.tsv	A database table which is appended to but never truncated		no-truncate
readonly1	src/data/readonly1.tsv	A read-only table that is backed by a '.tsv' file		no-edit no-save
readonly2	src/data/readonly2.sh	A read-only table whose contents are assumed to be valid, backed by a generic executable		no-edit no-conflict no-validate-on-load
view1	src/data/view1.sql	A view backed by a '.sql' file		db_view
table	src/table.tsv	the table table	table
column	src/column.tsv	the column table	column
datatype	src/datatype.tsv	the datatype table	datatype
rule	src/rule.tsv	the rule table	rule

The column table

In addition to a table table, Valve also requires a column table. The column table configuration is normally stored in a file called 'column.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'column' in the table table. The column table contains one row for every column of every configured table. This includes both special configuration tables (such as the column table itself) and user-defined tables.

Below is an example column table, with the special configuration tables omitted:

table	column	label	nulltype	default	datatype	structure
table1	column1	Column 1		value1
table1	column2	Column 2	empty		integer	from(table2.column2)
table2	column1	Column 1			trimmed_line	primary
table2	column2	Column 2			integer	unique
table3	column1	Column 1			word	primary
table3	column2	Column 2	empty		word	tree(column1)

The columns of the column table have the following significance:

table: The name of the table to which the column belongs
column: The name of the column
label: If not empty, then instead of column, use label as the header for the column when saving the table.
nulltype: The datatype, defined in the datatype table, used to represent a null value in the column. For instance the datatype 'empty' in the example above is defined to match the string '' (see the example from the section on the datatype table). If a column has a nulltype, then values of the column that match the nulltype's associated datatype are considered to be valid values for the column. If a column has no nulltype, this means that a null value (by default, an empty string) for the column is considered to be an invalid value.
default (optional column): The default value to use for the column when inserting a row of data to the database. Note that in the database this implies that a DEFAULT constraint will be declared for the column.
datatype: The column's datatype, which must be one of the valid datatypes defined in the the datatype table
structure: Valve recognises the following four structural constraints on columns:
- primary: The column is the primary key for the table to which it belongs; values must therefore be unique. Note that in the database this implies that a PRIMARY KEY constraint will be declared for the column.
- unique: The column's values must be unique. Note that in the database this implies that a UNIQUE constraint will be declared for the column.
- from(foreign_table.foreign_column): All non-null values of the column must exist in the column foreign_column of the table foreign_table. Note that in the database this implies that a FOREIGN KEY constraint will be declared for the column, unless the column's datatype is a list datatype (see the datatype table), and it also implies that a UNIQUE constraint will be declared for foreign_table.foreign_column, unless a unique structure has already been declared for that column in the column table.
- tree(column_name): All non-null values of the column must exist in the column column_name of the same table
description: A description of the contents and/or the purpose of the column.

The datatype table

In addition to the table table and the column table, Valve also requires the user to configure a datatype table. The datatype table configuration is normally stored in a file called 'datatype.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'datatype' in the table table. The datatype table stores the definitions of the datatypes referred to in the datatype column of the column table.

Below is a subset of the rows of an example datatype table:

datatype	parent	condition	description	sql_type	HTML type	format
text			any text	TEXT	textarea
integer	text	match(/-?\d+/)	a positive or negative decimal digit, or 0	INTEGER		%i
word	text	exclude(/\W/)	a single word: letters, numbers, underscore			%s
empty	text	equals('')	the empty string	NULL
custom1	text	match(/\S+:\S+/)	two nonspace character sequences separated by ':'
custom2	text	in(alice, bob, cindy)	either 'alice', 'bob', or 'cindy'
custom3	text	list(word, ' ')	a list of words separated by spaces
custom4	text	search(/\d+/)	a string containing a sequence of digits

The columns of the datatype table have the following significance:

datatype: The name of the datatype
parent: The more generic datatype, if any, that this datatype is a special case of. When the value of a given column violates its datatype condition, Valve will move up the datatype hierarchy to determine whether its ancestors' datatype conditions have also been violated, and if so, Valve will add validation error messages corresponding to these further violations to the error messages it assigns to that value in that column.
condition: The logical condition used to validate whether a given data value conforms to the datatype (see condition types below)
description: A description of the datatype and/or its purpose.
sql_type: The SQL type to use for columns that have the given datatype in the database. If empty, the SQL type of the nearest ancestor for which a SQL type has been defined will be used.
HTML type (optional column): The HTML type corresponding to the datatype.
format (optional column): The sprintf-style format string to apply to values of the datatype when saving them.

Condition types

match(/REGEX/): Violated if a given value does not match REGEX.
exclude(/REGEX/): Violated if a given value contains an instance of REGEX.
search(/REGEX/): Violated if a given value does not contain an instance of REGEX.
equals(VAL): Violated if a given value is not equal to VAL.
in(VAL1, ...): Violated if a given value is not one of the values in the list: VAL1, ...
list(ITEM_DATATYPE, SEPARATOR): Violated if a given value is not in the form of a sequence of items, each of datatype ITEM_DATATYPE, separated by the string SEPARATOR. Otherwise the condition is violated if any of the items in the given list fail to conform to ITEM_DATATYPE.

Required datatypes

Valve requires that the following datatypes be defined:

text, empty, line, trimmed_line, nonspace, word

The recommended datatype configurations for these four datatypes are the following (note that the HTML type and format columns are optional):

datatype	parent	condition	sql_type	HTML type	format
text			TEXT	textarea
empty	text	equals('')	NULL
line	text	exclude(/\n/)		input
trimmed_line	line	match(/\S([^\n]\S)/)
nonspace	trimmed_line	exclude(/\s/)
word	nonspace	exclude(/\W/)			%s

The rule table

In addition to the table table, the column table, and the datatype table, it is also possible (but optional) to configure a table of type 'rule', or a rule table. When it is configured, the rule table configuration is normally stored in a file called 'rule.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'rule' in the table table.

The rule table is used to define a number of rules of the following form:

when CONDITION_ON_COLUMN_1 is satisfied then CONDITION_ON_COLUMN_2 must also be satisfied (1)

where CONDITION_ON_COLUMN_1 and CONDITION_ON_COLUMN_2 are defined with respect to the same table.

Below is an example rule table:

table	when column	when condition	then column	then condition	level	description
table1	foo	null	bar	not null	error
table2	foo	negative_int	bar	positive_int	error

The columns of the rule table have the following significance:

table: The name of the table to which the rule is applicable.
when column: The column that the when condition will be checked against.
when condition: The condition to apply to values of when column. This can either be one of the recognized condition types, or it can be the name of a datatype in which case the condition corresponding to the datatype will be used.
then column: The column that the then condition will be checked against.
then condition: The condition to apply to values of then column whenever the when condition has been satisfied for when column. This can either be one of the recognized condition types, or it can be the name of a datatype in which case the condition corresponding to the datatype will be used.
level: The severity of the violation
description: A description of the rule and/or its purpose.

Using guess

In some cases it is useful to be able to try and guess what the table table and column table configuration should be, using information about the current state of the Valve instance, for a given data table not currently managed by Valve. To do this one may use Valve's command line interface to run the guess subcommand as follows:

ontodev_valve guess [OPTIONS] SOURCE DATABASE TABLE_TSV

where:

SOURCE is the location of the '.tsv' representing the table table.
DATABASE is the path to a PostgreSQL or SQLite database.
TABLE_TSV is the '.tsv' file representing the data table whose column configuration is to be guessed.

For the list of possible options, and for general information on Valve's command line interface, see command line usage.

Below is an example of using guess:

$ ontodev_valve guess test/guess_test_data/table.tsv build/valve_guess.db test/guess_test_data/ontology/table2.tsv 

The following row will be inserted to "table":
table                                     path                                      type                                      description                               
table2                                    test/guess_test_data/ontology/table2.tsv                                                                                      

The following row will be inserted to "column":
table               column              label               nulltype            datatype            structure           description         
table2              zork                                                        integer             primary                                 
table2              zindy                                                       IRI                 unique                                  
table2              xyzzy                                                       suffix              unique                                  
table2              foo                                                         prefix                                                      
table2              bar                                                         IRI                 unique                                  

Do you want to write this updated configuration to the database? [y/N] y

Command line usage

The basic syntax when calling Valve on the command line is:

ontodev_valve [OPTIONS] <SUBCOMMAND <SUBCOMMAND POSITIONAL PARAMETERS>>

To view the list of possible subcommands and global options, run:

ontodev_valve --help

To get help on a particular subcommand, and on the positional parameters and options that are specific to it, run:

ontodev_valve SUBCOMMAND --help

Logging

By default Valve only logs error messages. To also enable warning and information messages, set the environment variable RUST_LOG to the minimum logging level desired for ontodev_valve: debug, info, warn, or error. For instance:

export RUST_LOG="ontodev_valve=info"

For further information see the Rust Cookbook.

API

The API (application-programmer interface) reference documentation for the latest release of Valve can normally be found on docs.rs. In case you would like to generate the API documentation for a newer version of Valve, run:

cargo doc

in the the root folder of your local copy of the valve.rs source code repository (see Installation and configuration), and then open the file

<valve.rs root folder>/target/doc/ontodev_valve/index.html

in your favourite browser. For a higher level See also the Design and concepts section.

Troubleshooting

TODO.

Python bindings

See valve.py

License: BSD-3-Clause

ontodev / valve.rs

readme