Valve - A lightweight validation engine written in rust.
ToC Generated using markdown-toc
Valve reads in the contents of user-defined data tables and represents them using a spreadsheet-like structure incorporating annotated cells that may subsequently be edited and saved. The annotations are of two kinds:
After reading and optionally validating the initial data, the validated information is loaded into a database that can then be queried directly using your favourite SQL client (e.g., sqlite3 or psql) as in the examples below. The currently supported databases are SQLite and PostgreSQL. Valve further provides an Application-programmer interface (API) that can be used to incorporate Valve's data validation and manipulation functionality and to visualize the validated data within your own rust project. For an example of a rust project that uses Valve, see ontodev/nanobot.rs. Finally, Valve also provides a Command line interface (CLI) for some of its more important functions like load
, save
, and guess
.
Below is an example of one of the views Valve provides into a data table in which three of the five rows contain invalid data, and one row has been edited since it was initially loaded. The output is from a psql session connected to a PostgreSQL database called 'valve_postgres'.
valve_postgres=> select * from table11_text_view
row_number | row_order | message | history | rank | name | employee_id | ||
---|---|---|---|---|---|---|---|---|
1 | 1000 | Eastern division head | Denise Laviolette | denise_l | ||||
2 | 2000 | Southwest division head | Eddie Cahill | eddie_c | ||||
3 | 3000 | [{"column":"employee_id","value":"denise_l","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] |
[[{"column":"rank","level":"update","message":"Value changed from 'intern' to 'Chief executive officer'","old_value":"intern","value":"Chief executive officer"}]] |
Chief executive officer | Denise Laurier | denise_l | c | f |
4 | 4000 | [{"column":"employee_id","value":"eddie_c","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] |
Development team lead | Edward Coulombe | eddie_c | |||
5 | 5000 | [{"column":"employee_id","value":"eddie_c","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] |
Test team lead | Ed Crantivera | eddie_c |
(5 rows)
Here, row_number is a fixed identifier assigned to each row at the time of creation, while row_order, which in principle might change multiple times throughout the lifetime of a given row, is used to manipulate the logical order of rows in the table. The message column, in this example, conveys that the value 'denise_l' of the column "employee_id" in row 3, and the value 'eddie_c' of the same column in rows 4 and 5, are duplicates of already existing values of that column, which happens to be a primary key, and are therefore invalid. We can also see, from the history column, that there has been one change, in which one of the values in row 3, namely the value of the column "rank", has been changed from 'intern' to 'Chief executive officer'. Finally, the names of the columns to the right of the history column correspond to the column names of the source table and will therefore vary from table to table. Normally these column names are specified in the header of a '.tsv' file from which the source data is read, though see below for alternate input data formats and associated table options. In any case the data contained in the columns to the right of the history column will exactly match the contents of the source table unless the data has been edited since it was initially loaded.
For the example below we will assume that a file named artists.tsv
exists on your hard disk in your current working directory with the following contents:
name | type | number_of_members | health_insurance_provider | health_insurance_id | health_insurance_id_suffix |
---|---|---|---|---|---|
The Jimi Hendrix Experience | band | 3 | Pittsfield Medical | 9834564422 | |
Cream | band | 3 | Blue Cross | 9388883311 | XX54 |
Jennifer Lopez | solo | Medi-Assist | MA67886666881 | ||
Janice Joplin | solo | Pittsfield Medical | FFFHYZDFJ432 | ||
Chrissie Hynde | solo | Blue Cross | 4422393877 | ||
Van Halen | band | 5 | Blue Cross | 9476587117 | BBDC |
Van Morrison | solo | Medi-Assist | MA67920004571 | ||
Paul McCartney | solo | Medi-Assisr | MA60768763987 | ||
The Band | band | five | Blue Cross | 0831133887 | |
Bob Dylan | solo | Pittsfield Medical | FFF GYU ZKJ 954 | ||
Van Halen | band | Blue Cr. | 9476587117 |
We will also assume that a file named providers.tsv
exists on your hard disk in your current working directory with the following contents:
name | address |
---|---|
Blue Cross | 123 Fake Street, Fake Town, USA, 55123 |
Medi-Assist | 933 Phoney Boulevard, Accra, Ghana, GA008 |
Pittsfield Medical | 510 North Street, Pittsfield, MA, 01201 |
In order for Valve to read this table it must first be configured to do so. This is done using a number of special data tables called configuration tables, usually represented using further '.tsv' files, that contain information about:
For our example we will assume that Valve's configuration tables contain the following entries:
table | path | description | type | options |
---|---|---|---|---|
artists | artists.tsv | |||
providers | providers.tsv |
table | column | label | nulltype | default | datatype | structure | description |
---|---|---|---|---|---|---|---|
artists | name | trimmed_line | primary | ||||
artists | type | trimmed_line | |||||
artists | number_of_members | empty | integer | ||||
artists | health_insurance_provider | trimmed_line | from(providers.name) | ||||
artists | health_insurance_id | nonspace | |||||
artists | health_insurance_id_suffix | empty | word | ||||
providers | name | trimmed_line | primary | ||||
providers | address | text |
datatype | parent | condition | description | sql_type |
---|---|---|---|---|
integer | nonspace | match(/-?\d+/) | INTEGER | |
nonspace | trimmed_line | exclude(/\s/) | ||
trimmed_line | line | match(/\S([^\n]\S)/) | ||
line | text | exclude(/\n/) | ||
word | nonspace | exclude(/\W/) | ||
empty | text | equals('') | ||
text | TEXT |
table | when_column | when_condition | then_column | then_condition | level | description |
---|---|---|---|---|---|---|
artists | type | equals(band) | number_of_members | not null | error | a band must specify the number of members |
artists | health_insurance_provider | equals('Blue Cross') | health_insurance_id_suffix | not null | error | a health insurance id suffix is required for Blue Cross members |
artists | health_insurance_provider | equals('Pittsfield Medical') | health_insurance_id | word | error | a Pittsfield Medical ID must consist in a single word |
For the meanings of all of the columns in the configuration tables above, see the section on configuration. In the rest of this section we'll refrain from explaining the meaning of a particular configuration table column unless and until it becomes relevant to our example. What is relevant at this point is only that each configuration table is also a data table whose contents are themselves subject to validation by Valve. In other words Valve will not necessarily fail to run if there are errors in its configuration (as long as those errors aren't critical) and it can moreover help to identify what those errors are. The upshot is that almost everything I mention below regarding artists
also applies to the special configuration tables table
, column
, datatype
and rule
unless otherwise noted.
Once it has been read in by Valve from its source file, a given logical table will be represented, in the database, by between one and two database tables and by as many as two database views. In our example, the source data, contained in the file 'artists.tsv', represents (according to the table table configuration) a normal data table with the default options set, which means that all four database tables and views will be created. These are:
artists
: The database table, having the same name as the logical table, that normally contains the bulk of the table's data.artists_conflict
(only when the conflict option is set): Valve aims to represent user data whether or not that data is valid. When it is not valid, however, this presents an obstacle when it comes to representing it in a relational database like SQLite or PostgreSQL. For instance, if a table contains a primary key column, then data rows containing values for the column that already exist in the table should be marked as invalid, but still somehow represented by Valve. However the implied database constraint on the database table will prevent us from inserting duplicate values to a primary key column. A similar issue arises for unique and foreign key constraints. To get around the database limitations that are due to database constraints such as primary, unique, or foreign keys, Valve constructs a _conflict
table that is identical to the normal version of the table, but that does not include those keys. In our example, any rows containing duplicate values of the primary key column will be inserted to the database table called artists_conflict
, while the other rows will be inserted to artists
.artists_view
(only when the conflict option is set): When Valve validates a row from the logical table, artists
, the validation messages it generates are not stored in the database table called artists
but instead in a special internal database table called message
. Similarly, when Valve adds, updates, or deletes a row from artists
, a record of the change is not stored in artists
, but in a special internal database table called history
. Because it can be convenient for information about the validity of a particular cell to be presented side by side with the value of the cell itself, Valve constructs a table called artists_view
which combines information from artists
and artists_conflict
with the message
and history
tables.artists_text_view
(only when the conflict option is set): In addition to the restrictions associated with primary, unique, and foreign key constraints, the database will also not allow us to represent values that are invalid due to an incorrect SQL type, in particular when one attempts to insert a non-numeric string value to a numeric column of a database table. To get around this limitation, Valve constructs a database view called artists_text_view
. Unlike artists_view
, which is defined such that the datatype of each column in artists_view
exactly matches the datatype of the corresponding datatype in artists
, in artists_text_view
all of the columns are cast to TEXT so that no SQL datatype errors can occur when representing the data.After loading the data from artists.tsv
into the database, these four tables and views will be found to have the following contents:
valve_postgres=> select * from artists;
row_number | row_order | name | type | number_of_members | health_insurance_provider | health_insurance_id | health_insurance_id_suffix |
---|---|---|---|---|---|---|---|
1 | 1000 | The Jimi Hendrix Experience | band | 3 | Pittsfield Medical | 9834564422 | |
2 | 2000 | Cream | band | 3 | Blue Cross | 9388883311 | XX54 |
3 | 3000 | Jennifer Lopez | solo | Medi-Assist | MA67886666881 | ||
4 | 4000 | Janice Joplin | solo | Pittsfield Medical | FFFHYZDFJ432 | ||
6 | 6000 | Van Halen | band | 5 | Blue Cross | 9476587117 | BBDC |
7 | 7000 | Van Morrison | solo | Medi-Assist | MA67920004571 |
(6 rows)
valve_postgres=> select * from artists_conflict
row_number | row_order | name | type | number_of_members | health_insurance_provider | health_insurance_id | health_insurance_id_suffix |
---|---|---|---|---|---|---|---|
5 | 5000 | Chrissie Hynde | solo | Blue Cross | 4422393877 | ||
8 | 8000 | Paul McCartney | solo | Medi-Assisr | MA60768763987 | ||
9 | 9000 | The Band | band | Blue Cross | 0831133887 | ||
10 | 10000 | Bob Dylan | solo | Pittsfield Medical | FFF GYU ZKJ 954 | ||
11 | 11000 | Van Halen | band | 5 | Pittsfield Med. | 9476587117 |
(5 rows)
valve_postgres=> select * from artists_view order by row_number;
row_number | row_order | name | type | number_of_members | health_insurance_provider | health_insurance_id | health_insurance_id_suffix | message | history |
---|---|---|---|---|---|---|---|---|---|
1 | 1000 | The Jimi Hendrix Experience | band | 3 | Pittsfield Medical | 9834564422 | |||
2 | 2000 | Cream | band | 3 | Blue Cross | 9388883311 | XX54 | ||
3 | 3000 | Jennifer Lopez | solo | Medi-Assist | MA67886666881 | ||||
4 | 4000 | Janice Joplin | solo | Pittsfield Medical | FFFHYZDFJ432 | ||||
5 | 5000 | Chrissie Hynde | solo | Blue Cross | 4422393877 | [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"}] | |||
6 | 6000 | Van Halen | band | 5 | Blue Cross | 9476587117 | BBDC | ||
7 | 7000 | Van Morrison | solo | Medi-Assist | MA67920004571 | ||||
8 | 8000 | Paul McCartney | solo | Medi-Assisr | MA60768763987 | [{"column":"health_insurance_provider","value":"Medi-Assisr","level":"error","rule":"key:foreign","message":"Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name"}] | |||
9 | 9000 | The Band | band | Blue Cross | 0831133887 | [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"},{"column":"number_of_members","value":"five","level":"error","rule":"datatype:integer","message":"number_of_members should be a positive or negative integer"}] | |||
10 | 10000 | Bob Dylan | solo | Pittsfield Medical | FFF GYU ZKJ 954 | [{"column":"health_insurance_id","value":"FFF GYU ZKJ 954 ","level":"error","rule":"datatype:nonspace","message":"health_insurance_id should be text without whitespace"},{"column":"health_insurance_provider","value":"Pittsfield Medical","level":"error","rule":"rule:health_insurance_provider-2","message":"a Pittsfield Medical health insurance id must be a single word"}] | |||
11 | 11000 | Van Halen | band | 5 | Pittsfield Med. | 9476587117 | [{"column":"health_insurance_provider","value":"Pittsfield Med.","level":"error","rule":"key:foreign","message":"Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name"},{"column":"name","value":"Van Halen","level":"error","rule":"key:primary","message":"Values of name must be unique"}] |
(11 rows)
valve_postgres=> select * from artists_text_view order by row_number;
row_number | row_order | name | type | number_of_members | health_insurance_provider | health_insurance_id | health_insurance_id_suffix | message | history |
---|---|---|---|---|---|---|---|---|---|
1 | 1000 | The Jimi Hendrix Experience | band | 3 | Pittsfield Medical | 9834564422 | |||
2 | 2000 | Cream | band | 3 | Blue Cross | 9388883311 | XX54 | ||
3 | 3000 | Jennifer Lopez | solo | Medi-Assist | MA67886666881 | ||||
4 | 4000 | Janice Joplin | solo | Pittsfield Medical | FFFHYZDFJ432 | ||||
5 | 5000 | Chrissie Hynde | solo | Blue Cross | 4422393877 | [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"}] | |||
6 | 6000 | Van Halen | band | 5 | Blue Cross | 9476587117 | BBDC | ||
7 | 7000 | Van Morrison | solo | Medi-Assist | MA67920004571 | ||||
8 | 8000 | Paul McCartney | solo | Medi-Assisr | MA60768763987 | [{"column":"health_insurance_provider","value":"Medi-Assisr","level":"error","rule":"key:foreign","message":"Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name"}] | |||
9 | 9000 | The Band | band | five | Blue Cross | 0831133887 | [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"},{"column":"number_of_members","value":"five","level":"error","rule":"datatype:integer","message":"number_of_members should be a positive or negative integer"}] | ||
10 | 10000 | Bob Dylan | solo | Pittsfield Medical | FFF GYU ZKJ 954 | [{"column":"health_insurance_id","value":"FFF GYU ZKJ 954 ","level":"error","rule":"datatype:nonspace","message":"health_insurance_id should be text without whitespace"},{"column":"health_insurance_provider","value":"Pittsfield Medical","level":"error","rule":"rule:health_insurance_provider-2","message":"a Pittsfield Medical health insurance id must be a single word"}] | |||
11 | 11000 | Van Halen | band | 5 | Pittsfield Med. | 9476587117 | [{"column":"health_insurance_provider","value":"Pittsfield Med.","level":"error","rule":"key:foreign","message":"Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name"},{"column":"name","value":"Van Halen","level":"error","rule":"key:primary","message":"Values of name must be unique"}] |
(11 rows)
In the previous section we mentioned that artists_view
and artists_text_view
are defined in terms of a query that draws on data from artists
and artists_conflict
as well as from the special internal tables called message
and history
. The information in these tables can also be queried directly. After loading the data from artists.tsv
into the database, one will find that the message
table contains the following contents:
valve_postgres=> select * from message where "table" = 'artists';
message_id | table | row | column | value | level | rule | message |
---|---|---|---|---|---|---|---|
62 | artists | 5 | health_insurance_provider | Blue Cross | error | rule:health_insurance_provider-1 | a health insurance id suffix must be specified for Blue Cross members |
63 | artists | 8 | health_insurance_provider | Medi-Assisr | error | key:foreign | Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name |
64 | artists | 9 | number_of_members | five | error | datatype:integer | number_of_members should be a positive or negative integer |
65 | artists | 9 | health_insurance_provider | Blue Cross | error | rule:health_insurance_provider-1 | a health insurance id suffix must be specified for Blue Cross members |
66 | artists | 10 | health_insurance_provider | Pittsfield Medical | error | rule:health_insurance_provider-2 | a Pittsfield Medical health insurance id must be a single word |
67 | artists | 10 | health_insurance_id | FFF GYU ZKJ 954 | error | datatype:nonspace | health_insurance_id should be text without whitespace |
68 | artists | 11 | name | Van Halen | error | key:primary | Values of name must be unique |
69 | artists | 11 | health_insurance_provider | Pittsfield Med. | error | key:foreign | Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name |
(8 rows)
Here,
The history table will be empty immediately after the initial loading of the database.
valve_postgres=> select * from history where "table" = 'artists';
history_id | table | row | from | to | summary | user | undone_by | timestamp |
---|---|---|---|---|---|---|---|---|
(0 rows)
Here,
[
{
"column":"name",
"level":"update",
"message":"Value changed from 'Prince' to 'The artist formerly known as Prince'",
"old_value":"Prince",
"value":"The artist formerly known as Prince"
},
{
"column":"health_insurance_provider",
"level":"update",
"message":"Value changed from 'Blue Cross' to 'Medi-Assist'",
"old_value":"Blue Cross",
"value":"Medi-Assist"
},
{
"column":"health_insurance_id",
"level":"update",
"message":"Value changed from '9933991133' to 'MA0001122344'",
"old_value":"9933991133",
"value":"MA0001122344"
},
{
"column":"health_insurance_id_suffix",
"level":"update",
"message":"Value changed from 'GG77' to ''",
"old_value":"GG77",
"value":""
}
]
Note that the summary column of the history table is where the information in the view columns artists_view.history
and artists_text_view.history
is taken from, which is why these view columns are always in the form of an array of arrays, i.e., an array of summary records for the given row.
For the purposes of this section we will define a cell as a single value, that can either be valid or invalid, of one of the columns of a given data table. We will further define a row of data, from a given data table, to be an array of correlated cells, such that for every column in the table there is one and only one cell in the row corresponding to it. Once a given row of data has been inserted into the database and (optionally) validated, Valve represents it using a ValveRow
struct (see the API reference) containing the following information:
row_number
.ValveCell
struct associated with each, where the latter represents the results of running Valve's validation engine (see the validation process) on the cell associated with that column.In particular, each ValveCell
contains the following information:
ValveCellMessage
structs representing the validation messages associated with this cell.Each ValveCellMessage
, in turn, contains the following information:
The following is a textual representation of an example of a ValveRow
:
ValveRow {
row_number: Some(
11,
),
contents: {
"id": ValveCell {
nulltype: None,
value: String("BFO:0000027"),
valid: true,
messages: [],
},
"name": ValveCell {
nulltype: None,
value: String("Mike"),
valid: true,
messages: [],
},
"location": ValveCell {
nulltype: None,
value: String("baree"),
valid: false,
messages: [
ValveCellMessage {
level: "error",
rule: "key:foreign",
message: "Value 'baree' of column location is not in cities.city_name",
},
],
},
"preferred_seafood": ValveCell {
nulltype: Some("empty"),
value: String(""),
valid: true,
messages: [],
},
},
}
Valve uses the following to identify the rule that has been violated by a given cell value in the ValveCellMessage
associated with the violation:
from()
structure (see the column table) that references some column, F, in another table, but the given value is not in F.primary
structure, and the given value already exists in the column.unique
structure, and the given value already exists in the column.tree()
structure that references some other column, T, of the same table; but the given value is not in T.when_column
is COLUMN to be violated.For a given row of data in a given data table, Valve's validation process consists in a series of checks that are schematically represented in the flowchart below and subsequently explained in more detail.
flowchart TD
node1["1. Determine the nulltype (if any) of each cell in the row"]
node1 -- "Then, for each cell:" --> node2
node2["2. Validate rules"]
node2 --> modal1
modal1{"Does the cell have a nulltype?"}
modal1 -- No --> node3
modal1 -- Yes, skip further validation for this cell --> modal3
node3["3. Validate datatype"]
node3 --> modal2
modal2{"Does the cell value contain a SQL type error?"}
modal2 -- No --> node4
modal2 -- Yes --> modal3
node4["4. Validate foreign constraints"]
node4 --> node5
node5["5. Validate primary and unique Constraints"]
node5 --> modal3
modal3{"Have we iterated over all of the cells in the row?"}
modal3 -- Yes, then over the table as a whole: --> node6
modal3 -- No, go on to the next cell --> node2
node6["6. Validate tree-foreign keys"]
The validation process begins by determining, for each cell in the row, whether the value of that cell matches the nulltype (if any) of its associated column, as defined in the column table. In particular, if the value of the cell matches the nulltype of its associated column, then the nulltype
field of the ValveCell
struct used to represent the cell will be set to indicate that the value is a null value of that type. Otherwise the nulltype
field will remain unset, indicating that the value is not a null value. For instance, suppose that the cell value is '' (i.e., the empty string), and that the nulltype for its associated column, as defined in the column table, is empty
. Since empty
's associated condition, as defined in the datatype table is equals('')
, applying it to the cell value will result in a match, and Valve will set the nulltype
field for the ValveCell
representing this particular cell to empty
. In the case where the value of the cell does not match the condition associated with the datatype, empty
, (i.e., when the cell value is something other than an empty string), the validation process will leave the nulltype
field of the ValveCell
unset.
This step of the validation process determines whether any of the rules in the rule table that are applicable to a cell have been violated. A rule in the rule table is applicable to a cell when the cell's associated column is the same as the when_column associated with the rule. Note that since the rules in the rule table correspond to if-then conditionals, such that the antecedent and consequent of a given conditional refer (in general) to two distinct columns, a rule violation may indicate that there is a problem with the value of either or both. Whenever a rule violation occurs, a ValveCellMessage
struct is added to the list of messages associated with the cell, identifying the particular violation that occurred (see the section on rule violation IDs) and its associated level
and description
as found in the rule table.
This step of the validation process determines whether a cell's value violates the datatype condition, as defined in the datatype table, for the datatype associated with the cell's column in the column table. When a datatype violation occurs, a ValveCellMessage
struct is added to the list of messages associated with the cell, identifying the particular datatype violation that occurred (see the section on rule violation IDs). The text of the message is taken from the description of the datatype that has been violated (see the datatype table).
This step in the validation process verifies, for a given cell, that if the cell's associated column, C
, has been configured with a structure of the form from(T, F)
(see the column-table), where T
is a foreign table and F
is a column in T
, then the cell's value (or values if C
's datatype is a list datatype) is (are) among the values of F
. Note that if the foreign table has a _conflict
version (see the table table), then this function will distinguish between (a) the case in which a given value is not found in either the foreign table or its associated conflict table, and (b) the case in which a given value is found only in the conflict table. When a foreign constraint violation occurs, a ValveCellMessage
struct is added to the list of messages associated with the cell with the identifier key:foreign
(see also the section on rule violation IDs). The text of the message will be of the form: Value 'V' of column C is not in T.F
, whenever V
is neither found in the normal version of the table nor (if applicable) its conflict version, and it will be of the form Value 'V' of column C exists only in T_conflict.F
whenever V
exists in T_conflict
but not in T
.
This step in the validation process verifies, for a given cell, that if the cell's associated column has been configured with either a primary
or a unique
constraint (see the column-table), then the cell's value is not among the values of the column that have already been inserted into the table, neither in the normal version of the table nor in the conflict version (in the case where the conflict option has been set). When a primary or unique constraint violation occurs, a ValveCellMessage
struct is added to the list of messages associated with the cell with the appropriate rule violation ID. The text of the message is of the form Values of COLUMN must be unique
.
When a column, column2
has a structure tree(column1)
defined on it in the column table, then all non-null values of column2
must exist in column1
. This step in the validation process function verifies, for a given table, whether any of the values of any of the cells in any of the rows violate any of the table's tree()
conditions. When a primary or unique constraint violation occurs, a ValveCellMessage
struct is added to the list of messages associated with the cell with the appropriate rule violation ID. The text of the message is of the form Value 'VAL' of column COLUMN2 is not in COLUMN1
.
The algorithm described in the previous section is applicable to a single row of data. When initially loading the many rows of a data table from its source into the database, however, a number of optimizations may be used to speed up the validation process. In particular, some of the steps for validating a row's cells from the previous section do not need to refer to any data external to a given row, and may therefore be performed in parallel. These are steps 1–3, i.e., the nulltype, rules, and datatype validation steps. Only the remaining steps: foreign and unique/primary constraint validation, need be performed row by row.
Once a data table has been loaded into the database and validated, it may be desirable to further edit the data, for instance to resolve any outstanding validation messages, or for some other reason. The possible operations that can be performed on the data are insert_row()
, update_row()
, delete_row()
, and move_row()
. In addition the function validate_row()
is used to generate validation information about the row during the editing process.
Note that insert_row()
, update_row()
, and validate_row()
require that a row be specified in the following "simple" form:
{
"column_1": value1,
"column_2": value2,
...
},
The output of these three functions is a ValveRow
(see representing validated data).
After editing the data, a record of the change will be inserted to the history table. The ValveRow
representing the row as it was before the change (in the case of an update, delete, or move operation), and the ValveRow
representing it as it is after the change (in the case of an update, move, or insert), will be used as the values of the from and to columns, respectively. Note that to represent a ValveRow
in the database, "rich" JSON format is used:
{
"column_1": {
"valid": <true|false>,
"messages": [{"level": level, "rule": rule, "message": message}, ...],
"value": value1
},
"column_2": {
"valid": <true|false>,
"messages": [{"level": level, "rule": rule, "message": message}, ...],
"value": value2
},
...
},
Valve also provides the undo()
and redo()
functions, to undo the last insert, update, or delete operation, and to redo the last undo, respectively. For more information on the data manipulation and browsing operations provided by Valve see the section on the Valve API.
To save a data table or tables to a '.tsv' file, one uses the API functions, save_all_tables()
, save_tables()
, or save_table()
. Note that unless the save option has been set, it is not possible to overwrite the source file from which the table was originally loaded. One must save the table to an alternate location. For more information on saving data tables Valve see the section on the Valve API.
Option 1: Build and install the binary for the latest release using crates.io by running:
cargo install ontodev_valve
Option 2: Download the appropriate binary for your system from the release page and copy it to your system's executable path.
It is also possible to build the binary yourself from the source code. This is required if you would like to work with a version of Valve that is newer than the latest release. Begin by using git clone
to clone the source code repository into a local folder:
git clone git@github.com:ontodev/valve.rs.git
Next, use cargo
to build the ontodev_valve
binary (this may take awhile to complete and will produce a lot of output):
cargo build --release
After the build has completed successfully, the ontodev_valve
binary may be found (relative to the repository's root directory) in the target/release/
subdirectory. You should copy or move it from here to your user or system-wide bin/
directory, or to some other convenient location. For example,
cp target/release/ontodev_valve ~/bin/
Finally, run
ontodev_valve --help
to verify that Valve was installed correctly.
Valve is configured primarily using a number of special configuration tables that can be represented as '.tsv' files. The most important of these is the table called 'table', also known as the table table. A table table configuration is required to use Valve. A column table and datatype table configuration are required as well. Optionally, the user may also specify a rule table configuration.
The table table is alone among the configuration tables in that it cannot be given an arbitrary name but must always be given the name 'table'. This is not the case for the column, datatype and rules tables. Although it is recommended to use the names 'column, 'datatype', and 'rule', respectively, alternate names may be chosen for these tables as explained below.
One normally configures the table table using a '.tsv' file called 'table.tsv', although any filename could be used in principle, and in fact it is also possible to read the table table directly from the database, as long as the database already contains a table called 'table' with the right setup.
Below is an example of a table table:
table | path | description | type | options |
---|---|---|---|---|
table | schema/table.tsv | table | ||
column | schema/column.tsv | column | ||
datatype | schema/datatype.tsv | datatype | ||
rule | schema/rule.tsv | rule | ||
user_table1 | schema/user/user_table1.tsv | |||
user_readonly_table1 | schema/user/user_readonly_table1.tsv | no-edit no-save no-conflict | ||
user_view1 | schema/user/user_view1.sql | db_view | ||
user_view2 | schema/user/user_view2.sh | db_view | ||
user_view3 | db_view |
Note that in the first row above the table being described is the table table itself. In general the columns of the table table have the following significance:
table
, column
, datatype
, and rule
table types. Data tables (e.g., the 'user_*' tables in the above example) should not explicitly specify a type, and in general if a type other than the ones just mentioned is specified, Valve will exit with an "Unrecognized table type" error.The path column indicates where the data for a given table may be found. It can be (a) a '.tsv' file, (b) a '.sql' file, (c) some other executable file, or (d) it may be empty. In each case it will have the following consequences for the possible values of type and and for the possible options that may be used with the table.
If path ends in '.tsv':
table
, column
, datatype
, rule
, or it may be empty.If path does not end in '.tsv':
If path either ends in '.sql', or represents a generic executable:
If path is empty:
The above is conveniently summarized in the following table:
path | possible types | possible options | created by | loaded by† |
---|---|---|---|---|
Ends in '.tsv' | any valid type, or empty | db_view not allowed | Valve | Valve |
Ends in '.sql' | must be empty | edit not allowed | db_view: the '.sql' file, db_table: Valve | Valve |
Generic executable | must be empty | edit not allowed | db_view: the executable, db_table: Valve | Valve |
empty | must be empty | edit not allowed | No one (assumed to already exist) | No one (assumed to be already loaded) |
† Note that loading is only applicable when the db_view option has not been set.
If no options are specified, the options db_table, truncate, load, save, edit, validate_on_load, and conflict will all be set to true by default. The complete list of allowable options, and their meanings, are given below:
Here are some examples of commonly used table table configurations:
table | path | description | type | options |
---|---|---|---|---|
plain | src/data/plain.tsv | A database table with default options set, backed by a '.tsv' file | ||
appendonly | src/data/appendonly.tsv | A database table which is appended to but never truncated | no-truncate | |
readonly1 | src/data/readonly1.tsv | A read-only table that is backed by a '.tsv' file | no-edit no-save | |
readonly2 | src/data/readonly2.sh | A read-only table whose contents are assumed to be valid, backed by a generic executable | no-edit no-conflict no-validate-on-load | |
view1 | src/data/view1.sql | A view backed by a '.sql' file | db_view | |
table | src/table.tsv | the table table | table | |
column | src/column.tsv | the column table | column | |
datatype | src/datatype.tsv | the datatype table | datatype | |
rule | src/rule.tsv | the rule table | rule |
In addition to a table table, Valve also requires a column table. The column table configuration is normally stored in a file called 'column.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'column' in the table table. The column table contains one row for every column of every configured table. This includes both special configuration tables (such as the column table itself) and user-defined tables.
Below is an example column table, with the special configuration tables omitted:
table | column | label | nulltype | default | datatype | structure | description |
---|---|---|---|---|---|---|---|
table1 | column1 | Column 1 | value1 | ||||
table1 | column2 | Column 2 | empty | integer | from(table2.column2) | ||
table2 | column1 | Column 1 | trimmed_line | primary | |||
table2 | column2 | Column 2 | integer | unique | |||
table3 | column1 | Column 1 | word | primary | |||
table3 | column2 | Column 2 | empty | word | tree(column1) |
The columns of the column table have the following significance:
DEFAULT
constraint will be declared for the column.primary
: The column is the primary key for the table to which it belongs; values must therefore be unique. Note that in the database this implies that a PRIMARY KEY
constraint will be declared for the column.unique
: The column's values must be unique. Note that in the database this implies that a UNIQUE
constraint will be declared for the column.from(foreign_table.foreign_column)
: All non-null values of the column must exist in the column foreign_column
of the table foreign_table
. Note that in the database this implies that a FOREIGN KEY
constraint will be declared for the column, unless the column's datatype is a list datatype (see the datatype table), and it also implies that a UNIQUE
constraint will be declared for foreign_table.foreign_column
, unless a unique
structure has already been declared for that column in the column table.tree(column_name)
: All non-null values of the column must exist in the column column_name
of the same tableIn addition to the table table and the column table, Valve also requires the user to configure a datatype table. The datatype table configuration is normally stored in a file called 'datatype.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'datatype' in the table table. The datatype table stores the definitions of the datatypes referred to in the datatype column of the column table.
Below is a subset of the rows of an example datatype table:
datatype | parent | condition | description | sql_type | HTML type | format |
---|---|---|---|---|---|---|
text | any text | TEXT | textarea | |||
integer | text | match(/-?\d+/) | a positive or negative decimal digit, or 0 | INTEGER | %i | |
word | text | exclude(/\W/) | a single word: letters, numbers, underscore | %s | ||
empty | text | equals('') | the empty string | NULL | ||
custom1 | text | match(/\S+:\S+/) | two nonspace character sequences separated by ':' | |||
custom2 | text | in(alice, bob, cindy) | either 'alice', 'bob', or 'cindy' | |||
custom3 | text | list(word, ' ') | a list of words separated by spaces | |||
custom4 | text | search(/\d+/) | a string containing a sequence of digits |
The columns of the datatype table have the following significance:
match(/REGEX/)
: Violated if a given value does not match REGEX
.exclude(/REGEX/)
: Violated if a given value contains an instance of REGEX
.search(/REGEX/)
: Violated if a given value does not contain an instance of REGEX
.equals(VAL)
: Violated if a given value is not equal to VAL
.in(VAL1, ...)
: Violated if a given value is not one of the values in the list: VAL1, ...
list(ITEM_DATATYPE, SEPARATOR)
: Violated if a given value is not in the form of a sequence of items, each of datatype ITEM_DATATYPE
, separated by the string SEPARATOR
. Otherwise the condition is violated if any of the items in the given list fail to conform to ITEM_DATATYPE
.Valve requires that the following datatypes be defined:
text
, empty
, line
, trimmed_line
, nonspace
, word
The recommended datatype configurations for these four datatypes are the following (note that the HTML type
and format
columns are optional):
datatype | parent | condition | description | sql_type | HTML type | format |
---|---|---|---|---|---|---|
text | TEXT | textarea | ||||
empty | text | equals('') | NULL | |||
line | text | exclude(/\n/) | input | |||
trimmed_line | line | match(/\S([^\n]\S)/) | ||||
nonspace | trimmed_line | exclude(/\s/) | ||||
word | nonspace | exclude(/\W/) | %s |
In addition to the table table, the column table, and the datatype table, it is also possible (but optional) to configure a table of type 'rule', or a rule table. When it is configured, the rule table configuration is normally stored in a file called 'rule.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'rule' in the table table.
The rule table is used to define a number of rules of the following form:
when CONDITION_ON_COLUMN_1
is satisfied then CONDITION_ON_COLUMN_2
must also be satisfied (1)
where CONDITION_ON_COLUMN_1
and CONDITION_ON_COLUMN_2
are defined with respect to the same table.
Below is an example rule table:
table | when column | when condition | then column | then condition | level | description |
---|---|---|---|---|---|---|
table1 | foo | null | bar | not null | error | |
table2 | foo | negative_int | bar | positive_int | error |
The columns of the rule table have the following significance:
In some cases it is useful to be able to try and guess what the table table and column table configuration should be, using information about the current state of the Valve instance, for a given data table not currently managed by Valve. To do this one may use Valve's command line interface to run the guess subcommand as follows:
ontodev_valve guess [OPTIONS] SOURCE DATABASE TABLE_TSV
where:
SOURCE
is the location of the '.tsv' representing the table table.DATABASE
is the path to a PostgreSQL or SQLite database.TABLE_TSV
is the '.tsv' file representing the data table whose column configuration is to be guessed.For the list of possible options, and for general information on Valve's command line interface, see command line usage.
Below is an example of using guess:
$ ontodev_valve guess test/guess_test_data/table.tsv build/valve_guess.db test/guess_test_data/ontology/table2.tsv
The following row will be inserted to "table":
table path type description
table2 test/guess_test_data/ontology/table2.tsv
The following row will be inserted to "column":
table column label nulltype datatype structure description
table2 zork integer primary
table2 zindy IRI unique
table2 xyzzy suffix unique
table2 foo prefix
table2 bar IRI unique
Do you want to write this updated configuration to the database? [y/N] y
The basic syntax when calling Valve on the command line is:
ontodev_valve [OPTIONS] <SUBCOMMAND <SUBCOMMAND POSITIONAL PARAMETERS>>
To view the list of possible subcommands and global options, run:
ontodev_valve --help
To get help on a particular subcommand, and on the positional parameters and options that are specific to it, run:
ontodev_valve SUBCOMMAND --help
By default Valve only logs error messages. To also enable warning and information messages,
set the environment variable RUST_LOG
to the minimum logging level desired for ontodev_valve:
debug
, info
, warn
, or error
.
For instance:
export RUST_LOG="ontodev_valve=info"
For further information see the Rust Cookbook.
The API (application-programmer interface) reference documentation for the latest release of Valve can normally be found on docs.rs. In case you would like to generate the API documentation for a newer version of Valve, run:
cargo doc
in the the root folder of your local copy of the valve.rs source code repository (see Installation and configuration), and then open the file
<valve.rs root folder>/target/doc/ontodev_valve/index.html
in your favourite browser. For a higher level See also the Design and concepts section.
TODO.
See valve.py
License: BSD-3-Clause