Previous VALVE tools output "standard message tables" that locate a cell and then describe validation results. General documentation is here https://github.com/ontodev/cogs#message-tables. We want to output something like this for use in reporting and quality control. I've found these tables very useful for summarizing problems with a dataset, and filtering for just the most important problems.
The validation information we need is already generated by the validation methods. We basically want to iterate over rows and collect validation messages from the _meta columns. I've written prototypes that do this in pure SQL.
I think the main question is how to specify cell locations.
In VALVE and COGS we populate the 'cell' column using "A1" syntax (i.e. column letter, row number) to locate the cell we want to talk about. Because VALVE and COGS are focused on spreadsheets, this is natural. Now that we're using SQL this is not as natural:
SQL uses column names instead of letters
SQL tables are multisets (technically): sets, with no order, allowing duplicates
most SQL implementations have a way to track row order, e.g. SQLite row_number()
primary keys are a better way to locate rows (I think), when available
We could map column names to letters. We want to have deterministic sort for all tables, as discussed in #36, so we should always be able to determine a row number, although I'm a little worried about a performance hit. I think we want the option to output a 'cell' column in A1 syntax.
But I think that what we usually want will be two columns, 'column' and 'row', where 'column' is the column name, and 'row' is the primary key for that row, or the row index if we don't have a primary key. If we have multiple primary keys, then I would want to concatenate them with a separator, although it's hard to pick a separator that won't conflict with the content of the key.
Previous VALVE tools output "standard message tables" that locate a cell and then describe validation results. General documentation is here https://github.com/ontodev/cogs#message-tables. We want to output something like this for use in reporting and quality control. I've found these tables very useful for summarizing problems with a dataset, and filtering for just the most important problems.
The validation information we need is already generated by the validation methods. We basically want to iterate over rows and collect validation messages from the _meta columns. I've written prototypes that do this in pure SQL.
I think the main question is how to specify cell locations.
In VALVE and COGS we populate the 'cell' column using "A1" syntax (i.e. column letter, row number) to locate the cell we want to talk about. Because VALVE and COGS are focused on spreadsheets, this is natural. Now that we're using SQL this is not as natural:
row_number()
We could map column names to letters. We want to have deterministic sort for all tables, as discussed in #36, so we should always be able to determine a row number, although I'm a little worried about a performance hit. I think we want the option to output a 'cell' column in A1 syntax.
But I think that what we usually want will be two columns, 'column' and 'row', where 'column' is the column name, and 'row' is the primary key for that row, or the row index if we don't have a primary key. If we have multiple primary keys, then I would want to concatenate them with a separator, although it's hard to pick a separator that won't conflict with the content of the key.