jamesaoverton / cmi-pb-terminology

CMI-PB Controlled Terminology
0 stars 0 forks source link

Output standard message tables #37

Closed jamesaoverton closed 2 years ago

jamesaoverton commented 2 years ago

Previous VALVE tools output "standard message tables" that locate a cell and then describe validation results. General documentation is here https://github.com/ontodev/cogs#message-tables. We want to output something like this for use in reporting and quality control. I've found these tables very useful for summarizing problems with a dataset, and filtering for just the most important problems.

The validation information we need is already generated by the validation methods. We basically want to iterate over rows and collect validation messages from the _meta columns. I've written prototypes that do this in pure SQL.

I think the main question is how to specify cell locations.

In VALVE and COGS we populate the 'cell' column using "A1" syntax (i.e. column letter, row number) to locate the cell we want to talk about. Because VALVE and COGS are focused on spreadsheets, this is natural. Now that we're using SQL this is not as natural:

  1. SQL uses column names instead of letters
  2. SQL tables are multisets (technically): sets, with no order, allowing duplicates
    • most SQL implementations have a way to track row order, e.g. SQLite row_number()
    • primary keys are a better way to locate rows (I think), when available

We could map column names to letters. We want to have deterministic sort for all tables, as discussed in #36, so we should always be able to determine a row number, although I'm a little worried about a performance hit. I think we want the option to output a 'cell' column in A1 syntax.

But I think that what we usually want will be two columns, 'column' and 'row', where 'column' is the column name, and 'row' is the primary key for that row, or the row index if we don't have a primary key. If we have multiple primary keys, then I would want to concatenate them with a separator, although it's hard to pick a separator that won't conflict with the content of the key.

lmcmicu commented 2 years ago

Note that the output should be to a .tsv file.

lmcmicu commented 2 years ago

Closed by #42