ucoProject / UCO

This repository is for development of the Unified Cyber Ontology.

Apache License 2.0

78 stars 34 forks source link

UCO should represent a database record at the record-cell level #415

Closed ajnelson-nist closed 1 year ago

ajnelson-nist commented 2 years ago

Background

This Issue stemmed from a question:

On the CASE Crossover Scenario, a tool's output identified an incorrect association with an ICCID number for a phone. The ICCID number is known to be incorrect because the person who seeded the phone data knows what the ICCID was.

https://caseontology.org/examples/crossover_wmd/

The ICCCID should be 8931088918010550289, whereas the tool claimed a value 89390100002217635543. This discrepancy comes from having selected an incorrect record in a SQLite file.

How should UCO designate a certain record within SQLite? SQLite has the advantage of being able to identify its allocated rows with row_number(). (I had thought there was a ROW_NO, but haven't had a chance to go back through the SQLite docs to confirm.)

https://www.sqlite.org/windowfunctions.html

Additional nuance for this example: This example relies on allocated records. Do we need a mechanism to identify unallocated records as well initially (especially for carving work)?

Requirements

Requirement 1

UCO must be able to represent an individual cell within a database table.

Requirement 2

UCO must be able to relate a cell within a database to its containing column within the row. Likewise for containing row within the table, table within the database, or table within the schema and schema within the database. Each of these relating properties must tie to the cell.

Requirement 3

At least in the initial representation, it must not be necessary to represent each layer between the whole database as an object, down to schema, down to table, down to row, down to column, down to cell. The cell object itself must be able to carry its locating characteristics.

Requirement 4

The cell contents must be able to represent strings, binary content, floating point numbers (xsd:decimal), integers, OR that the cell is NULL.

Risk / Benefit analysis

Benefits

One or more database row-column points can be the source of significant analytic conclusions. Being able to refer to one cell unambiguously is essential.

Risks

Databases, as compact representations of relational data, have an entire field of research devoted to optimal encodings and serializations. UCO's representation, as part of an RDF-based graph, will always feel suboptimal. Proactive characterization of all database cells within a UCO-based characterization operation (such as a CASE investigation) is likely to lead to an unwieldy graph size.

Competencies demonstrated

Competency 1

The Crossover "WMD" scenario has an issue where a tool drew an assignment for ICCID from either a system .dat file, or from a certain selection within a table that stored ICCID histories. (See the "SIM CARDS" section on that page.) The ground truth is known in this instance, and the value reported by that tool is incorrect.

Competency Question 1.1

What database cell provided the ICCID values 89390100002217635543 (known incorrect) and 8931088918010550289 (known in ground truth to be correct)?

Result 1.1

SELECT ?nRecord ?lTableName ?lRowID ?lFieldName ?lFieldValue
WHERE {
  ?nRecord
    drafting:databaseTable ?lTableName ;
    drafting:databaseRowID ?lRowID ;
    drafting:databaseFieldName ?lFieldName ;
    drafting:databaseFieldValue ?lFieldValue
    .

  FILTER ( ?lFieldValue IN ("89390100002217635543", "8931088918010550289") )
}

Solution suggestion

The solution is drafted in CASE-Examples' drafting.ttl here, except a property drafting:databaseFieldIsNull, Boolean-ranged, should be added as part of an sh:xone (exactly one) constraint incorporating constraints on drafting:databaseFieldValue .

Coordination

Tracking in Jira ticket OC-251
[x] Administrative review completed, proposal announced to Ontology Committees (OCs) on 2022-08-03
[x] Requirements to be discussed in OC meeting, 2022-08-09
[x] Requirements Review vote occurred, passing, on 2022-08-09
[x] Requirements development phase completed.
[x] Solution announced to OCs on 2022-11-07
[x] Solutions Approval to be discussed in OC meeting, 2022-11-07
[x] Decision on recordRowId range including strings or not is recorded
[x] Decision recorded on whether to change recordRowId definition to: The unique ID that identifies a database record, supplied by the originating database engine..
[x] Solutions Approval vote occurred, passing, on 2022-11-17
[x] Solutions development phase completed.
[x] Implementation merged into develop
[x] Milestone linked
[x] Documentation logged in pending release page

kchason commented 2 years ago

Drafting examples are being developed at: https://github.com/casework/CASE-Examples/tree/415-database-records/examples/illustrations/database_records

ajnelson-nist commented 2 years ago

CASE-Example development has been merged, and the example is now here. @kchason , could you please copy that drafting.ttl over to the website and add a snippet illustrating how it would be used to represent the two records?

The crossover page presents that row 3 (last row) in telephony.db is the correct ICCID assignment. Row 1 has a coinciding ICCID from the erroneous report drawn from SimCard.dat. Ground truth ( per @eoghanscasey , phone data populater) is that row 3 has the right ICCID number, from his knowledge of having put the SIM card in himself.

@kchason , can you please add rows 1 and 3 as a new code snippet with the drafting concepts, with appropriate descriptive narrative (could reword what I put above)?

I'll revise this Issue into a proposal now.

sbarnum commented 2 years ago

While the referenced examples contain anecdotal use of the new DatabaseRecordFacet and some properties, I believe the Solution Suggestion section of this CP should explicitly outline details of all proposed new classes and properties including property type and cardinality.

gwebb-case commented 2 years ago

I voted to accept the above, as I do agree with the majority of what is being proposed, but have the following main reservation in the definition drafting.ttl; and in the supporting example database_records.json :

The current class definitions for DatabaseRecord and DatabaseRecordFacet appears to describe a field not a record.

note: A Record contains a [ordered/unordered] collection of Fields, but the current definitions only describes a single field.

gwebb-case commented 2 years ago

A further reservation is that Databases do not directly have have records; or fields, assigned to them, they only have tables (of types Table, View, and Index), so semantically I think the current naming of a number of the classes and properties in the definition is likely to cause confusion for the following reasons:

A Database consists of 1 or more Tables, and it is not uncommon for a dB to have in excess of 300 tables, even in SQLite dBs,
Each table will have a single schema, which describes its original field types.
Tables have Records (Rows), which in turn contain a [ordered/unordered] collection of Fields.

Personally, I would much prefer that some time in the future the following simple semantic changes be made to more closely match what seems to be being described by the current draft ontology and its associated example:

Current -> Rename drafting:databaseTable -> drafting:tableName drafting:databaseSchema -> drafting:tableSchema

drafting:DatabaseRecord -> drafting:TableField drafting:DatabaseRecordFacet -> drafting:TableFieldFacet

drafting:databaseRowID -> drafting:recordRowID drafting:databaseFieldName -> drafting:recordFieldName or drafting:fieldName drafting:databaseFieldValue -> drafting:recordFieldValue or drafting:fieldValue

and add a new class called drafting:Table to hold the current properties drafting:databaseTable and drafting:databaseSchema.

These changes would, in my opinion, also make it relatively simple to describe non-relational databases; and other table types found forensics, using the same classes and properties.

ajnelson-nist commented 2 years ago

@gwebb-case , I agree with you on "Record" more typically corresponding with a database row. I'd vaguely recalled relational algebra using "record" when referencing a tuple.

As for your field names, we had discussed leaving the possibility open in the future for a higher-level abstraction of "Table" that could also accommodate a table in a HTML, Word, PDF, or other document. So, these two changes I'd NACK, because "tableSchema" would not mesh well with that.

drafting:databaseTable -> drafting:tableName
drafting:databaseSchema -> drafting:tableSchema

I had thought that TableCell would work better than TableField, but the non-relational database matter wouldn't work as well with TableCell. @kchason suggested looking at MongoDB as a sanity-check for how well this rename would apply. So, I agree with this rename:

drafting:DatabaseRecord -> drafting:TableField
drafting:DatabaseRecordFacet -> drafting:TableFieldFacet

I'm on the fence on whether we need to keep the word "database" in these, as a matter of verbosity vs. concept scope confusion. We might want close to the same property names for doing (in the future) HTML forms. (Skipping a few steps in modeling needs, I could see some desire to certain HTML web form based attacks.) I think for now we would be reasonably future-safe for that other modeling need if we used recordFieldName (/Value) for databases.

drafting:databaseRowID -> drafting:recordRowID
drafting:databaseFieldName -> drafting:recordFieldName
drafting:databaseFieldValue -> drafting:recordFieldValue

ajnelson-nist commented 2 years ago

Implementation has not been completed to the point of exercising the database NULL representation. We could vote on this today if we think it is sufficiently specified, but there is some risk form not testing yet.

ajnelson-nist commented 1 year ago

Looking over the state of the PR, I think there's only one thing I'd like to see changed. If there's anyone else in favor, please note so, and we can make the extension.

observable:recordRowID is currently proposed as a property with range xsd:string. This is not compatible with SQLite rowid tables, where no explicit primary key or INTEGER PRIMARY KEY make the unique row identifier an integer. It's possible for UCO to just cast the integer to a string, but this then burdens UCO consumers with testing for number-ness of each string value.

Should recordRowID have a range of the union of xsd:integer and xsd:string? (For the sake of SQLite supporting negative integers as its rowid (source: see that same page), I won't suggest a xsd:nonNegativeInteger restriction.)

plbt5 commented 1 year ago

Alex,

I know this is a long time since, and I should have reacted sooner. Let me just ask the question and leave it with that. Why are we modelling the design of a relational database schema, as opposed to what it represents?

We can discuss this Friday if you want.

Best Paul

On 26 Jul 2022, at 18:03, Alex Nelson @.***> wrote:

On the CASE Crossover Scenario, a tool's output identified an incorrect association with an ICCID number for a phone. The ICCID number is known to be incorrect because the person who seeded the phone data knows what the ICCID was.

https://caseontology.org/examples/crossover_wmd/ https://caseontology.org/examples/crossover_wmd/ The ICCCID should be 8931088918010550289, whereas the tool claimed a value 89390100002217635543. This discrepancy comes from having selected an incorrect record in a SQLite file.

How should UCO designate a certain record within SQLite? SQLite has the advantage of being able to identify its allocated rows with row_number(). (I had thought there was a ROW_NO, but haven't had a chance to go back through the SQLite docs to confirm.)

https://www.sqlite.org/windowfunctions.html https://www.sqlite.org/windowfunctions.html Additional nuance for this example: This example relies on allocated records. Do we need a mechanism to identify unallocated records as well initially (especially for carving work)?

— Reply to this email directly, view it on GitHub https://github.com/ucoProject/UCO/issues/415, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACAWGQK5RKI7T3JU2EUIAS3VWAD35ANCNFSM54WPRG7Q. You are receiving this because you are subscribed to this thread.

gjwebb-case commented 1 year ago

Hi Alex,

I agree with your view, xsd:integer would be by far the better data type to use.

In my opinion xsd:integer would still be best irrespective of whether a record is modelled directly on the Database or is based on what it representing, it would also allow it to be modelled on its location based on a disk offset (e.g a record in unallocated disk space).

packet-rat commented 1 year ago

Structurally, I agree. Higher precision is typically better when modeling the world (I.e., 4096 vs. 4K)

The only challenge is that, as I’m certain has been discussed, different methods are frequently used to characterize file and disk sizes (i.e. 40GB). People and/or tools populating these values are likely to use non-precise terms.

Perhaps a modifier?

Note: I do support Integer.

From: gjwebb-case @.> Sent: Wednesday, November 9, 2022 9:11:20 AM To: ucoProject/UCO @.> Cc: Subscribed @.***> Subject: Re: [ucoProject/UCO] UCO should represent a database record at the record-cell level (Issue #415)

Hi Alex,

I agree with your view, xsd:integer would be by far the better data type to use.

In my opinion xsd:integer would still be best irrespective of whether a record is modelled directly on the Database or is based on what it representing, it would also allow it to be modelled on its location based on a disk offset (e.g a record in unallocated disk space).

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/ucoProject/UCO/issues/415*issuecomment-1308824289__;Iw!!BhdT!g1IBu1mewVDEGmLUXFnrAy1xtSGcU5kpm2Mdtc7rWRI93mDN9xFBd6SrdAxAmk0Hc9Vjm1wxB3_FXb7kKC7UifmTVJo$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AAYSFYP2H6SW5DWVQDT6ZJTWHOWIRANCNFSM54WPRG7Q__;!!BhdT!g1IBu1mewVDEGmLUXFnrAy1xtSGcU5kpm2Mdtc7rWRI93mDN9xFBd6SrdAxAmk0Hc9Vjm1wxB3_FXb7kKC7Uk6dBgpo$. You are receiving this because you are subscribed to this thread.Message ID: @.***>

ajnelson-nist commented 1 year ago

Re: @gjwebb-case

Hi Alex,

I agree with your view, xsd:integer would be by far the better data type to use.

In my opinion xsd:integer would still be best irrespective of whether a record is modelled directly on the Database or is based on what it representing, it would also allow it to be modelled on its location based on a disk offset (e.g a record in unallocated disk space).

The point I raised about integers was scoped to observable:recordRowID, which currently has this definition:

The unique ID that identifies a database record.

I need to warn everybody about usage of the word "Unique" in general. "Unique" implies a universe of scope and a defining authority. @gjwebb-case , what you suggested about using the location based on disk offset would be incongruous with the an identifier supplied by the database engine. If you need to recover such a record from unallocated space, and you don't have an inlined identifier within that record, it would be inappropriate in my opinion to assign your own unique identifier. In that scenario, you should instead use observable:DataRangeFacet, and leave observable:recordRowID unassigned.

If others agree with avoiding concept conflation, I suggest observable:recordRowID be altered to read:

The unique ID that identifies a database record, supplied by the originating database engine.

Meanwhile, while I prefer xsd:integer in general, I did suggest that we allow observable:recordRowID to be the union of xsd:integer and xsd:float, because I'm aware that some database engines permit primary key values of strings. I've been in an only-numbers-for-primary-keys habit for years now, so I've forgotten if there are other datatypes permitted in single-column primary keys. I defer to people with more database-internals experience on whether we should use integers or a union of integers and strings (and other types?) for observable:recordRowID.

Re: @packet-rat :

The only challenge is that, as I’m certain has been discussed, different methods are frequently used to characterize file and disk sizes (i.e. 40GB). People and/or tools populating these values are likely to use non-precise terms.

Perhaps a modifier?

I suggest not introducing support for "rounding" modifiers like "kiB". The original issue inspiring this proposal needed to select between two records in a single SQLite table that had, IIRC, two records total. Permitting rounding defeats the use case where we need to point to this offset within a disk sector or database page, rather than that offset.

ajnelson-nist commented 1 year ago

Re: @plbt5

Alex, I know this is a long time since, and I should have reacted sooner. Let me just ask the question and leave it with that. Why are we modelling the design of a relational database schema, as opposed to what it represents?

I'm not quite sure what level this question is at, but I'll reply per my best guess.

We are defining a model that can, as a user needs, represent a record with an individual object and a field per layer, rather than requiring representing each layer of the database model (database engine class, database engine instance, backing-store file or disk partition, schema instance if applicable, table instance, row instance, column instance) and relationships between each layer.

We do this in part to support provenience (independent of provenance) - so we can relate a single field within a record geometrically with its containing objects (such as the backing-store storage object).

Does that answer your question?

kchason commented 1 year ago

Looking over the state of the PR, I think there's only one thing I'd like to see changed. If there's anyone else in favor, please note so, and we can make the extension.

observable:recordRowID is currently proposed as a property with range xsd:string. This is not compatible with SQLite rowid tables, where no explicit primary key or INTEGER PRIMARY KEY make the unique row identifier an integer. It's possible for UCO to just cast the integer to a string, but this then burdens UCO consumers with testing for number-ness of each string value.

Should recordRowID have a range of the union of xsd:integer and xsd:string? (For the sake of SQLite supporting negative integers as its rowid (source: see that same page), I won't suggest a xsd:nonNegativeInteger restriction.)

I think this generally makes sense.

I have a concern with just switching it to xsd:integer (or other numeric option) as some databases suggest GUIDs as primary keys. Do we have precedent for having unions of types like this (beyond open vocabularies of xsd:string and a dictionary)?

While not really the concern of the ontology community itself, this would cause issues for some of the strongly-typed bindings that have been developed to help generate CASE graphs and would otherwise cause additional logic needed for consumers of the graphs to account for both types.

ajnelson-nist commented 1 year ago

I have a concern with just switching it to xsd:integer (or other numeric option) as some databases suggest GUIDs as primary keys. Do we have precedent for having unions of types like this (beyond open vocabularies of xsd:string and a dictionary)?

On review, not really.

observable:priority allows a union of integer, string, and ... TaskPriorityVocab. The definition of that property is a bit incongruous with its multiple uses (email messages and Windows processes). I had thought priority was meant to support integers or strings for one of those two uses, but our SHACL shapes only permit xsd:string, without mention of TaskPriorityVocab. So, priority needs a bugfix.

So, I don't think we have precedent, but we have the knowledge of how to encode and test for a union. We can make a union work.

While not really the concern of the ontology community itself, this would cause issues for some of the strongly-typed bindings that have been developed to help generate CASE graphs and would otherwise cause additional logic needed for consumers of the graphs to account for both types.

This is an adoption-level concern that is relevant to the ontology committee. Do we know what the total union of datatypes permitted for primary key usage is? E.g. is a DATETIME permitted in some engine? Should we have the range of observable:recordRowId be the union of string, integer, and xsd:hexBinary (shorter binary data, rather than xsd:base64Binary for longer), with hexBinary being our catch-all for other datatypes that are special to the database engine?

ajnelson-nist commented 1 year ago

I'm becoming hesitant to include observable:recordRowId in this proposal. It's not clear to me how it would be used in these scenarios:

A record from a SQLite table with no PRIMARY KEY defined, everything in an "Allocated" state (not requiring carving). In SQLite, this implicitly uses a field that is an integer, per the docs I cited above.
A record from a SQLite table with a DATETIME field designated its PRIMARY KEY, everything in an "Allocated" state.
A record from a SQLite table with two or more fields defined as a compound PRIMARY KEY, everything in an "Allocated" state.
A record recovered from a SQLite WAL, in any of situations 1--3.
A record from a Microsoft SQL Server table where a GUID is the PRIMARY KEY.

kchason commented 1 year ago

This could instead refer to the internal record ID contained within the database, but it may not be as readily available from standard database queries or record exports into flat files, so I would still prefer this be defined as the defined primary key of the record.

Some databases recommend integers, some suggest strings, and other recommend GUIDs for the primary key field of a database. As @ajnelson-nist pointed out on a call this morning, this also doesn't account for clustered primary keys consisting of two or more fields. I'd still suggest xsd:string since it allows the most flexibility for the different types of keys that might exist.

ajnelson-nist commented 1 year ago

The OCs decided this morning that recordRowId should be retained in this proposal. It will be a union of string and integer. For the compound PRIMARY KEY case, we will consider that a potential addition to the union. Expanding the union will be a backwards-COMPATIBLE change.

ajnelson-nist commented 1 year ago

The definition of recordRowId will be revised to: The unique ID that identifies a database record, supplied by the originating database engine.