Replace ID with UUID and make UUID Required Everyplace it is Defined

brian-comply0 commented 7 months ago

User Story

As an OSCAL tool developer I want a cleaner, more consistent and more predictable way to deal with unique identifiers within OSCAL content so that I can simplify software development requirements.

OSCAL is intended to be machine readable. The retention of id flags and values is for humans working with raw OSCAL, and adds unnecessary complication to machine processing of OSCAL. These id values are never (and should never be) exposed to end-users of OSCAL tools.

Background and Additional Details

Before the 1.0.0 release, a decision was made to use UUID in many places in the OSCAL syntax and to require it in most of those places. The use of UUID has proven to be a significant benefit to software development where it is implemented and required, especially when dealing with discrete portions of OSCAL content, as normally happens when a web application is designed with an n-tier architecture.

At that time a decision was also made to continue using id in places where a canonical reference may be desirable. This has proven to represent a challenge for developers, sometimes resulting in kludgy work-arounds. Similarly, when dealing with assemblies where UUID is optional, it has proven to be a challenge for n-tier architected software to handle the use cases where UUID is only intermittently present.

Software Development Challenges:

As there is no single standard for the actual values of @id, each OSCAL tool developer is forced to create their own ID value creation standard when working with catalog control content or metadata role content.
The standard guidance offered by NIST for the above situation is that tool developers can just generate UUID values for the ID flags. The problem with this is that not all UUID values are NCName compliant: NCName requires the first character to be a letter or underscore, and UUIDs can start with a letter or a number. This forces the need to prefix the identifier, creating a kludgy ID value that is friendly to neither humans nor machines. A true UUID value in these cases would allow a more straightforward approach, especially for control parts.
Most modern web applications avoid loading a whole document into browser memory and only transmit the portion of OSCAL content being viewed or edited between front and back-end servers. Indeed, with sensitive security content and "need to know" principles, tools are sometimes required to only load a portion of OSCAL content into browser memory.
- ID and UUID flags are well-placed in the OSCAL syntax to support this more discrete exchange of information; however, when an ID/UUID is optional and omitted by content author or other tool it becomes very challenging to use this well-established approach to managing information as discrete pieces.
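To make the NCName problem concrete, here is a minimal sketch in Python. The `is_ncname` and `uuid_to_id` helpers and the `id-` prefix are hypothetical names chosen for illustration, and the pattern is an ASCII-only approximation of the NCName rule, not the full XML definition.

```python
import re
import uuid

# ASCII-only approximation of the XML NCName rule: a letter or underscore,
# then letters, digits, '.', '-' or '_'. The full rule also admits many
# non-ASCII characters, which UUID strings never contain.
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ncname(value: str) -> bool:
    """Return True if the value matches the ASCII NCName approximation."""
    return _NCNAME_ASCII.fullmatch(value) is not None


def uuid_to_id(value: uuid.UUID, prefix: str = "id-") -> str:
    """Prefix a UUID so it can be used where an NCName is required.

    The prefix is an assumption; always prefixing keeps the ID shape uniform.
    """
    return f"{prefix}{value}"


digit_first = uuid.UUID("1b4e28ba-2fa1-11d2-883f-0016d3cca427")
print(is_ncname(str(digit_first)))         # False: starts with a digit
print(is_ncname(uuid_to_id(digit_first)))  # True: starts with a letter
```

The prefixed value is valid, but it is no longer a plain UUID, which is the kludge described above.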

If ID flags were replaced with UUID flags and all UUID flags were required, OSCAL content would be far more normalized in terms of a tool developer's ability to reference discrete content.

Goals

Replace all @id flags with @uuid flags
Establish a canonical-identifier property which can be used by implementations (if any) that specifically care about using a canonical identifier
Require @uuid everyplace it is implemented

Dependencies

No response

Acceptance Criteria

[ ] All OSCAL website and readme documentation affected by the changes in this issue have been updated. Changes to the OSCAL website can be made in the docs/content directory of your branch.
[ ] A Pull Request (PR) is submitted that fully addresses the goals of this User Story. This issue is referenced in the PR.
[ ] The CI-CD build process runs without any reported errors on the PR. This can be confirmed by reviewing that all checks have passed in the PR.

(For reviewers: The wiki has guidance on code review and overall issue review for completeness.)

Revisions

No response

iMichaela commented 7 months ago

@brian-comply0 - If I understand your request correctly, such a change will NOT be backwards compatible and can only be made in a major version. If a human-oriented id is the major complaint here, a uuid that starts with a letter meets the id type (NCName) requirements, so a developer could wrap a uuid generator function so that it discards values starting with a number and retains those that start with a letter. With that said, changes to a control in a catalog would then have to follow the uuid guidance: the uuid would be updated any time the control statement is updated, and preserved when the control statement does not change. Today the ids do not follow such a rule, for traceability reasons (please recall the zero-padded IDs issue). Retaining the previous uuids for unchanged controls would have to be handled by the tools that update the OSCAL catalogs. Do you have a vision for an automatic process that can be implemented with ease and provide robust, consistent output? Using hashes for each statement comes to mind. But hashes will change if a simple comma or a space is added, or a new line, or if the parameter's uuid changes for reasons outside the control statement scope. Such a change should not trigger a new uuid for the control statement (per current guidance which, BTW, is also a controversial issue among OSCAL adopters). While your request is not unreasonable, it requires thorough research and community feedback before it can be considered, even for a major version. My major concern is the operational impact such a change would have (catalog/profile generation, and their use in implementations and assessments).
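The "discard values starting with a number" idea can be sketched as a small rejection loop. This is only an illustration; `letter_first_uuid4` is a hypothetical helper name, not an existing OSCAL or NIST utility.

```python
import uuid


def letter_first_uuid4() -> uuid.UUID:
    """Draw random (version 4) UUIDs until one starts with a hex letter a-f."""
    while True:
        candidate = uuid.uuid4()
        # str(UUID) is lowercase hex, so the first character is 0-9 or a-f.
        if str(candidate)[0] in "abcdef":
            return candidate


value = letter_first_uuid4()
print(value)  # a version 4 UUID whose first character is a letter
```

Six of the sixteen possible leading hex digits are letters, so the loop needs fewer than three draws on average. The trade-off is that the accepted values carry slightly less randomness than an unconstrained version 4 UUID.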

wendellpiez commented 6 months ago

A favorable read suggests:

you feel it would be an improvement to spell out 'canonical-identifier' (to discourage overloading?) instead of 'id' - though they are functionally equivalent otherwise (wrt e.g. rules regarding scope of distinctiveness, etc. etc.)
you think that uuid is so far superior for 'daily wear' that you want not only to allow it everywhere, but to require it and see it required in the standard, rather than leave this as a local rule or application rule

Either of these might be considered separately on its merits.

One problem with permitting (and requiring) UUID everywhere is the various ambiguities regarding link targets. I.e. what does the URI syntax # for fragment identifier mean, in OSCAL? Hitherto it has meant "whichever of 'id' or 'uuid' is given". We haven't had to deal with question of what if both are given for an object, or rules regarding their clashing, etc. etc. (Given the current rules it is hard albeit not impossible to get them to clash.)
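As an illustration of the ambiguity, the following Python sketch resolves `#fragment` targets against "whichever of 'id' or 'uuid' is given" and reports a clash when two different objects claim the same value. It assumes OSCAL content already parsed into nested dicts and lists (as from JSON); `build_fragment_index` is a hypothetical helper, and how a clash should be handled is exactly the open question.

```python
from typing import Any, Dict, Optional


def build_fragment_index(
    node: Any, index: Optional[Dict[str, dict]] = None
) -> Dict[str, dict]:
    """Map every 'id' and 'uuid' value in the tree to the object carrying it."""
    if index is None:
        index = {}
    if isinstance(node, dict):
        for key in ("id", "uuid"):
            value = node.get(key)
            if isinstance(value, str):
                existing = index.get(value)
                # The same value on two different objects makes '#value' ambiguous.
                if existing is not None and existing is not node:
                    raise ValueError(f"fragment '#{value}' is ambiguous")
                index[value] = node
        for child in node.values():
            build_fragment_index(child, index)
    elif isinstance(node, list):
        for child in node:
            build_fragment_index(child, index)
    return index


doc = {"controls": [{"id": "ac-1", "uuid": "a3f1c2d4-0000-4000-8000-000000000001"}]}
index = build_fragment_index(doc)
print(index["ac-1"] is index["a3f1c2d4-0000-4000-8000-000000000001"])  # True
```

With both flags present on one object, either value reaches it. The sketch raises an error on a clash, but a specification would have to say whether that is an error, a precedence rule, or something else.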

If developers really wanted to have either/both (by whatever name) I suppose the above question could be dealt with - but we are again playing whack-a-mole with the complexity.

100% agreed with these not being values in the data to show to end users, at least ordinarily. However their 'wetware processing quotient' is still a factor, under various kinds of scenarios (including debugging scenarios not only ordinary workflow).

And many of your complaints could also be addressed by other means, such as a recommendation for a nice portable "how to make a sensible ID rule" that organizations could (re) use. It might well entail the notion of identifying the "canonical identifier" as you call it, at least to start with.

(My own recommendation has two steps: 1. identify the Canonical ID you already use, then 2. cast it into NCName form, all lower-case if possible. If you have no canonical ID form, one can be created, but most do.)
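A rough sketch of step 2, in Python, might look like the following. The specific choices (replace disallowed runs with a dot, add an underscore when the first character is not a letter) are assumptions made for illustration, not a published rule, and the character set is an ASCII-only subset of NCName.

```python
import re


def to_ncname(canonical_id: str) -> str:
    """Cast a canonical identifier into a lower-case, ASCII NCName-style token."""
    lowered = canonical_id.strip().lower()
    # Replace each run of characters outside [a-z0-9._-] with a single dot,
    # then trim dots left dangling at either end.
    token = re.sub(r"[^a-z0-9._-]+", ".", lowered).strip(".")
    if not token:
        raise ValueError("identifier has no usable characters")
    # An NCName may not start with a digit, '.' or '-'; prefix an underscore.
    if not (token[0].isalpha() or token[0] == "_"):
        token = "_" + token
    return token


print(to_ncname("AC-2(1)"))  # ac-2.1
print(to_ncname("3.1.1"))    # _3.1.1
```

Different canonical forms can collapse to the same token under a rule like this, so an organization adopting one would still need to check for uniqueness within its own ID space.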

The NCName requirement is alas an old old legacy, just hard to give up (many system-level XML functions rely on the syntax in its current form) - but you haven't actually suggested giving up on that. It's not all that difficult to follow once we decide it's worth the small cost - and to make a feature out of it.

Do you feel current problems / hesitations would be adequately addressed by more detailed guidance on how to form or fashion IDs?

How about the caution that (as many people forget) since these documents may be passed between systems, in order to support robust addressing we are always going to need a document's ID along with the link target's ID? (Unless the use case says otherwise?) Since the same target in a different document is not the same - even though it must say it is, we can tell the difference (because same ID or UUID, but different document). We try to alleviate this notionally in OSCAL but we have barely started (with some idea of a "document/import space" for addressing, bigger than a document but smaller than the universe). Work in this area could help provide illustrations.

If above proposals don't cut it, what is the minimum that might?

Telos-sa commented 6 months ago

I agree with @iMichaela in that the structure requirements need to be agnostic and support the various use cases that will crop up, including inconsistent presence of elements. While some use cases will want human-readable IDs to vet that their requirements are being met (role-id), there are instances where UUID is critical for linking resources back to their source elements (leveraged-authorization and by-components). Leveraging props and namespaces allows for these unique scenarios, which you can then reference back to.

Instead, what I would love to see is a unified approach to generating the UUIDs to make it easier to validate the contents and reduce time to ATO.
Telos's method leverages bottom-up data analysis to generate a hash of the content that becomes the source of the UUID. If the data is the same, then the UUID is the same across versions. If the data changes, then the UUID changes.
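The general idea of a content-derived UUID can be sketched with Python's standard name-based (version 5) UUIDs. This is not Telos's actual implementation, which is not described here; the namespace value and the `content_uuid` name are assumptions made for illustration.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as every tool that
# needs to reproduce the values agrees on it.
CONTENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/oscal-content")


def content_uuid(text: str) -> uuid.UUID:
    """Derive a deterministic UUID from the content itself (SHA-1 based, version 5)."""
    return uuid.uuid5(CONTENT_NAMESPACE, text)


first = content_uuid("The organization reviews accounts annually.")
again = content_uuid("The organization reviews accounts annually.")
edited = content_uuid("The organization reviews accounts quarterly.")

print(first == again)   # True: same data, same UUID
print(first == edited)  # False: changed data, changed UUID
```

Because the value is a pure function of the input, two tools only agree when they feed in byte-for-byte identical text and the same namespace.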

This will be vital as we move towards cross version review, where UUIDs can be compared for changes, and changes validated.

iMichaela commented 6 months ago

Instead, what I would love to see is a unified approach to generating the UUIDs to make it easier to validate the contents and reduce time to ATO.

I agree with you, Lacy (@Telos-sa). Consistency is very important. Automating the process is important.

Telos's method is using leveraging bottom up data analysis to generate a hash of the content that becomes the source of the UUID. If the data is the same, then the UUID is the same across versions. If the data changes, then the UUID changes.

The output of a hash function will be different if an extra space is inserted, or if you use different operating systems to calculate it, or if a simple, insignificant punctuation mark is added. Maybe you could strip all spaces and punctuation before generating the hash? And even in this case, a corrected typo will result in a new hash, but it should not trigger a uuid change. A process that is not well thought through might allow a diabolic mind to trigger a lot of unnecessary assessment work and overwhelm the system so much that it can induce a denial of service. All that is needed is a kiddie script that adds and removes a space in a document that is a core document for the automation process... Can NLP do a better analysis and determine when a significant change happened?

Telos-sa commented 6 months ago

Yes, I believe so. Being able to strip out spaces, punctuation, and HTML, and look at pure content, should be adequate. Brad, what do you think about incorporating NLP in a future iteration, to compare content before generating a UUID? The first iteration would be a standard strip, but for subsequent changes this could be a good methodology.
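A "standard strip" along these lines might be sketched as follows. The `normalize` and `normalized_uuid` names, the namespace, and the exact stripping rules are assumptions for illustration; HTML entities and non-ASCII punctuation are deliberately not handled.

```python
import re
import string
import uuid

# Hypothetical namespace shared by every tool that must reproduce the values.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/oscal-content")

_TAG = re.compile(r"<[^>]+>")                        # naive HTML tag matcher
_PUNCT = str.maketrans("", "", string.punctuation)   # ASCII punctuation only


def normalize(text: str) -> str:
    """Strip HTML tags, ASCII punctuation and all whitespace."""
    without_tags = _TAG.sub(" ", text)
    without_punct = without_tags.translate(_PUNCT)
    return "".join(without_punct.split())


def normalized_uuid(text: str) -> uuid.UUID:
    """Derive a version 5 UUID from the normalized content."""
    return uuid.uuid5(NAMESPACE, normalize(text))


a = normalized_uuid("<p>Access is  controlled.</p>")
b = normalized_uuid("Access is controlled")
c = normalized_uuid("Acess is controlled")  # same sentence with a typo

print(a == b)  # True: markup, spacing and punctuation are ignored
print(a == c)  # False: a one-letter typo still changes the UUID
```

The last comparison shows the limit raised earlier in the thread: stripping handles formatting noise, but any change to the letters themselves, including a typo fix, still yields a new value.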

Stephanie Lacy | Senior Solutions Architect

wendellpiez commented 6 months ago

If the production of a UUID from given spans of content is expected to be deterministic and reproducible, I'd recommend thinking early about formal definition, specification and conformance testing.

In other words, it needs to be possible to say which of two implementations that give different UUIDs for 'the same' content (as defined) is the correct one, whenever they differ. How do we know which is right?
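To illustrate how two reasonable implementations can disagree, here is a small Python sketch that compares two hypothetical content-to-UUID functions over the same samples. Both functions, the namespace and the samples are invented for illustration; without a specification, neither can be called the correct one.

```python
import uuid
from typing import Callable, Iterable, List

# Hypothetical shared namespace for name-based (version 5) UUIDs.
NS = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/oscal-content")


def impl_raw(text: str) -> uuid.UUID:
    """Implementation A: hash the text exactly as given."""
    return uuid.uuid5(NS, text)


def impl_collapsed(text: str) -> uuid.UUID:
    """Implementation B: collapse runs of whitespace before hashing."""
    return uuid.uuid5(NS, " ".join(text.split()))


def disagreements(
    a: Callable[[str], uuid.UUID],
    b: Callable[[str], uuid.UUID],
    samples: Iterable[str],
) -> List[str]:
    """Return the samples for which the two implementations differ."""
    return [s for s in samples if a(s) != b(s)]


samples = ["N/A", "Access is controlled.", "Access  is controlled."]
print(disagreements(impl_raw, impl_collapsed, samples))
# ['Access  is controlled.']
```

A conformance suite would replace the second implementation with published input/output pairs, so that any implementation can be checked against the definition and not against a peer.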

In passing, I note that if mandated, this mechanism effectively changes the semantics of UUIDs from 'identifiers' (that is, 'tags' - information added by some person or process) to 'comparands', that is a basis for comparison, but not actually more information, because you can always (knowing the rules) derive a UUID again from its content. (This is assuming that everyone is fine with all elements whose text comes out 'N/A' having the same UUID, which seems backward to me.)

In fact you had better know those rules and/or have a trusted implementation in hand, if you want to validate that UUIDs are aligned with the text as expected.

If that makes you squirm (it might, and not only for reasons @imichaela suggests), reflect that what this means is not that certain validations and assurances do not have to be done: it just moves where they are done, by whom and how. (Introducing a 'trust vector' in doing so.)

This makes me think that specifying, testing and hardening such an algorithm in public could be a very good thing, while mandating the use of such an algorithm without the testing and hardening would be a very bad thing. (Fortunately that hasn't been suggested.)

Meanwhile if organizations wish to add normalized hashes or other enhancements to their data to enable optimizations (even if only 'decorations' from an information-theoretic point of view), they certainly can do that.

For a more 'standard' approach, there has to be a reference in the form not only of formal definitions but also of test suites - or (I fear) the standard will be toothless - while nonetheless introducing friction, costs, complicating factors and security risk as @imichaela describes (worst-case).

While pondering that, another question to ask is what problem this is solving. Are there other approaches to consider as well? Not only standards but best practices and reusable solutions? Tools?

And indeed, 'deepening' the semantics of UUIDs wasn't originally proposed here except by implication - only requiring them everywhere. (Which in itself is bike-shedding, outside a 2.0 working setting.)

usnistgov / OSCAL