Problem: Representation of objectives changed between 800-53 Rev 4 and 800-53 Rev 5 breaking parsers

gregelin commented 1 year ago

Describe the bug

The representation of security control assessment objectives in the OSCAL 800-53 catalogs published by NIST on GitHub changed between Rev 4 and Rev 5 and broke existing code for parsing and generating OSCAL catalogs.

The change was multidimensional and significant enough that the parser and generator need to be extensively re-written to support the new format.

Three dimensions of the objective changed between Rev 4 and Rev 5:

The location of the assessment objectives associated with a security control in the OSCAL hierarchy
The pattern used to express the value of the 'id' of the part that represents the objective
The string used as the value of the 'name' of the part that represents the objective

The change in the value of name also appears to be a multidimensional change. Instead of a singular renaming of objectives to assessment-objective, multiple terms were introduced for objectives (e.g., assessment-objective and assessment-method) and the term assessment-objective appears effectively overloaded in that sometimes assessment-objective identifies a grouping of objectives without prose and sometimes assessment-objective identifies an actual objective with prose.

"parts": [
  {
    "id": "at-3_obj.a",
    "name": "assessment-objective",
    "props": [
      {
        "name": "label",
        "value": "AT-03a.",
        "class": "sp800-53a"
      }
    ],
    "parts": [ "#content clipped" ]
  }
]

This means logic must be written, both in the application's storage layer and in the UI, to distinguish between a node that should NOT have prose and a node that should have prose but happens to be empty. Null is no longer sufficient. It is also no longer possible to get a simple list of objectives by searching on name, because there are now multiple classes of objectives.
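A minimal sketch of that distinction in Python. It assumes that a grouping node is an assessment-objective part with no prose key at all, while a real objective carries a prose key that may be empty. The function name classify_objective, the three labels, and the "example_obj.x" id are hypothetical and are not part of OSCAL:

```python
def classify_objective(part):
    """Label an 'assessment-objective' part as 'group', 'leaf', or 'leaf-empty'.

    Assumption (not taken from the OSCAL specification): a grouping node has
    no 'prose' key, while a real objective has one, even when it is blank.
    """
    if part.get("name") != "assessment-objective":
        return None  # not an objective part at all
    if "prose" not in part:
        return "group"  # structural node; prose is not expected here
    if not (part["prose"] or "").strip():
        return "leaf-empty"  # prose is expected but currently blank
    return "leaf"


# Hand-trimmed stand-ins; prose is shortened and the last id is made up.
group = {"id": "at-3_obj.a", "name": "assessment-objective", "parts": []}
leaf = {"id": "at-3_obj.a.1-1", "name": "assessment-objective",
        "prose": "role-based security training is provided ..."}
blank = {"id": "example_obj.x", "name": "assessment-objective", "prose": ""}

for part in (group, leaf, blank):
    print(part["id"], classify_objective(part))
```

The point of returning three labels instead of a boolean is that storage and UI code can treat "no prose expected" and "prose missing" differently without relying on Null.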

Who is the bug affecting

This problem is currently affecting GRC vendors and other tool makers seeking to read/write OSCAL catalogs.

What is affected by this bug

CI/CD, OSCAL Content, Documentation, Modeling, Tooling & API, Website

How do we replicate this issue

Download 800-53 Rev 4 and 800-53 Rev 5 catalog:

Examine the representation of objectives for AT-3 in 800-53 Rev 4 and Rev 5...

800-53 Rev 4 Objective representation

In 800-53 Rev 4, objectives are represented as parts on statements in the following form: 

{
  "id": "at-3.a_obj",
  "name": "objective",
  "props": [
    {
      "name": "label",
      "value": "AT-3(a)"
    }
  ],
  "prose": "provides role-based security training to personnel with assigned security roles and responsibilities before authorizing access to the information system or performing assigned duties;"
}

The id pattern is: {control path identifier <control_id>.<control_part>}_obj

The name pattern is: objective

800-53 Rev 5 Objective representation

In 800-53 Rev 5, both the pattern of the objective identifier and the name of the part changed. In 800-53 Rev 5, objectives are represented as parts on statements in the following form:

{
  "id": "at-3_obj.a.1-1",
  "name": "assessment-objective",
  "props": [
    {
      "name": "label",
      "value": "AT-03a.01[01]",
      "class": "sp800-53a"
    }
  ],
  "prose": "role-based security training is provided to {{ insert: param, at-03_odp.01 }} before authorizing access to the system, information, or performing assigned duties;"
}

The id pattern is: <control_id>_obj.<control_part>

The name pattern is: assessment-objective
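To make the two shapes concrete, here is a rough Python sketch that recognizes both. The regular expressions are inferred only from the two AT-3 examples above, so other controls or catalogs may use shapes they do not cover. The names describe_objective, REV4_STYLE, REV5_STYLE, and OBJECTIVE_NAMES are hypothetical:

```python
import re

# Inferred from the two AT-3 examples only; treat as assumptions.
REV4_STYLE = re.compile(r"^(?P<path>.+)_obj$")                      # at-3.a_obj
REV5_STYLE = re.compile(r"^(?P<control>[^_]+)_obj\.(?P<part>.+)$")  # at-3_obj.a.1-1
OBJECTIVE_NAMES = {"objective", "assessment-objective"}


def describe_objective(part):
    """Return (style, control_path, objective_path), or None for non-objectives."""
    if part.get("name") not in OBJECTIVE_NAMES:
        return None
    part_id = part.get("id", "")
    match = REV5_STYLE.match(part_id)
    if match:
        return ("rev5-style", match.group("control"), match.group("part"))
    match = REV4_STYLE.match(part_id)
    if match:
        return ("rev4-style", match.group("path"), None)
    # Name says "objective" but the id fits neither shape: report, do not guess.
    return ("unknown", part_id, None)


print(describe_objective({"id": "at-3.a_obj", "name": "objective"}))
print(describe_objective({"id": "at-3_obj.a.1-1", "name": "assessment-objective"}))
```

Note that a parser written against only the first pattern has nothing to match in the second: the "_obj" marker moved from the end of the id to the middle, and the name changed at the same time.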

Expected behavior (i.e. solution)

The desired behavior is that a basic parsing script for an OSCAL Catalog and OSCAL release (e.g., 1.0.3) will correctly parse all OSCAL Catalogs. I say desired behavior because OSCAL is still under development and different catalogs may differ significantly in their representation of various catalog concepts defined in OSCAL.

The expected behavior is that a basic parser for an OSCAL Catalog and OSCAL release will, with minor modifications, correctly parse all catalogs produced from the same source across all versions of the Catalog.

It was not surprising that changes would exist between Rev 4 and Rev 5 in the same release of OSCAL. It was surprising to find so much change within the representation of a single type of content.

We expected the objective "id" value pattern to be consistent between Rev 4 and Rev 5, as there seems little reason to change the identifier pattern.
We expected the name of the part (e.g., "objective") to remain consistent between Rev 4 and Rev 5.
We expected the location of the objectives in the hierarchy to be the same (though we were less surprised that this moved).

Our team expected at most one meaningful parser-detectable attribute to change between versions. We did not expect all meaningful parser-detectable attributes (identifier, part name, and location) to change simultaneously.

After noticing the changes in id format and name, we expected just a different name of the additional of multiple

Other comments

This issue focuses on the changing representation of objectives. But we discovered this problem after our parsers first broke on the multidimensional changes to the organization-defined parameters between Rev 4 and Rev 5. That made two unexpected changes that are (1) breaking changes, in that they broke our working code, and (2) changes requiring extensive human intervention to resolve correctly.

This means multiple representations of content that we reasonably expected to be standardized are in fact changing even when issued from the same content provider.

We are discovering similar multidimensional differences between NIST OSCAL content and FedRAMP OSCAL content.

aj-stein-nist commented 1 year ago

Thanks for your report. There is a lot of good detail in here to consider, but it will take some time to analyze and bring into sprint, in that order. I am tentatively adding this for Sprint 65 (not this sprint, but the following one, for the second half of March; we will start moving to a bi-weekly sprint soon, heads up and expect more communication on this soon enough).

gregelin commented 1 year ago

@aj-stein-nist Glad the detail was useful. It makes sense to spend some time analyzing this issue. Multidimensional changes in produced data significantly raise the required sophistication of the parsers and/or raise the cost of parsing. And I worry that limits the parties willing to write parsers and slows adoption.

aj-stein-nist commented 1 year ago

> @aj-stein-nist Glad the detail was useful. It makes sense to spend some time analyzing this issue. Multidimensional changes in produced data significantly raise the required sophistication of the parsers and/or raise the cost of parsing. And I worry that limits the parties willing to write parsers and slows adoption.

Will you be able to discuss the design of your parser given the upcoming conversation of this work?

Additionally, and separate from this work item, we had discussed the possibility of pairing and looking together at the NIST SP 800-53 Revision 4 and 53/53A Revision 5 catalogs to address a different, though related, set of your concerns (not in this issue). Can we discuss that via Gitter and come up with a game plan before this work? It seems important that we understand some of your challenges, and that is going to require some deeper, higher-bandwidth conversations while looking at the models. Let me know, thanks.

GaryGapinski commented 1 year ago

I suspect NIST IR 8011 is related.

If one

is interested in security assessments (an atypical interest enjoyed only by the cognoscenti)
is fascinated with the prospect of automating such (à la James Watt)
finds OSCAL sufficiently interesting to consider using it as a sustaining component of the grand scheme
has witnessed https://github.com/usnistgov/oscal-content and thought it might be a handy fuel for the engine
is keenly focused on assessment objective achievement as the raison d'être for assessment bliss

then one might regard the authors of NISTIR 8011 as kindred souls.

gregelin commented 1 year ago

@aj-stein-nist Thanks for adding this issue to sprint 65. I can discuss aspects of our OSCAL parser; and since I've now seen and/or written multiple parsers for OSCAL and Open-Control I think I can share some thoughts on parsing practices and strategies and how well each handles multi-dimensional changes.

Let's consider a BasicParser for OSCAL Catalogs...

BasicParser is built following agile principles: the "simplest solution that will work" to create an MVP, then improvement through iteration. At the time the BasicParser MVP and its first few iterations are being developed, pretty much a single sample catalog, NIST 800-53 Rev 4, is available in OSCAL to develop against. NIST is not yet publishing example catalogs for multiple frameworks, such as GDPR, CMMC, PCI, and ISO 27001, to run the parser against.

BasicParser is written by Chris (a persona). Chris is 90% likely to be either a Compliance SME who can code, or a developer who, having done a couple of ATOs, never wants to write an SSP again. Chris has moderate to pretty good coding skills, works in the web application space, and has crawled and/or parsed a variety of CSV files, serialized content (JSON, YAML, XML) and semi-structured content using regex and parsing libraries. There's a 10% chance that Chris has a CompSci PhD and codes in C, and a less than 1% chance that Chris routinely writes interpreters or XSL processors. If BasicParser is written by the rare Chris with a CompSci PhD, there's a 99.9% chance that Chris knows little to nothing about ATOs, Security Controls, and the 800-53 and is working with a Compliance SME.

Embracing agile, Chris gets the sample data set of the 800-53 Rev 4 catalog in OSCAL and searches for a package someone else has written to parse OSCAL, to see if it's done. Not finding any (at the time), Chris looks for a standard library to consume the JSON (or YAML or XML), or tries some regex or simple XSLT. The goal: the simplest solution that can work for an MVP.

The really simple parsing strategy Chris first tries, based on regex alone or on a JSON or XML reader plus regex, shows promise for things like pulling out the controls. The controls, after all, are the meat of the content. Then Chris tries to reconstitute the text strings in the Word version of 800-53 Rev 4 and notices the recursion of the control prose. The text is not only split up, it's recursive. And there seem to be other things hanging off that recursion, too. This is the first complication that necessitates changing the parsing strategy of BasicParser, even to get to MVP.

Chris digs in deeper, going back and forth between the NIST OSCAL documentation still under development and the reference catalog of 800-53 Rev 4. How consistent is the structure and the recursion? Some patterns in the recursion begin to make sense. Through a mix of nested if-then statements and one to three recursive functions, Chris has made BasicParser MVP!

BasicParser MVP doesn't do much with the UUIDs or the props because they don't seem to have much impact on the extraction of the catalog's controls and parameters. During iterations, BasicParser gets better at handling props to help sort controls. (It won't be until later, when Chris is enhancing BasicParser to parse an SSP, that UUIDs and props reveal themselves in all their glory as the second and third complications, and Chris begins rethinking life choices.)

As Chris iterates BasicParser, improvements are made. The schema starts to be used to validate content as Chris starts generating a few catalogs in OSCAL. Chris's generated hierarchy and recursion follow the one known example.

BasicParser, built in an agile and iterative fashion on top of a tiny sample set, encounters no exceptions to a variety of assumptions about the structure of an OSCAL catalog that seem perfectly reasonable based on both the sample data and the official documentation. For example, in Rev 4, all objectives are just a type of part. Every objective part has a prose key, and the suffix of the objective id is consistent with a simple hierarchy. BasicParser can recurse through the parts and easily identify that a part is an objective via a regex match on the part.id or part.name.
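A rough Python sketch of that kind of recursion, not taken from any real parser. The function names iter_parts and find_objectives are hypothetical, and the two small dictionaries are hand-built stand-ins whose nesting is assumed for illustration (only the leaf ids and names come from the AT-3 examples in this issue):

```python
def iter_parts(node):
    """Yield every part nested under a control or part, depth first."""
    for part in node.get("parts", []):
        yield part
        yield from iter_parts(part)


def find_objectives(control, names=("objective",)):
    """Collect parts whose name is in `names` and that carry a prose key."""
    return [p for p in iter_parts(control)
            if p.get("name") in names and "prose" in p]


# Assumed nesting; prose is shortened.
rev4_like = {"id": "at-3", "parts": [
    {"id": "at-3_smt", "name": "statement", "parts": [
        {"id": "at-3.a_obj", "name": "objective",
         "prose": "provides role-based security training ..."}]}]}
rev5_like = {"id": "at-3", "parts": [
    {"id": "at-3_obj", "name": "assessment-objective", "parts": [
        {"id": "at-3_obj.a", "name": "assessment-objective", "parts": [
            {"id": "at-3_obj.a.1-1", "name": "assessment-objective",
             "prose": "role-based security training is provided ..."}]}]}]}

print(len(find_objectives(rev4_like)))  # 1: the Rev 4 assumptions hold
print(len(find_objectives(rev5_like)))  # 0: same call, nothing found, no error
print(len(find_objectives(rev5_like, names=("objective", "assessment-objective"))))  # 1
```

The middle call is the troublesome case: nothing raises, the parser simply returns an empty list, so the breakage is only visible when someone notices the objectives are missing.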

Chris's colleagues are impressed! BasicParser can extract controls and parameters, and objectives and links and metadata from an OSCAL catalog. No more custom, fragile regex used to separate compound text strings inside of spreadsheet cells! No more changing the parser for every organization or vendor spreadsheet! This serialized, standardized OSCAL catalog is clearly better. Once other catalogs are expressed in OSCAL, it will be possible to consume the information with BasicParser!

But alas, BasicParser is making assumptions that there are patterns to identifiers, assumptions that the recursion is consistent, and assumptions that nodes are always located in the same place in the hierarchy. BasicParser assumes all swans are white because Chris has only really seen one swan...

gregelin commented 1 year ago

@GaryGapinski You've made me a fan of NIST IR 8011! Thanks!

aj-stein-nist commented 1 year ago

> Let's consider a BasicParser for OSCAL Catalogs...

OK sounds good.

> BasicParser is written by Chris (a persona). Chris is 90% likely to be either a Compliance SME who can code, or a developer who, having done a couple of ATOs, never wants to write an SSP again. Chris has moderate to pretty good coding skills, works in the web application space, and has crawled and/or parsed a variety of CSV files, serialized content (JSON, YAML, XML) and semi-structured content using regex and parsing libraries. There's a 10% chance that Chris has a CompSci PhD and codes in C, and a less than 1% chance that Chris routinely writes interpreters or XSL processors. If BasicParser is written by the rare Chris with a CompSci PhD, there's a 99.9% chance that Chris knows little to nothing about ATOs, Security Controls, and the 800-53 and is working with a Compliance SME.

Thanks for this level-setting, it helps set a good frame of mind for the rest (I have read it once quickly, once slowly by now).

> Embracing agile, Chris gets the sample data set of the 800-53 Rev 4 catalog in OSCAL and searches for a package someone else has written to parse OSCAL, to see if it's done. Not finding any (at the time), Chris looks for a standard library to consume the JSON (or YAML or XML), or tries some regex or simple XSLT. The goal: the simplest solution that can work for an MVP.

Also good context. To be clear, this means starting from scratch and processing only the resulting OSCAL JSON (or YAML or XML, but primarily the JSON form) and nothing else, correct?

> The really simple parsing strategy Chris first tries, based on regex alone or on a JSON or XML reader plus regex, shows promise for things like pulling out the controls. The controls, after all, are the meat of the content. Then Chris tries to reconstitute the text strings in the Word version of 800-53 Rev 4 and notices the recursion of the control prose. The text is not only split up, it's recursive. And there seem to be other things hanging off that recursion, too. This is the first complication that necessitates changing the parsing strategy of BasicParser, even to get to MVP.

Can you further explain "recursion of the control prose" with a little more detail to make sure we best understand the issue here?

> Chris digs in deeper, going back and forth between the NIST OSCAL documentation still under development and the reference catalog of 800-53 Rev 4. How consistent is the structure and the recursion? Some patterns in the recursion begin to make sense. Through a mix of nested if-then statements and one to three recursive functions, Chris has made BasicParser MVP!

I guess this is good news, but is the implication that the structure and recursion are consistent in some parts but not in others? I think we would benefit from some more detail here and there.

> BasicParser MVP doesn't do much with the UUIDs or the props because they don't seem to have much impact on the extraction of the catalog's controls and parameters. During iterations, BasicParser gets better at handling props to help sort controls. (It won't be until later, when Chris is enhancing BasicParser to parse an SSP, that UUIDs and props reveal themselves in all their glory as the second and third complications, and Chris begins rethinking life choices.)

OK this is great, thank you for this example of detail, this is the kind of thing I want to focus on with a more detailed developer pairing later, if you do not mind.

> As Chris iterates BasicParser, improvements are made. The schema starts to be used to validate content as Chris starts generating a few catalogs in OSCAL. Chris's generated hierarchy and recursion follow the one known example.

Excellent progression!

> BasicParser, built in an agile and iterative fashion on top of a tiny sample set, encounters no exceptions to a variety of assumptions about the structure of an OSCAL catalog that seem perfectly reasonable based on both the sample data and the official documentation. For example, in Rev 4, all objectives are just a type of part. Every objective part has a prose key, and the suffix of the objective id is consistent with a simple hierarchy. BasicParser can recurse through the parts and easily identify that a part is an objective via a regex match on the part.id or part.name.

Thanks, this is a good lead-in to the kind of detail I was looking for.

> Chris's colleagues are impressed! BasicParser can extract controls and parameters, and objectives and links and metadata from an OSCAL catalog. No more custom, fragile regex used to separate compound text strings inside of spreadsheet cells! No more changing the parser for every organization or vendor spreadsheet! This serialized, standardized OSCAL catalog is clearly better. Once other catalogs are expressed in OSCAL, it will be possible to consume the information with BasicParser!

> But alas, BasicParser is making assumptions that there are patterns to identifiers, assumptions that the recursion is consistent, and assumptions that nodes are always located in the same place in the hierarchy. BasicParser assumes all swans are white because Chris has only really seen one swan...

OK, so this is a wonderful start, but when and how can we talk about specific consistencies in structure and recursion, or lack thereof, for this notional parser? I asked some questions about those details, as opposed to comments, in between the topics of interest above. We would appreciate understanding the specific issues with this notional parser approach (if not an actual parser), because we need to figure out: 1) what the key differences between Revision 4 and Revision 5 are, and 2) if they are significant beyond additional props (my assumption from prior analysis), how they break the parser. Do they cause exceptions or error behavior that prevents parsing any (not just some) of a catalog? Or, until I get a specific explanation, do they just make parsing more complex, so that a parse continues but key information is missing because some of these relationships have changed in some significant way?

Does that make sense? If we prioritize this for this upcoming sprint starting on Thursday, we will still need some key questions answered in the first few days, or I will need to push the work on this until we are on firmer ground. I hope that makes sense. (We can keep it agile for both parties.)

gregelin commented 1 year ago

A.J.,

Thanks for your detailed comments. I’ll endeavor to quickly send a follow-up email (or GitHub post) with a more detailed response.

I’m happy to discuss with NIST team privately very detailed information where the notional (and real) parser breaks. I wouldn’t want to commit to paper too much detail about the real parser. Fortunately, GovReady’s code is open source so we can look at excerpts of how that code dealt with specific recursion and hierarchy. And I could do a recorded session on that real parser that could be public.

I can give a quick answer to the question about what changed between Rev 4 and Rev 5 beyond additional props. Every identifying aspect of an objective changed between Rev 4 and Rev 5: identifier pattern, part name, and position in hierarchy. I can’t see how even a tokenizing parser would have been able to recognize and import an objective across Rev 4 and Rev 5 without an explicit change. See https://github.com/usnistgov/oscal-content/issues/194.

Additionally, two different patterns were introduced for parameters. Frankly, I thought (more accurate to say, assumed and hoped) that the OSCAL standard would require all catalogs to use a standard format for org-defined params, e.g., “control-id_prm_linearindex” [ ac-2_prm_1 ]. The idea of a single approach to parameters, even if indexed within controls instead of across the entire catalog, was one of my favorite benefits of OSCAL. I thought that format was part of the standard. Admittedly, I never looked or asked to see if the pattern was in the specification. I started writing GovReady’s parser before OSCAL Rev 5 came out, so standardization of the parameter identifier was an easy assumption to make. And it never occurred to me that it was within the specification for the same catalog to use multiple patterns. No UUID was assigned to parameters, so that reinforced my assumption: no need for a parameter UUID for persistence because the id would have persistence once assigned. (The cost of that change has been extraordinarily high, and continues to increase.)
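As an illustration of what a parser now has to handle, here is a small Python sketch that pulls parameter references out of prose and labels their identifier shape. It is based only on the two identifiers that appear in this thread (ac-2_prm_1 and at-03_odp.01) and on the insertion syntax in the Rev 5 prose quoted earlier. The names param_refs, PARAM_REF, PRM_STYLE, and ODP_STYLE are hypothetical, and other identifier shapes may exist:

```python
import re

# Insertion syntax as seen in the Rev 5 prose quoted in this issue:
# {{ insert: param, at-03_odp.01 }}
PARAM_REF = re.compile(r"\{\{\s*insert:\s*param,\s*(?P<id>[^\s}]+)\s*\}\}")

# Two identifier shapes taken from this thread; treat both as assumptions.
PRM_STYLE = re.compile(r"^(?P<control>.+)_prm_(?P<index>\d+)$")   # ac-2_prm_1
ODP_STYLE = re.compile(r"^(?P<control>.+)_odp\.(?P<index>\d+)$")  # at-03_odp.01


def param_refs(prose):
    """Return (param_id, style) for each parameter insertion found in prose."""
    refs = []
    for match in PARAM_REF.finditer(prose):
        param_id = match.group("id")
        if PRM_STYLE.match(param_id):
            style = "prm"
        elif ODP_STYLE.match(param_id):
            style = "odp"
        else:
            style = "unknown"  # surface it instead of silently dropping it
        refs.append((param_id, style))
    return refs


prose = ("role-based security training is provided to "
         "{{ insert: param, at-03_odp.01 }} before authorizing access")
print(param_refs(prose))
```

Returning "unknown" rather than skipping unmatched ids is a deliberate choice in this sketch: it lets a tool report a third pattern when one shows up instead of losing the parameter.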

Greg Elin Principal OSCAL Engineer @.*** | m: 917-304-3488

From: A.J. Stein @.> Date: Tuesday, March 14, 2023 at 8:20 PM To: usnistgov/OSCAL @.> Cc: Greg Elin @.>, Author @.> Subject: Re: [usnistgov/OSCAL] Problem: Representation of objectives changed between 800-53 Rev 4 and 800-53 Rev 5 breaking parsers (Issue usnistgov/oscal-content#194)