usnistgov / oscal-content

NIST SP 800-53 content and other OSCAL content examples

Renovate SP800-53 production pipeline converting to OSCAL from `docx` source #25

Closed wendellpiez closed 3 years ago

wendellpiez commented 4 years ago

Errors in the production of SP800-53 Revision 5 in OSCAL, in both the final and the earlier (FPD) versions, show that the process needs to be reworked for robustness and maintainability.

A new design will shorten and simplify the initial extraction by performing a generic conversion from the Word document (docx) into HTML, which contains all the necessary information for the OSCAL, and then processing it through a chain of cleanup and enhancement filters. By using the open-source XSweet utility, we can do this end-to-end in XSLT with no other language or application dependencies.
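To illustrate the "chain of filters" idea only: the actual pipeline is XSLT driven by XSweet, and none of the names below come from it. This is a minimal Python sketch in which each filter is a function from document text to document text, and the pipeline applies them in order. The two filters (`strip_empty_paragraphs`, `collapse_spaces`) and the sample input are hypothetical stand-ins for real cleanup steps.

```python
from functools import reduce
from typing import Callable, Iterable

# A filter takes the serialized document and returns a cleaned-up version.
Filter = Callable[[str], str]


def run_pipeline(document: str, filters: Iterable[Filter]) -> str:
    """Apply each filter in order, feeding each output into the next filter."""
    return reduce(lambda doc, step: step(doc), filters, document)


# Hypothetical cleanup filters, for illustration only.
def strip_empty_paragraphs(html: str) -> str:
    return html.replace("<p></p>", "")


def collapse_spaces(html: str) -> str:
    # Replace every run of whitespace with a single space.
    return " ".join(html.split())


result = run_pipeline(
    "<p>AC-1   Policy</p><p></p>",
    [strip_empty_paragraphs, collapse_spaces],
)
print(result)
```

Keeping each step small and single-purpose is what makes such a chain easy to reorder, test, and extend for a later revision of the source document.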

The new pipeline should produce valid and correct OSCAL for the current (final) version of SP800-53 Rev 5, with the same UUIDs as the presently published version (for minimum destabilization). Going forward, it should also be able to produce the same correct outputs for future revisions (with minimal adjustment and assuming consistent formatting in the input data), with UUIDs maintained or refreshed as required.

Issues #16 and #23 can also be addressed in this work, if not already resolved.

Criteria for acceptance:

Producing a valid NVD XML representation of an input catalog -- since we will no longer do this at the beginning -- is not a goal of this Issue, and should be tracked separately. It makes sense to build this as a conversion from the OSCAL produced by this pipeline.

wendellpiez commented 4 years ago

This is nominally complete, in the sense that an XSLT pipeline in production (and committed to an internal repo) appears to produce valid and correct outputs faithfully representing SP800-53 rev 5 as published.

The file produced is now under review internally (see Gitter for share) starting with @david-waltermire-nist and @iMichaela.

wendellpiez commented 4 years ago

Latest on this: a version that replaces "curly quotes" with markup (`<q>` element tagging).
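As a rough illustration of that substitution (the real step is an XSLT filter, so this is a swapped-in technique, not the pipeline's code), here is a Python sketch that wraps each matched pair of curly double quotes in a `q` element. It assumes the input is plain text with no nested quotes; the function name `curly_quotes_to_q` is hypothetical, and unpaired quotes are left untouched.

```python
import re

# A left curly double quote, a run of text containing no curly double quotes,
# then a right curly double quote.
CURLY_PAIR = re.compile("\u201c([^\u201c\u201d]*)\u201d")


def curly_quotes_to_q(text: str) -> str:
    """Wrap each matched pair of curly double quotes in a <q> element."""
    return CURLY_PAIR.sub(r"<q>\1</q>", text)


print(curly_quotes_to_q("The term \u201cleast privilege\u201d applies."))
# prints: The term <q>least privilege</q> applies.
```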

iMichaela commented 4 years ago

10/15/2020

@wendellpiez Here are the current findings of the first-round review. I will continue adding to the list below as I go through the document.

What are the comments with numeric values e.g.

wendellpiez commented 4 years ago

Thanks! These can also be added to a Schematron filter to help enforce consistency in future.

wendellpiez commented 3 years ago

Okay, I made these repairs. Thanks!

A latest version will be attached here.

wendellpiez commented 3 years ago

Zipped for your safety - rev5-oscal-latest-20201016.zip

wendellpiez commented 3 years ago

Outstanding issues for quality check:

Broken links

Three links (given as `a` elements) have no targets. This is formally valid (since it is valid HTML), but it is an error. These links should be eliminated or given targets.

The links are here:

There is no AC-20(6) enhancement, and no item c under SC-18. The links are probably intended to indicate something nearby.
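A check for this kind of problem could look roughly like the Python sketch below. It is an assumption about what "no target" means: it flags `a` elements with a missing or empty `href`, and internal `#...` references that match no `id` or `uuid` in the document. The function name `untargeted_links` and the sample catalog (with made-up `ex-*` identifiers and a simplified control structure) are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def untargeted_links(xml_text: str) -> list:
    """Report <a> elements with no href or with a dangling internal href."""
    root = ET.fromstring(xml_text)
    # Every value that an internal '#...' reference could legitimately point to.
    targets = {
        value
        for el in root.iter()
        for value in (el.get("id"), el.get("uuid"))
        if value
    }
    problems = []
    for el in root.iter():
        if local_name(el.tag) != "a":
            continue
        href = (el.get("href") or "").strip()
        label = "".join(el.itertext()).strip()
        if not href:
            problems.append(f"no href: {label!r}")
        elif href.startswith("#") and href[1:] not in targets:
            problems.append(f"dangling {href}: {label!r}")
    return problems


# Simplified, made-up sample: one good link, one dangling link, one with no href.
SAMPLE = """<catalog xmlns="http://csrc.nist.gov/ns/oscal/1.0">
  <control id="ex-1">
    <p>See <a href="#ex-2">EX-2</a>, <a href="#ex-9">EX-9</a> and <a>EX-3</a>.</p>
  </control>
  <control id="ex-2"/>
</catalog>"""

for problem in untargeted_links(SAMPLE):
    print(problem)
```

External links (anything not starting with `#`) are deliberately not checked here, since verifying them would require network access.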

wendellpiez commented 3 years ago

We have found a couple more issues:

wendellpiez commented 3 years ago

Now that I have made the Schematron, I am glad I did, since there is a newer latest version.

rev5-oscal-latest-20201022.zip

wendellpiez commented 3 years ago

rev5-oscal-latest-20201028.zip

Latest corrections:

Couple of outstanding questions:

wendellpiez commented 3 years ago

@iMichaela @david-waltermire-nist -- looking yet one more time I de-confused myself wrt the brackets-around-links issue, and the brackets are now gone from inline links in the SP800-53 Rev 5 (thank you Michaela). Rev 4 in any case did not have so many inline links.

Also, @brianrufgsa and @tcorsa: since I found that doing this made our whitespace problem worse in a few places (#29), I added new custom serialization logic at the end of the pipeline. So we are no longer using the defaults (and if further adjustments to whitespace are called for, we have more leverage). (This is not actually done yet, but it is better as far as the data is concerned.)

wendellpiez commented 3 years ago

Whitespace is fine, but for some reason we are getting duplicate `rlink` elements in back matter resources.

This error appears to go back a ways (at least a couple of weeks), which would help explain why checks against regression have not caught it.

I will add a Schematron to detect this issue in future, and repair the data.
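The planned check is a Schematron rule; as a stand-in, the same idea can be sketched in Python. This sketch assumes that "duplicate" means two `rlink` children of the same `resource` sharing the same `href` and `media-type`, which may not be exactly the rule that was adopted. The function name `duplicate_rlinks`, the sample UUIDs, and the example.org URLs are made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def duplicate_rlinks(xml_text: str) -> dict:
    """Map each resource uuid to the (href, media-type) pairs it repeats."""
    root = ET.fromstring(xml_text)
    dupes = {}
    for res in root.iter():
        if local(res.tag) != "resource":
            continue
        # Count identical rlink children within this one resource.
        counts = Counter(
            (child.get("href"), child.get("media-type"))
            for child in res
            if local(child.tag) == "rlink"
        )
        repeated = [key for key, n in counts.items() if n > 1]
        if repeated:
            dupes[res.get("uuid")] = repeated
    return dupes


# Made-up back matter: the first resource repeats an rlink, the second does not.
SAMPLE = """<back-matter xmlns="http://csrc.nist.gov/ns/oscal/1.0">
  <resource uuid="00000000-0000-4000-8000-000000000001">
    <rlink href="https://example.org/a"/>
    <rlink href="https://example.org/a"/>
  </resource>
  <resource uuid="00000000-0000-4000-8000-000000000002">
    <rlink href="https://example.org/b"/>
  </resource>
</back-matter>"""

print(duplicate_rlinks(SAMPLE))
```

Running a check like this on every pipeline output, rather than only diffing against the previous output, would catch an error that has already crept into the baseline.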

wendellpiez commented 3 years ago

Update Nov 5 2020

The catalog and profiles in PR #32 look good after most recent improvements but please double check, including diffing with previous versions. Still awaiting word from FISMA team on remaining link issues.

Also from recent discussions it appears we may have work to do on the PRIVACY baseline. (Separate issue?)

So here we are:

wendellpiez commented 3 years ago

Update Dec 3 2020

Presuming that builds work over PR usnistgov/oscal-content#32 (which contains current pipeline output), and that profile resolution produces expected results, this Issue is complete.

The conversion pipelines, including docx->OSCAL catalog and profiles, are being maintained in an internal repository.

Development of this pipeline through the QA process for Rev 5 has suggested more use for Schematron to validate the correctness of pipeline outputs. Suggest a new Issue for this. The repo already has some Schematron; we could use more (to check on things like file reference integrity).

david-waltermire commented 3 years ago

This was addressed with the merge of PR #32.