analyze past SSPs

What do we want to know from past SSPs? Running list:

[ ] Basic counting
- [x] Overall number of words
- [x] Number of controls
- [x] Size/distribution of controls in words
[x] Most commonly-selected controls
[ ] Examples of controls across systems (for qualitative comparison)
[ ] Duplication
- [ ] How many exact duplicate phrases appear within an SSP?
- [ ] What language is copied between SSPs? (inspiration: model legislation report)
- [ ] What sentences are similar but not exactly the same? h/t @gregelin
[ ] What words are commonly used per control?
[ ] What words are frequently used in conjunction with one another? (in same control, or close to one another in a particular control)
[ ] Find synonyms
[ ] What are the underlying components?
- Find all controls that reference something like “SSH” and pull out what could be the best common control implementation
- Cross-referenced controls
[ ] Grouping/categorization of systems
[x] Commonly used (proper?) nouns
- https://www.youtube.com/watch?v=YrFOAhT4Azk
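
For the duplication questions above, here is a minimal sketch of one way to start. It assumes an SSP has already been parsed into a dict mapping control IDs to implementation text; that structure, the function names, the naive sentence splitter, and the 0.8 similarity threshold are all illustrative assumptions, not anything from the actual data. It uses `difflib.SequenceMatcher` from the standard library for the "similar but not exactly the same" case.

```python
import re
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalize(sentence):
    """Lowercase and drop punctuation so trivial differences don't matter."""
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))


def repeated_sentences(controls):
    """Return {normalized sentence: count} for sentences seen more than once."""
    counts = Counter(
        normalize(s) for text in controls.values() for s in sentences(text)
    )
    return {s: n for s, n in counts.items() if n > 1}


def near_duplicates(controls, threshold=0.8):
    """Return (ratio, a, b) for distinct sentences whose similarity >= threshold."""
    unique = sorted(
        {normalize(s) for text in controls.values() for s in sentences(text)}
    )
    pairs = []
    for a, b in combinations(unique, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((round(ratio, 2), a, b))
    return sorted(pairs, reverse=True)


# Hypothetical control text, made up for illustration only.
controls = {
    "AC-2": "Access is managed through SSH keys. Accounts are reviewed monthly.",
    "AC-17": "Access is managed through SSH keys. Remote sessions are logged.",
    "IA-2": "Accounts are reviewed quarterly. Users authenticate with MFA.",
}

print(repeated_sentences(controls))
for ratio, a, b in near_duplicates(controls):
    print(ratio, "|", a, "|", b)
```

The pairwise comparison is quadratic in the number of unique sentences, so this is only reasonable for a handful of SSPs; comparing across many SSPs would likely need shingling or hashing first.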

Also,

[x] Add analysis questions from research synthesis
[x] Get feedback on the above from @smh2019

This will likely be broken out into smaller issues. Doing the actual analysis is blocked by #14.

One note on basic word frequencies in your counting section: in long, complex technical documents, you might not get any meaningful insight by comparing counts across the whole document. If there's some way to break the text down by section and treat paragraphs or sections themselves as documents, for example, that might be a preferable approach.
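
A rough sketch of what treating sections as documents could look like, using a plain TF-IDF weighting so that words appearing in every section score zero. The dict-of-sections input, the function names, and the sample text are assumptions for illustration; a real run would probably use per-control text from the parsed SSPs and a proper tokenizer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """docs: {name: text}. Returns {name: {term: tf-idf score}}."""
    term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

    # Document frequency: in how many sections does each term appear?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    scores = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, name, k=3):
    """Highest-scoring terms for one section, ties broken alphabetically."""
    ranked = sorted(scores[name].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical sections, made up for illustration only.
sections = {
    "AC-2": "the system reviews accounts accounts",
    "AU-2": "the system logs audit events",
    "SC-8": "the system encrypts traffic",
}

scores = tf_idf(sections)
print(top_terms(scores, "AC-2", k=2))
```

Words like "the" and "system" appear in every section here, so their score is zero and only the section-specific vocabulary surfaces.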

To identify synonyms, you might want to use a specialized corpus, something like SESim, a corpus built from Stack Overflow questions and answers.

Might want to do some topic modeling, if there are "topics" that you can reasonably suspect ATOs break down into. (This is an unsupervised approach, which is helpful in case you can't get your hands on a huge training set of tagged-by-topic documents.)

uscensusbureau / fismatic

analyze past SSPs #17