Open afeld opened 5 years ago
One note on basic word frequencies in your counting section: in long, complex technical documents, you might not get any meaningful insight by comparing counts across the whole document. If there's some way to break it down by section, and treat paragraphs or sections themselves as documents, for example, that might be a preferable approach.
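A rough sketch of what per-section counting could look like, assuming sections are separated by blank lines (real SSPs would probably need to split on headings or control IDs instead). The function names and the sample text are made up for illustration:

```python
import re
from collections import Counter


def split_sections(text):
    """Split on blank lines; each chunk is treated as its own 'document'.

    Assumption: blank lines mark section boundaries. Swap in a
    heading-based splitter if the source documents have real structure.
    """
    return [chunk.strip() for chunk in re.split(r"\n\s*\n", text) if chunk.strip()]


def word_counts(section):
    """Lowercase the section and count simple word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", section.lower()))


def counts_by_section(text):
    """Return one Counter per section instead of one for the whole document."""
    return [word_counts(section) for section in split_sections(text)]


# Hypothetical sample text, not taken from any real SSP.
sample = """Access control is enforced by the identity provider.
Access requests are logged.

Backups run nightly. Backups are encrypted at rest."""

per_section = counts_by_section(sample)

# Whole-document counts are still available by summing the per-section ones.
whole_doc = sum(per_section, Counter())

for i, counts in enumerate(per_section):
    print(i, counts.most_common(3))
```

Keeping one `Counter` per section means you can still roll up to document-level totals, but you can also compare sections against each other, which is where the signal is more likely to be.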
To identify synonyms, you might want to use a specialized corpus -- something like SESim, a corpus built from Stack Overflow questions and answers.
Might want to do some topic modeling, if there are "topics" that you can reasonably suspect ATOs break down into. (This is an unsupervised approach, which is helpful in case you can't get your hands on a huge training set of tagged-by-topic documents.)
What do we want to know from past SSPs? Running list:
Also,
This will likely be broken out into smaller issues. Doing the actual analysis is blocked by #14.