uscensusbureau / fismatic

https://github.com/uscensusbureau/fismatic/projects/1

analyze past SSPs #17

Open afeld opened 5 years ago

afeld commented 5 years ago

What do we want to know from past SSPs? Running list:

Also,

This will likely be broken out into smaller issues. Doing the actual analysis is blocked by #14.

smh2019 commented 5 years ago

One note on the basic word frequencies in your counting section: in long, complex technical documents, you might not get any meaningful insight from comparing counts across the whole document. If there's some way to break the text down by section and treat paragraphs or sections themselves as documents, for example, that might be a preferable approach.
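A rough sketch of what "sections as documents" could look like, in case it helps. It assumes the SSP text has already been split into a dict keyed by section (here, control IDs), and it swaps raw counts for a simple TF-IDF weighting so that words appearing in every section score zero. The function names and the sample text are made up for illustration and are not from fismatic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase words of two or more characters; deliberately crude.
    return re.findall(r"[a-z][a-z0-9-]+", text.lower())


def tfidf_by_section(sections):
    """sections: dict of section name -> text. Returns name -> {term: score}."""
    counts = {name: Counter(tokenize(text)) for name, text in sections.items()}
    n_sections = len(counts)
    # Document frequency: in how many sections does each term appear?
    doc_freq = Counter()
    for section_counts in counts.values():
        doc_freq.update(section_counts.keys())
    scores = {}
    for name, section_counts in counts.items():
        total = sum(section_counts.values()) or 1
        scores[name] = {
            term: (count / total) * math.log(n_sections / doc_freq[term])
            for term, count in section_counts.items()
        }
    return scores


def top_terms(scores, k=3):
    # Highest score first; ties broken alphabetically for stable output.
    return {
        name: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for name, s in scores.items()
    }


# Hypothetical section text, invented for this example.
sections = {
    "AC-2": "Accounts are reviewed quarterly. Account managers approve each account request.",
    "AU-2": "The system logs audit events. Audit logs are reviewed weekly.",
    "SC-7": "The system boundary is protected by a firewall. Firewall rules are reviewed monthly.",
}

scores = tfidf_by_section(sections)
top = top_terms(scores)
for name, terms in top.items():
    print(name, terms)
```

A word like "reviewed" shows up in all three sample sections, so it gets a score of zero, while section-specific words rise to the top. The same idea should work with paragraphs instead of sections.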

To identify synonyms, you might want to use a specialized corpus -- something like SESim, a corpus built from Stack Overflow questions and answers.

Might want to do some topic modeling, if there are "topics" that you can reasonably suspect ATOs break down along. (This is an unsupervised approach, which is helpful in case you can't get your hands on a huge training set of tagged-by-topic documents.)
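To make the topic modeling idea concrete, here is a toy sketch of one common unsupervised method, LDA fitted with collapsed Gibbs sampling, using only the standard library. Everything here is an assumption for illustration: the function name, the hyperparameters (`alpha`, `beta`, number of topics), and the sample token lists are invented, and a real analysis would more likely use an existing library on properly tokenized SSP sections.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists. Returns (doc_topic_counts, topic_word_counts)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical, pre-tokenized snippets invented for this example.
docs = [
    ["account", "access", "password", "account", "user"],
    ["audit", "log", "event", "audit", "record"],
    ["user", "password", "access", "login"],
    ["log", "record", "event", "retention"],
]

doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(topic_word):
    print("topic", k, [w for w, c in words.most_common(3) if c > 0])
```

No labeled training data is needed; the only inputs are the token lists and a guess at the number of topics. Whether the resulting topics are meaningful would need to be checked by reading them.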