Between the Clinton emails and the Podesta leak, it seems to me that many document sets include a ton of copy-pasted news articles. On their own, these are really boring and can obscure more interesting stuff. It'd be neat to classify/rank documents by how much of their content is boilerplate (signatures, disclaimers) or pasted news articles, and is therefore probably boring.
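To make the idea a bit more concrete, here's a rough sketch of one way such a ranking could work. It rests on an assumption of mine that may not hold for every corpus: boring content (disclaimers, signatures, pasted articles) tends to show up in more than one document, so the fraction of a document's word 5-grams that also appear elsewhere is a usable "boringness" score. The function names (`shingles`, `boring_scores`, `rank_by_interest`) and the 5-word window are made up for illustration, not taken from any existing tool.

```python
from collections import Counter


def shingles(text, n=5):
    """Return the set of overlapping n-word sequences in text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def boring_scores(docs, n=5):
    """Score each document by the fraction of its shingles seen in other documents.

    1.0 means every n-word run also occurs in another document;
    0.0 means nothing is shared (or the document is shorter than n words).
    """
    doc_shingles = [shingles(d, n) for d in docs]

    # Count how many documents contain each shingle (sets, so once per document).
    doc_freq = Counter()
    for s in doc_shingles:
        doc_freq.update(s)

    scores = []
    for s in doc_shingles:
        if not s:
            scores.append(0.0)
            continue
        shared = sum(1 for sh in s if doc_freq[sh] > 1)
        scores.append(shared / len(s))
    return scores


def rank_by_interest(docs, n=5):
    """Return document indices ordered from least to most boring."""
    scores = boring_scores(docs, n)
    return sorted(range(len(docs)), key=lambda i: scores[i])
```

A document that is nothing but a shared disclaimer would score 1.0, and one with only original text would score 0.0. An article that was pasted into just one document wouldn't be caught this way, so a real version would probably want to compare against a set of known news text as well.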