Investigate methods of finding a similarity rating between text documents.

We can use TF (term frequency) or TF-IDF (term frequency-inverse document frequency) weighting to produce vectors that represent our documents.
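As a rough illustration of the TF-IDF idea, here is a minimal pure-Python sketch. The `tokenize` and `tfidf_vectors` helpers and the sample headlines are made up for this example, and the weighting is the plain `tf * log(N / df)` variant; libraries such as scikit-learn apply smoothing and normalisation, so their numbers would differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would also strip punctuation.
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vector = []
        for term in vocabulary:
            tf = counts[term] / total
            idf = math.log(n_docs / doc_freq[term])
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical sample documents, for illustration only.
docs = [
    "the market fell sharply today",
    "the market rose today",
    "the team won the match",
]
vocab, vectors = tfidf_vectors(docs)
```

A term that appears in every document (here "the") gets weight zero, while a term unique to one document (here "team") gets a positive weight only in that document's vector.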

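Another possible document encoding is word2vec, which learns a dense vector for each word from the contexts it appears in, so words used in similar contexts get similar vectors. This retains semantic information that bag-of-words counts such as TF-IDF discard.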
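word2vec gives one vector per word, so we still need a way to get one vector per document. A common simple approach is to average the word vectors. The sketch below assumes that approach; the `TOY_EMBEDDINGS` table and the `document_vector` helper are invented for illustration, and a trained model would supply the real vectors.

```python
# Toy 3-dimensional word vectors, invented for illustration only.
# A trained word2vec model would supply real vectors with many more dimensions.
TOY_EMBEDDINGS = {
    "market": [0.9, 0.1, 0.0],
    "stocks": [0.8, 0.2, 0.1],
    "match": [0.1, 0.9, 0.0],
    "team": [0.0, 0.8, 0.2],
}


def document_vector(text, embeddings):
    """Average the vectors of known words; unknown words are skipped."""
    known = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not known:
        return None  # no known words, so no vector can be built
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


# "rally" is not in the toy table, so only "market" and "stocks" contribute.
finance_doc = document_vector("market stocks rally", TOY_EMBEDDINGS)
```

Averaging ignores word order, so it is only a baseline, but it yields fixed-length vectors that the same similarity metrics can consume.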

We can then use classical machine learning methods on the encoded data. We need a way of ranking similarity between the encoded documents. Once we have this metric, we can use a clustering algorithm with it to group documents by similarity.
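One candidate for the metric is cosine similarity, which is an assumption here since the issue does not pick one yet. The sketch below pairs it with a very simple single-linkage grouping: two documents land in the same cluster if they are connected by a chain of pairs whose similarity meets a threshold. The `cluster_by_threshold` helper, the threshold value, and the sample vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def cluster_by_threshold(vectors, threshold):
    """Group document indices whose similarity chains meet the threshold."""
    parent = list(range(len(vectors)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge the clusters of every pair that is similar enough.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical encoded documents: two point one way, two another.
doc_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
clusters = cluster_by_threshold(doc_vectors, threshold=0.8)
```

The pairwise loop is quadratic in the number of documents, which is fine for experimenting but would need replacing (for example with k-means or an approximate nearest-neighbour index) on a large corpus.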

michaeljohnclancy / news_scraper #19