michaeljohnclancy / news_scraper


Investigate methods of finding a similarity rating between text documents. #19

Open michaeljohnclancy opened 5 years ago

michaeljohnclancy commented 5 years ago

We can use TF or TF-IDF (term frequency, or term frequency-inverse document frequency) to produce vectors that represent our documents.
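As a rough illustration of the TF-IDF idea, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the plain `log(N / df)` form of IDF; the function names and example sentences are made up for illustration, and a library implementation (for example scikit-learn's) uses a smoothed variant, so its numbers would differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for a sketch.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocab, vectors): one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Unsmoothed IDF; a term present in every document gets weight 0.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero on empty documents
        vectors.append([(counts[t] / total) * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical example documents.
docs = [
    "the markets fell sharply today",
    "the markets rose today",
    "the team won the match",
]
vocab, vectors = tfidf_vectors(docs)
```

Words that appear in every document (here "the") end up with weight 0, which is the point of the IDF term: they tell us nothing about which documents are alike.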

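Another possible document encoding is word2vec, which learns a dense vector for each word from the contexts it appears in. This is better than TF-IDF for retaining semantic relationships between words and phrases, although the word vectors still have to be combined (e.g. averaged) into one vector per document.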
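One simple way to get a document vector out of word vectors is to average them. The sketch below assumes we already have a word-to-vector lookup; the `toy_embeddings` table is a hand-written placeholder, not output from a trained word2vec model, and `average_embedding` is a hypothetical helper name.

```python
def average_embedding(tokens, embeddings):
    """Average the vectors of the tokens found in `embeddings`.

    Returns None when no token has a known vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional vectors, for illustration only.
toy_embeddings = {
    "markets": [0.9, 0.1],
    "stocks": [0.8, 0.2],
    "match": [0.1, 0.9],
}

# "and" has no vector in the table, so it is skipped.
doc_vector = average_embedding("stocks and markets".split(), toy_embeddings)
```

Averaging discards word order, so this is only a baseline.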

We can then use classical machine learning methods on the data. We need a way of ranking similarity between the encoded documents. Once we have this similarity metric, we can use a clustering algorithm with it to group similar documents together.
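To make the metric-then-cluster step concrete, here is a small sketch. Cosine similarity is one common choice of metric for document vectors, not something settled in this issue. The grouping step is a deliberately simple stand-in: it links any two documents whose similarity reaches a threshold and takes connected components, rather than using a standard algorithm such as k-means. The vectors and the threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def cluster_by_threshold(vectors, threshold):
    """Label documents so that any pair linked by a chain of
    similarities >= threshold shares a label (connected components)."""
    n = len(vectors)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and \
                        cosine_similarity(vectors[i], vectors[j]) >= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical encoded documents: the first two point in a similar direction.
vectors = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
labels = cluster_by_threshold(vectors, threshold=0.8)
```

With these toy vectors the first two documents land in one group and the third in its own. The threshold would need tuning on real articles.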

nicola-sorace commented 5 years ago

What if, instead of clustering, once we've mapped all the articles into n-dimensional space we just fit a function that estimates the sentiment score at any point? That way we don't have to isolate different stories. Not sure if that works.
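One possible reading of this, sketched below, is a nearest-neighbour regressor: the sentiment at a query point is the inverse-distance-weighted average of the scores of the `k` closest articles. This is only an assumption about what "a function" could be; the comment does not pick an estimator, and the points and scores are invented for illustration.

```python
import math


def knn_sentiment(points, scores, query, k=2):
    """Estimate sentiment at `query` from the k nearest labelled points,
    weighting each neighbour by 1 / distance."""
    nearest = sorted(
        (math.dist(p, query), s) for p, s in zip(points, scores)
    )[:k]

    # If the query coincides with a known article, return its score directly.
    for dist, score in nearest:
        if dist == 0:
            return score

    weights = [1.0 / dist for dist, _ in nearest]
    weighted_sum = sum(w * s for w, (_, s) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Hypothetical article positions and sentiment scores in [-1, 1].
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
scores = [0.8, 0.6, 0.7, -0.9]

estimate = knn_sentiment(points, scores, [0.5, 0.0], k=2)
```

Whether this works in practice depends on nearby points in the embedding space actually having similar sentiment, which is the open question here.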