This project demonstrates how to preprocess text and calculate text similarity using the spaCy and scikit-learn libraries. The text is first lemmatized, then transformed into word-count vectors with CountVectorizer. Cosine similarity between those vectors is then calculated to measure how similar the texts are.
The necessary libraries are imported first: spaCy for natural language processing, scikit-learn for vectorization and similarity calculation, and Matplotlib for visualization.
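The text is first lemmatized using spaCy to reduce words to their base forms (for example, "running" becomes "run"), so that different inflections of the same word count as the same term.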
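The lemmatization step could be sketched roughly as follows. The helper names `lemmatize` and `load_pipeline`, the choice of the `en_core_web_sm` model, and the decision to drop punctuation and lowercase the lemmas are all assumptions made for illustration, not details taken from this project.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline (assumed model; install it separately)."""
    import spacy  # imported here so `lemmatize` works with any pipeline object

    return spacy.load(model_name)


def lemmatize(text, nlp):
    """Return `text` with every token replaced by its lowercase lemma.

    Punctuation and whitespace tokens are dropped (an assumption of this
    sketch), and the lemmas are joined back into one string so the result
    can be passed straight to a vectorizer.
    """
    doc = nlp(text)
    lemmas = [
        token.lemma_.lower()
        for token in doc
        if not token.is_punct and not token.is_space
    ]
    return " ".join(lemmas)
```

A typical call would be `lemmatize("The cats were running.", load_pipeline())`, assuming the model has been downloaded with `python -m spacy download en_core_web_sm`.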
Using CountVectorizer, the lemmatized text is transformed into vectors.
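Cosine similarity is then calculated between the two text vectors. Because word counts are never negative, the score ranges from 0 (no shared vocabulary) to 1 (identical word distributions).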
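The similarity calculation might look like the sketch below, which uses `cosine_similarity` from `sklearn.metrics.pairwise`. The two input strings are hypothetical and assumed to be lemmatized already.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-lemmatized sample texts (not the project's examples).
texts = [
    "the cat sit on the mat",
    "the cat lie on the mat",
]

vectors = CountVectorizer().fit_transform(texts)

# Pairwise cosine similarity: entry [i, j] compares text i with text j.
similarity = cosine_similarity(vectors)

# The diagonal is 1.0 (each text compared with itself); the off-diagonal
# entry is the score between the two texts.
print(similarity)
print(f"similarity between the two texts: {similarity[0, 1]:.3f}")
```

For these two strings the count vectors share a dot product of 7 and each has a squared norm of 8, so the off-diagonal score is 7/8 = 0.875.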
You can visualize the similarity matrix using Matplotlib.
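One possible way to draw the matrix is as an annotated heatmap. In the sketch below, the function name `plot_similarity`, the labels, the colormap, and the hard-coded matrix are illustrative assumptions; the `Agg` backend is selected only so the snippet also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless rendering; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np


def plot_similarity(matrix, labels):
    """Draw a square similarity matrix as a heatmap and return the figure."""
    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="viridis", vmin=0.0, vmax=1.0)

    # Label both axes with the text names.
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)

    # Write each score inside its cell, picking a readable text color.
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            color = "white" if matrix[i, j] < 0.5 else "black"
            ax.text(j, i, f"{matrix[i, j]:.2f}", ha="center", va="center", color=color)

    fig.colorbar(image, ax=ax, label="cosine similarity")
    ax.set_title("Cosine similarity")
    return fig


# Illustrative matrix; in practice this would come from cosine_similarity().
example = np.array([[1.0, 0.875], [0.875, 1.0]])
fig = plot_similarity(example, ["text 1", "text 2"])
# fig.savefig("similarity.png")  # uncomment to write the image to disk
```

Fixing the color scale to the 0 to 1 range keeps plots from different text pairs visually comparable.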
When comparing the two example texts:
The cosine similarity matrix will look like this:
When comparing the two non-similar example texts:
The cosine similarity matrix will look like this: