joaoftrodrigues / imdb-binary-sentiment-analysis

Sentiment analysis applied to IMDb reviews, classifying each as positive or negative.


Preprocessing and Sentiment Lexicon #4

Open joaoftrodrigues opened 1 year ago

joaoftrodrigues commented 1 year ago

Description

First stage of the project, in which data preprocessing is done. In addition, a sentiment-lexicon approach is to be used.

Lexicon being used: NCR

Planned preprocessing steps

[x] Tokenization
[x] Lemmatization
[x] Remove Entities
[x] Lower Case
[x] Remove Punctuation
[ ] Apply Word correction
[ ] Long Words

joaoftrodrigues commented 1 year ago

Tokenization

The first approach uses the lexicon directly after tokenization: the lexicon values of the tokens are summed to get a positive or negative label. The next step is to add lemmatization, to try to capture words that the lexicon would otherwise miss.

Accuracy: 65.38%
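
A minimal sketch of the summing approach described above. The lexicon entries, the regex tokenizer, the tie-breaking rule and all function names are assumptions made for illustration; the thread does not show the actual implementation.

```python
import re

# Toy lexicon for illustration only; the real lexicon entries are not shown in this thread.
TOY_LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "boring": -1, "awful": -1}


def tokenize(text):
    """Split text into word tokens (a simple regex stand-in for a real tokenizer)."""
    return re.findall(r"[A-Za-z']+", text)


def lexicon_score(tokens, lexicon):
    """Sum the lexicon values of the tokens; tokens not in the lexicon contribute 0."""
    return sum(lexicon.get(tok, 0) for tok in tokens)


def classify(text, lexicon=TOY_LEXICON):
    """Label a review 'positive' when the summed score is above zero."""
    # Assumption: a score of exactly 0 falls to 'negative'; the thread does not say how ties are handled.
    return "positive" if lexicon_score(tokenize(text), lexicon) > 0 else "negative"


def accuracy(reviews, labels, lexicon=TOY_LEXICON):
    """Fraction of reviews whose predicted label matches the gold label."""
    predictions = [classify(review, lexicon) for review in reviews]
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)
```

With this toy lexicon, `classify("Great film")` comes out negative because `"Great"` does not match the lowercase entry `"great"`, which is the kind of miss the later lower-casing step is meant to address.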

joaoftrodrigues commented 1 year ago

+ Lemmatization

To try to match more words in the lexicon, lemmatization was applied, since the lexicon entries are all lemmas. This led to a slight drop in accuracy, which is not necessarily bad; there may still be adjustments that improve the system.

The next step is to lowercase the words, as NCR works with lowercase words. Before that, entities will be removed to avoid conflicts and misinterpretation of words.

Accuracy: 65.26%
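
A small sketch of why lemmatization can raise lexicon coverage. The lemma table and lexicon below are hypothetical stand-ins; an actual run would use a lemmatizer from an NLP library, and the thread does not say which one was used.

```python
# Hypothetical lemma table; it stands in for a real lemmatizer.
TOY_LEMMAS = {"loved": "love", "loves": "love", "movies": "movie"}

# Toy lexicon whose entries are all lemmas, as described in the thread.
TOY_LEXICON = {"love": 1, "good": 1, "bad": -1}


def lemmatize(tokens, lemmas=TOY_LEMMAS):
    """Map each token to its lemma when one is known, otherwise keep the token."""
    return [lemmas.get(tok, tok) for tok in tokens]


def coverage(tokens, lexicon=TOY_LEXICON):
    """Fraction of tokens that have an entry in the lexicon."""
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)


tokens = ["i", "loved", "these", "movies"]
before = coverage(tokens)             # no token matches the lexicon
after = coverage(lemmatize(tokens))   # "loved" becomes "love", which matches
```

Higher coverage does not guarantee higher accuracy: every newly matched word adds its value to the sum, whether or not it reflects the reviewer's overall opinion.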

joaoftrodrigues commented 1 year ago

+ LowerCase + Entities Removal

Before applying lower case, entities need to be removed to avoid misinterpretations.

Removal of Entities

Processing time increased from 56.511 seconds to 252.185 seconds. Accuracy stays at 65.26%. Removing punctuation (the next thing to try) could improve speed.

Removal of Punctuation

Accuracy increased slightly, by roughly 0.01%. Execution time is now 271.757 seconds, but further processing will run on fewer tokens.
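
A possible way to drop punctuation tokens, sketched with the standard library only. The function name and the rule "drop tokens made up entirely of punctuation" are assumptions; the thread does not say how punctuation was removed.

```python
import string

PUNCTUATION = set(string.punctuation)


def remove_punctuation(tokens):
    """Drop tokens made up entirely of punctuation characters (empty tokens are dropped too)."""
    return [tok for tok in tokens if not all(ch in PUNCTUATION for ch in tok)]


tokens = ["Great", ",", "but", "don't", "watch", "...", "!"]
cleaned = remove_punctuation(tokens)
```

Tokens such as `"don't"` survive because only part of the token is punctuation, while the shorter token list is what later steps then iterate over.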

Lowering case

Accuracy rose to 65.33%. Interestingly, accuracy was slightly higher when lower case was applied without removing entities, but that wouldn't be good practice.
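
To illustrate one way the misinterpretation could arise, the sketch below lowercases a token list with and without removing entities first. The lexicon, the example sentence and the hand-labelled entity positions are all hypothetical; in practice the positions would come from a named-entity recognizer.

```python
# Toy lexicon; "grace" is assumed here to carry a positive value.
TOY_LEXICON = {"grace": 1, "dull": -1}


def remove_entities(tokens, entity_indices):
    """Drop tokens at positions flagged as named entities."""
    return [tok for i, tok in enumerate(tokens) if i not in entity_indices]


def lowercase(tokens):
    """Lowercase every token so it can match lowercase lexicon entries."""
    return [tok.lower() for tok in tokens]


def score(tokens, lexicon=TOY_LEXICON):
    """Sum lexicon values; tokens not in the lexicon contribute 0."""
    return sum(lexicon.get(tok, 0) for tok in tokens)


tokens = ["Grace", "Kelly", "is", "dull", "here"]
entity_indices = {0, 1}  # hand-labelled stand-in for NER output ("Grace Kelly")

# Lowercasing alone lets the name "Grace" match the lexicon word "grace".
lower_only = score(lowercase(tokens))
# Removing entities first leaves only "dull" to be scored.
entities_first = score(lowercase(remove_entities(tokens, entity_indices)))
```

In this toy case the name cancels out the negative word when entities are kept, so the order of the two steps changes the predicted label.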