steinnp / Big-Data-Final

Final project in the course Big Data in Media Technology at KTH, 1st period 2017
MIT License

Data preprocessing #2

Open steinnp opened 6 years ago

steinnp commented 6 years ago

We need to preprocess the data for our classifier. There are multiple methods we need to explore.

steinnp commented 6 years ago

In general, we first build the vocabulary of the corpus and then generate a word count vector for each file, which is nothing but the frequency of the vocabulary words in that file. But there are limitations in this conventional approach to extracting features, as listed below:
a) Frequently occurring words that appear in all files of the corpus irrespective of sentiment (in this case words like 'movie', 'acting', etc.) will be weighted the same as more distinguishing words in the document.
b) Stop words will be present in the vocabulary if the text is not processed properly.
c) Rare words or keywords that could be distinguishing will not get any special weight. This is where the tf-idf weighting factor comes in, which eliminates these limitations.
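For reference, the conventional word-count approach described above can be reproduced with scikit-learn's CountVectorizer. A minimal sketch (the two toy documents are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the movie was great, great acting',
        'the movie was terrible']

count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)  # sparse document-term matrix
print(count_vect.vocabulary_)            # word -> column index in the learned vocabulary
print(counts.toarray())                  # raw word counts per document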

TF-IDF Strategy: Rather than just counting, we can use the TF-IDF score of a word to rank its importance. The tf-idf score of a word w is tf(w) * idf(w), where tf(w) = (number of times the word appears in a document) / (total number of words in the document) and idf(w) = log(number of documents / number of documents that contain the word w). We could implement TF-IDF ourselves to convert our words to numbers, but we do not need to, because the sklearn library already provides a TF-IDF implementation that we can use.

from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))
stopset.update(['span', 'font', 'weighth', 'on'])
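As a small sanity check of the formulas above, here is a sketch that computes tf and idf directly as defined (the two toy token lists are made up for illustration):

import math

docs = [['the', 'movie', 'was', 'great'],
        ['the', 'acting', 'was', 'bad']]

def tf(word, doc):
    # fraction of the document's tokens that are this word
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log of (number of documents / number of documents containing the word)
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

print(tf('great', docs[0]) * idf('great', docs))  # 0.25 * log(2) ~ 0.17
print(tf('the', docs[0]) * idf('the', docs))      # 0.25 * log(1) = 0.0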

Stop words are used together with tf-idf when we want to throw some words, such as junk words, out of the corpus. This gives us the opportunity to clean the documents when we do not want these words to be converted into features.

TF-IDF vectorizing: instead of our own TF-IDF implementation we can use the code below from the sklearn library

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopset, use_idf=True, ngram_range=(1, 3))

stop_words=stopset is the stop word set we already defined above, and ngram_range=(1, 3) controls the tokenization: normally we count each word on its own, but sometimes it is important to also look at sequences of 2 or 3 words together (bigrams and trigrams), as illustrated in the sketch below.
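To make the effect of ngram_range concrete, here is a small sketch (the single toy sentence is made up for illustration) comparing plain unigrams with a (1, 3) setting:

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ['the acting was great']

uni = TfidfVectorizer(ngram_range=(1, 1))
uni.fit(toy)
print(sorted(uni.vocabulary_))  # ['acting', 'great', 'the', 'was']

tri = TfidfVectorizer(ngram_range=(1, 3))
tri.fit(toy)
print(sorted(tri.vocabulary_))  # also includes 'the acting', 'acting was great', ...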

x = vectorizer.fit_transform(df.txt)  # df.txt is the text column of our dataframe

x is now a sparse document-term matrix, and we can inspect it as below, for example the row for the first document in the corpus:

print(x[0])

or

print(x.shape)

or

x.shape

Here is another example of TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

vect = TfidfVectorizer()
X = vect.fit_transform(corpus)
X.todense()

matrix([[ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
          0.        ,  0.35872874,  0.        ,  0.43877674],
        [ 0.        ,  0.27230147,  0.        ,  0.27230147,  0.        ,
          0.85322574,  0.22262429,  0.        ,  0.27230147],
        [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
          0.        ,  0.28847675,  0.55280532,  0.        ],
        [ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
          0.        ,  0.35872874,  0.        ,  0.43877674]])
steinnp commented 6 years ago

Look into word and sentence tokenization
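A possible starting point with NLTK (a sketch; it assumes the punkt tokenizer models are downloaded):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # tokenizer models, only needed once

text = 'The movie was great. The acting, however, was terrible!'
print(sent_tokenize(text))  # ['The movie was great.', 'The acting, however, was terrible!']
print(word_tokenize(text))  # ['The', 'movie', 'was', 'great', '.', 'The', 'acting', ...]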

steinnp commented 6 years ago

Explore different types of stemmers for different classifiers
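For example, NLTK ships several stemmers that we could compare side by side (a sketch with made-up example words):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

for w in ['acting', 'actors', 'amazingly', 'studies']:
    print(w, porter.stem(w), lancaster.stem(w), snowball.stem(w))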

steinnp commented 6 years ago

POS-tagging
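A minimal sketch with NLTK's default tagger (assuming the tagger model is downloaded):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # tagger model, only needed once

tokens = word_tokenize('The acting was surprisingly good')
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('acting', 'NN'), ('was', 'VBD'), ('surprisingly', 'RB'), ('good', 'JJ')]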

steinnp commented 6 years ago

Spelling/grammatical error identification and recovery
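For the spelling part, one simple heuristic (only a sketch, not a full grammar checker) is to pick the dictionary word with the smallest edit distance, using NLTK's English word list:

import nltk
from nltk.corpus import words
from nltk.metrics import edit_distance

nltk.download('words')  # English word list, only needed once

vocab = set(words.words())

def correct(token, max_len_diff=2):
    # return the token unchanged if it is a known word, otherwise the
    # dictionary word with the smallest edit distance (slow but simple)
    if token.lower() in vocab:
        return token
    candidates = (w for w in vocab if abs(len(w) - len(token)) <= max_len_diff)
    return min(candidates, key=lambda w: edit_distance(token.lower(), w))

print(correct('actng'))  # picks the closest dictionary word, e.g. 'acting'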