lukasgarbas / nlp-text-emotion

Multi-class sentiment analysis lstm, finetuned bert
206 stars 80 forks

normalize() argument 2 must be str, not float #2

Open harshitaggarwal01 opened 3 years ago

harshitaggarwal01 commented 3 years ago

In the preprocessing block in bert.ipynb, the following error is shown:

(I have also tried in google colab but it gives the same error)

TypeError Traceback (most recent call last)

in
      4 preprocess_mode='bert',
      5 maxlen=350,
----> 6 max_features=35000)

~\Anaconda3\envs\tensorflow\lib\site-packages\ktrain\text\data.py in texts_from_array(x_train, y_train, x_test, y_test, class_names, max_features, maxlen, val_pct, ngram_range, preprocess_mode, lang, random_state, verbose)
    365 class_names = class_names,
    366 lang=lang, ngram_range=ngram_range)
--> 367 trn = preproc.preprocess_train(x_train, y_train, verbose=verbose)
    368 val = preproc.preprocess_test(x_test, y_test, verbose=verbose)
    369 if not preproc.get_classes() and verbose:

~\Anaconda3\envs\tensorflow\lib\site-packages\ktrain\text\preprocessor.py in preprocess_train(self, texts, y, mode, verbose)
    759 U.vprint('language: %s' % (self.lang), verbose=verbose)
    760
--> 761 x = bert_tokenize(texts, self.tok, self.maxlen, verbose=verbose)
    762
    763 # transform y

~\Anaconda3\envs\tensorflow\lib\site-packages\ktrain\text\preprocessor.py in bert_tokenize(docs, tokenizer, maxlen, verbose)
    157 for i in mb:
    158 for doc in progress_bar(docs, parent=mb):
--> 159 ids, segments = tokenizer.encode(doc, max_len=maxlen)
    160 indices.append(ids)
    161 if verbose: mb.write('done.')

~\Anaconda3\envs\tensorflow\lib\site-packages\keras_bert\tokenizer.py in encode(self, first, second, max_len)
     71
     72 def encode(self, first, second=None, max_len=None):
---> 73 first_tokens = self._tokenize(first)
     74 second_tokens = self._tokenize(second) if second is not None else None
     75 self._truncate(first_tokens, second_tokens, max_len)

~\Anaconda3\envs\tensorflow\lib\site-packages\keras_bert\tokenizer.py in _tokenize(self, text)
    101 def _tokenize(self, text):
    102 if not self._cased:
--> 103 text = unicodedata.normalize('NFD', text)
    104 text = ''.join([ch for ch in text if unicodedata.category(ch) != 'Mn'])
    105 text = text.lower()

TypeError: normalize() argument 2 must be str, not float

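The last frame of the traceback can be reproduced in isolation, which may help narrow things down: `unicodedata.normalize` only accepts `str`, so a float reaching the tokenizer raises this `TypeError`. A float `NaN` is used here as an assumed example (an empty cell read by pandas is a common source), not as a confirmed cause for this notebook; `try_normalize` is a hypothetical helper written only for this sketch.

```python
import unicodedata


def try_normalize(value):
    """Return the NFD-normalized string, or the error text if value is not a str."""
    try:
        return unicodedata.normalize('NFD', value)
    except TypeError as exc:
        return f"TypeError: {exc}"


# A real string normalizes fine ("\u00e9" is a precomposed e-acute).
ok = try_normalize("caf\u00e9")

# A float (assumed here to be a NaN from a missing text value) does not.
bad = try_normalize(float("nan"))
print(bad)
```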
christar1225 commented 1 year ago

What is the answer to this problem?

kishan2k2 commented 1 year ago

Do not encode the labels, i.e. skip the preceding lines of code:

encoding = { 'joy': 0, 'sadness': 1, 'fear': 2, 'anger': 3, 'neutral': 4 }

# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]
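
Whether or not the labels are encoded, the traceback shows `unicodedata.normalize` receiving a float, so it may also be worth checking that every element of the texts passed to `texts_from_array` is a `str`. Below is a minimal sketch for locating and dropping non-string entries. It assumes `x_train` and `y_train` are plain Python lists of equal length; the toy data and the helper names `find_non_strings` and `drop_non_strings` are hypothetical and not part of the notebook or ktrain.

```python
def find_non_strings(texts):
    """Return (index, value) pairs for entries that are not str."""
    return [(i, t) for i, t in enumerate(texts) if not isinstance(t, str)]


def drop_non_strings(texts, labels):
    """Keep only the (text, label) pairs whose text is a str."""
    kept = [(t, y) for t, y in zip(texts, labels) if isinstance(t, str)]
    texts_clean = [t for t, _ in kept]
    labels_clean = [y for _, y in kept]
    return texts_clean, labels_clean


# Toy data standing in for the notebook's x_train / y_train (assumption).
x_train = ["I am happy", float("nan"), "This is scary"]
y_train = ["joy", "neutral", "fear"]

# Show which rows would break the tokenizer, then remove them.
print(find_non_strings(x_train))
x_train, y_train = drop_non_strings(x_train, y_train)
```

The same filtering would need to be applied to `x_test` and `y_test` before calling `texts_from_array`.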