rasbt / python-machine-learning-book-3rd-edition

The "Python Machine Learning (3rd edition)" book code repository
https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1789955750/
MIT License

Definition of tokenizer in 8.3 Working with bigger data #160

Closed · niwak01 closed this issue 2 years ago

niwak01 commented 2 years ago

I think it should be modified as indicated by the comment in the code below.

import re   # `stop` is assumed to be the stopword list defined earlier in the section

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    # emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())   # "text.lower()" should be "text"
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub('[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized
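
For reference, a quick sanity check of the suggested version (just a sketch; it assumes the NLTK English stopword list that the chapter assigns to stop):

from nltk.corpus import stopwords
stop = stopwords.words('english')   # may require nltk.download('stopwords')

print(tokenizer('This :) is a <a> test :-)!'))
# ['test', ':)', ':)']
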
rasbt commented 2 years ago

Thanks for the suggestion!

Hm, I don't recall why I used text.lower() up there and wonder if there was a specific reason, or whether it's really unnecessary as you point out 🤔.

From a quick test, it seems like they both produce the same results. But yeah, maybe there was a special edge case that I had in mind. Hmm...
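
For instance, something along these lines (a sketch; find_emoticons is just a hypothetical helper that isolates the findall call) returns identical matches for a plain smiley/frowny example:

import re

def find_emoticons(text):
    # the emoticon pattern from the tokenizer, written as a raw string
    return re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)

sample = 'This :) is a <a> test :-(!'
print(find_emoticons(sample.lower()))   # [':)', ':-(']
print(find_emoticons(sample))           # [':)', ':-(']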

niwak01 commented 2 years ago

Thank you very much for your response.

I think you might have intended to write

emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|d|p)', text.lower())

in order to find not only ':-P' but also ':-p'.
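
A quick comparison of the two patterns (a sketch; the variable names are only for illustration):

import re

pattern_book  = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'    # pattern currently in the book
pattern_lower = r'(?::|;|=)(?:-)?(?:\)|\(|d|p)'    # suggested lowercase variant

text = 'great movie :-P really liked it :-p'
print(re.findall(pattern_book, text.lower()))    # [] -- no uppercase P/D survives text.lower()
print(re.findall(pattern_lower, text.lower()))   # [':-p', ':-p'] -- catches both ':-P' and ':-p'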

rasbt commented 2 years ago

Oh, this makes perfect sense, thanks a lot! Changed it!