prathimacode-hub / DS-ScriptsNook

🎊One Stop Destination to get acquainted with scripts in Data Science. Turn yourself into a pro. Show your support by ✨ this repository.
https://prathimacode-hub.github.io/DS-ScriptsNook/
MIT License

Word Tokenization Techniques in NLP #22

Closed: shivani6320 closed this issue 2 years ago

shivani6320 commented 2 years ago

Title: Natural Language Processing/Algorithms/ Word tokenization

About: I would like to implement different word tokenization techniques on text data, with an explanation of each.

Name: Shivani Rana

Label: Feature Request

Define You:

Is your feature request related to a problem? Please describe.

My feature request is to add a word tokenization algorithm to the NLP section. What is tokenization? Tokenization is the process of breaking text into smaller pieces called tokens. These pieces can be sentences, words, or sub-words. For example, the sentence “I won” can be tokenized into two word tokens, “I” and “won”.
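To make the definition concrete, here is a minimal sketch of sentence-level and word-level tokenization using only the Python standard library. The helper names `sentence_tokens` and `word_tokens` are made up for illustration, and the sentence splitter is a naive assumption (it breaks after `.`, `!`, or `?` followed by whitespace), not a robust sentence tokenizer.

```python
import re


def sentence_tokens(text):
    """Naively split text into sentences after '.', '!' or '?' (illustrative only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokens(text):
    """Split text into word tokens on runs of whitespace."""
    return text.split()


if __name__ == "__main__":
    print(word_tokens("I won"))                  # ['I', 'won']
    print(sentence_tokens("I won. You lost!"))   # ['I won.', 'You lost!']
```

Sub-word tokenization (splitting a word into smaller units) is not shown here, since it usually depends on a learned vocabulary.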

Describe the solution you'd like...

I would describe different word tokenization techniques, such as whitespace tokenization, punctuation-based tokenization, the default TreebankWordTokenizer, and TweetTokenizer, along with a practical implementation of each.
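As a rough sketch of how the first two techniques differ, the snippet below implements whitespace and punctuation-based tokenization with the standard library only. The function names are hypothetical, and the regular expression `\w+|[^\w\s]+` is an assumption about what "punctuation-based" should mean here (runs of word characters or runs of punctuation become separate tokens). TreebankWordTokenizer and TweetTokenizer are classes in `nltk.tokenize`; they are not shown because they need the third-party NLTK package.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def punctuation_tokenize(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


if __name__ == "__main__":
    sample = "Don't stop, it's fun!"
    print(whitespace_tokenize(sample))
    # ["Don't", 'stop,', "it's", 'fun!']
    print(punctuation_tokenize(sample))
    # ['Don', "'", 't', 'stop', ',', 'it', "'", 's', 'fun', '!']
```

The comparison shows the trade-off: whitespace splitting keeps contractions intact but leaves commas and exclamation marks glued to words, while the punctuation-based pattern isolates punctuation at the cost of breaking contractions apart.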

I would like to work on this issue. @prathimacode-hub Please assign me.
prathimacode-hub commented 2 years ago

Issue assigned. @shivani6320