robinarthur / 5pk


Go further with mining #6

Open robinarthur opened 6 years ago

robinarthur commented 6 years ago

Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:

para = "Hello World. It's good to see you. Thanks for buying this book." Now we want to split para into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument. from nltk.tokenize import sent_tokenize sent_tokenize(para) ['Hello World.', "It's good to see you.", 'Thanks for buying this book.'] So now we have a list of sentences that we can use for further processing.

How it works...

sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages, so it knows which punctuation and characters mark the end of a sentence and the beginning of a new one.

There's more...

The instance used by sent_tokenize() is actually loaded on demand from a pickle file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the PunktSentenceTokenizer once and call its tokenize() method instead:

    >>> import nltk.data
    >>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    >>> tokenizer.tokenize(para)
    ['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

Other languages

If you want to tokenize sentences in languages other than English, you can load one of the other pickle files in tokenizers/punkt and use it just like the English sentence tokenizer. Here's an example for Spanish:

    >>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
    >>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')

See also

In the next recipe, we'll learn how to split sentences into individual words. After that, we'll cover how to use regular expressions for tokenizing text.

Tokenizing sentences into words

In this recipe, we'll split a sentence into individual words. The simple task of creating a list of words from a string is an essential part of all text processing.

How to do it...

Basic word tokenization is very simple: use the word_tokenize() function:

    >>> from nltk.tokenize import word_tokenize
    >>> word_tokenize('Hello World.')
    ['Hello', 'World', '.']

How it works...

word_tokenize() is a wrapper function that calls tokenize() on an instance of TreebankWordTokenizer. It's equivalent to the following:

    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> tokenizer = TreebankWordTokenizer()
    >>> tokenizer.tokenize('Hello World.')
    ['Hello', 'World', '.']

It works by separating words on spaces and punctuation. And as you can see, it does not discard the punctuation, allowing you to decide what to do with it.
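To see that punctuation handling in action, here is a small interactive check (the splits follow the Treebank conventions word_tokenize uses, so contractions come apart at the apostrophe):

    >>> from nltk.tokenize import word_tokenize
    >>> word_tokenize("It's good to see you.")
    ['It', "'s", 'good', 'to', 'see', 'you', '.']

The "'s" and the final period stay as separate tokens, so downstream code can decide whether to keep or drop them.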

robinarthur commented 6 years ago

POS tagging in German: https://stackoverflow.com/q/1639855/7477664
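For reference, here is a minimal NLTK-only sketch along the lines discussed in that thread: train a simple n-gram tagger on the TIGER corpus, since the default model behind nltk.pos_tag is trained on English. The file name, its location, and the column layout below are assumptions about the freely available TIGER CoNLL-09 export, so adjust them to whatever copy you download.

```python
import random

import nltk
from nltk.corpus.reader import ConllCorpusReader

# Assumed file name/location of the TIGER CoNLL-09 export; the column
# layout (word form in column 2, POS tag in column 5) is likewise an
# assumption about that export.
corpus = ConllCorpusReader(
    '.',
    'tiger_release_aug07.corrected.16022013.conll09',
    ['ignore', 'words', 'ignore', 'ignore', 'pos'],
    encoding='utf-8',
)

# Shuffle the tagged sentences and hold out 10% for evaluation.
tagged_sents = list(corpus.tagged_sents())
random.shuffle(tagged_sents)
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# A bigram tagger that backs off to a unigram tagger -- a simple baseline.
unigram = nltk.UnigramTagger(train_sents)
tagger = nltk.BigramTagger(train_sents, backoff=unigram)

print(tagger.evaluate(test_sents))
print(tagger.tag(nltk.word_tokenize('Das ist ein einfacher Satz.', language='german')))
```

This is only a baseline; heavier options such as the Stanford POS tagger, which ships a German model, are also commonly suggested for German.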