taranjeet/hindi-tokenizer

Tokenizer for Hindi

This package implements a tokenizer and a stemmer for the Hindi language.

To import the package:

from HindiTokenizer import Tokenizer

This package implements various functions, which are listed below.

A Tokenizer can be created in two ways: by passing the text directly, or by creating an empty Tokenizer and reading the text from a file.

t=Tokenizer("यह वाक्य हिन्दी में है।")

t=Tokenizer()
t.read_from_file('filename_here')

A brief description of all the functions:

This function takes the name of a file in the current directory and reads its contents.

t.read_from_file('hindi_file.txt')

Given a text, this will generate a list of sentences.

t.generate_sentences()
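
As a rough illustration of what sentence generation involves for Hindi, here is a minimal standalone sketch. It does not use this package; the function name split_sentences and the choice of sentence-ending marks (the danda, '?' and '!') are assumptions made for the example, not the package's actual rules.

```python
import re

def split_sentences(text):
    """Split text into sentences on the danda (।), '?' and '!'."""
    # Break the text at each sentence-ending mark.
    parts = re.split(r"[।?!]", text)
    # Trim surrounding whitespace and drop empty pieces.
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("यह वाक्य हिन्दी में है। यह दूसरा वाक्य है।")
```

Splitting on the danda rather than on a full stop is the main difference from an English sentence splitter.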

This will print the sentences generated by generate_sentences.

t.generate_sentences()
t.print_sentences()

This will generate a list of tokens from the given text

t.tokenize()
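
For readers curious about what tokenizing Devanagari text can look like, the following standalone sketch extracts runs of Devanagari characters with a regular expression. The name simple_tokenize and the character range are assumptions for this example; the package's own tokenize may behave differently (for instance, this sketch silently drops Latin-script words).

```python
import re

# Devanagari block U+0900 to U+097F, leaving out the danda (U+0964)
# and double danda (U+0965) so they are not treated as part of a word.
DEVANAGARI_WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")

def simple_tokenize(text):
    """Return every run of Devanagari characters as one token."""
    return DEVANAGARI_WORD.findall(text)

tokens = simple_tokenize("यह वाक्य हिन्दी में है।")
```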

This will print the tokens generated by tokenize.

t.tokenize()
t.print_tokens()

This will generate a dictionary of frequency of words and return it.

freq_dict=t.generate_freq_dict()

This will print the dictionary of frequency of words generated by generate_freq_dict.

freq_dict=t.generate_freq_dict()
t.print_freq_dict(freq_dict)
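
A word-frequency dictionary of this kind can be sketched with the standard library alone. The helper name word_frequencies is hypothetical, and the sketch assumes the text has already been split into tokens; it is not how this package is necessarily implemented.

```python
from collections import Counter

def word_frequencies(tokens):
    """Map each distinct token to the number of times it appears."""
    return dict(Counter(tokens))

freq = word_frequencies(["यह", "वाक्य", "यह"])
for word, count in freq.items():
    print(word, count)
```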

Given a word, this will generate its stem word.

word=t.generate_stem_word("भारतीय")
print(word)
# भारत
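
One common way to stem Hindi words is to strip a known suffix. The sketch below illustrates that idea only: the suffix list is a tiny hypothetical sample and strip_suffix is a made-up name, so it should not be read as the rules this package applies.

```python
# Hypothetical sample of suffixes; a real stemmer would need a far larger list.
SUFFIXES = ["ीय", "ों", "ें"]

def strip_suffix(word):
    """Remove the first matching suffix, keeping at least two characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    # No suffix matched: return the word unchanged.
    return word

stem = strip_suffix("भारतीय")
```

The minimum stem length guards against reducing very short words to nothing.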

This will return the dictionary of stemmed words.

stem_dict=t.generate_stem_dict()

This will print the dictionary of stemmed words returned by generate_stem_dict.

stem_dict=t.generate_stem_dict()
t.print_stem_dict(stem_dict)

This will remove all the stopwords from the given text.

t.remove_stopwords()
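
Stopword removal is usually a simple filter against a fixed word list. The standalone sketch below uses a small illustrative set; both the set and the name drop_stopwords are assumptions for this example and do not reflect the list shipped with this package.

```python
# Tiny illustrative stopword set, not the package's actual list.
STOPWORDS = {"यह", "में", "है", "और", "का", "की", "के"}

def drop_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

content_words = drop_stopwords(["यह", "वाक्य", "हिन्दी", "में", "है"])
```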

This will remove all the punctuation symbols from the given text.

t.clean_text()
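
Punctuation removal can be sketched with Unicode character categories, which cover the danda as well as ASCII punctuation while leaving Devanagari vowel signs intact. The name strip_punctuation is hypothetical, and this approach is an assumption for illustration rather than the package's implementation.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*)."""
    kept = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )
    # Collapse any runs of whitespace left behind.
    return " ".join(kept.split())

cleaned = strip_punctuation("यह वाक्य, हिन्दी में है।")
```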

This will return the length of the given text.

print(t.len_text())

Given a text, this will return the number of sentences in it.

print(t.sentence_count())

Given a text, this will return the number of tokens in it.

print(t.tokens_count())

Given a word, this will return all the sentences in the text in which that word occurs. They can then be printed with print_sentences.

sentences=t.concordace("हिन्दी")
t.print_sentences(sentences)
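
The lookup described here amounts to filtering sentences by whether they contain the word as a whole token. A minimal standalone sketch, assuming the sentences are already split and using the hypothetical name find_sentences:

```python
def find_sentences(sentences, word):
    """Return the sentences that contain word as a whitespace-separated token."""
    return [s for s in sentences if word in s.split()]

hits = find_sentences(
    ["यह वाक्य हिन्दी में है", "यह दूसरा वाक्य है"],
    "हिन्दी",
)
```

Matching on whole tokens rather than substrings avoids returning sentences where the word only appears inside a longer word.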
