keras-text is a one-stop text classification library implementing various state-of-the-art models, with a clean and extensible interface for implementing custom architectures.
To represent your dataset as (docs, words), use WordTokenizer.
To represent your dataset as (docs, sentences, words), use SentenceWordTokenizer.
To create other hierarchies, extend Tokenizer and implement the token_generator method.
from keras_text.processing import WordTokenizer
tokenizer = WordTokenizer()
tokenizer.build_vocab(texts)
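To make the idea of "building a vocabulary" concrete, here is a minimal standalone sketch. It is not keras-text's implementation: the helper names `build_token_index` and `encode` are hypothetical, whitespace splitting stands in for spaCy tokenization, and reserving id 0 for padding is an assumption.

```python
from collections import Counter

def build_token_index(texts, min_count=1):
    """Map each lowercased, whitespace-separated token to an integer id.

    Assumption: id 0 is reserved for padding, so real tokens start at 1.
    """
    counts = Counter(token for text in texts for token in text.lower().split())
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [token for token, count in ordered if count >= min_count]
    return {token: idx for idx, token in enumerate(kept, start=1)}

def encode(text, token_index):
    """Convert text to ids, dropping tokens that are not in the vocabulary."""
    return [token_index[t] for t in text.lower().split() if t in token_index]

texts = ["the cat sat", "the dog sat down"]
token_index = build_token_index(texts)
encoded = [encode(t, token_index) for t in texts]
print(token_index)
print(encoded)
```

The resulting mapping plays the same role as the `tokenizer.token_index` passed to the model factories later in this document.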
Want to tokenize with character tokens to leverage character models? Use CharTokenizer.
A dataset encapsulates the tokenizer, X, y, and the test set. This lets you focus on trying various architectures and hyperparameters without having to worry about inconsistent evaluation. A dataset can be saved to and loaded from disk.
from keras_text.data import Dataset
ds = Dataset(X, y, tokenizer=tokenizer)
ds.update_test_indices(test_size=0.1)
ds.save('dataset')
The update_test_indices method automatically stratifies multi-class or multi-label data correctly.
See tests/ folder for usage.
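As an illustration of what a stratified test split does, the following sketch selects test indices so that each class keeps roughly its original share. It is an assumption-laden stand-in, not the library's code: `stratified_test_indices` is a hypothetical helper, it handles only single-label multi-class data (not the multi-label case), and it always takes at least one test example per class.

```python
import random
from collections import defaultdict

def stratified_test_indices(labels, test_size=0.1, seed=0):
    """Return sorted test-set indices that preserve per-class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test_indices = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the rounded per-class share, but never fewer than one example.
        n_test = max(1, round(len(indices) * test_size))
        test_indices.extend(indices[:n_test])
    return sorted(test_indices)

labels = [0] * 80 + [1] * 20
test_idx = stratified_test_indices(labels, test_size=0.1)
print(len(test_idx), sum(labels[i] for i in test_idx))
```

With an 80/20 class balance and `test_size=0.1`, the test set holds 8 examples of class 0 and 2 of class 1, so evaluation sees the same class mix as training.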
When the dataset is represented as (docs, words), word-based models can be created using TokenModelFactory.
from keras_text.models import TokenModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN
# RNN models can use `max_tokens=None` to indicate variable length words per mini-batch.
factory = TokenModelFactory(1, tokenizer.token_index, max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
Currently supported models include YoonKimCNN, AttentionRNN, and StackedRNN.
TokenModelFactory.build_model uses the provided word encoder, whose output is then classified via a Dense block.
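The `max_tokens=100` argument in the factory call above fixes the number of tokens per document. A minimal sketch of what fixing the length involves is shown below. The helper `pad_or_truncate` is hypothetical, and padding or cutting at the end with id 0 is an assumption; which side keras-text pads or truncates is not stated here.

```python
def pad_or_truncate(token_ids, max_tokens, pad_id=0):
    """Return a list of exactly max_tokens ids.

    Assumption: extra tokens are cut from the end and padding is appended.
    """
    clipped = token_ids[:max_tokens]
    return clipped + [pad_id] * (max_tokens - len(clipped))

batch = [[5, 2, 9], [7, 1, 4, 8, 3, 6]]
padded = [pad_or_truncate(doc, max_tokens=4) for doc in batch]
print(padded)
```

A fixed length lets every document in a batch share one tensor shape, which is why convolutional encoders need it, while the RNN case with `max_tokens=None` can defer the choice to each mini-batch.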
When the dataset is represented as (docs, sentences, words), sentence-based models can be created using SentenceModelFactory.
from keras_text.models import SentenceModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN, AveragingEncoder
# Pad max sentences per doc to 500 and max words per sentence to 200.
# Can also use `max_sents=None` to allow variable sized max_sents per mini-batch.
factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500, max_tokens=200, embedding_type='glove.6B.100d')
word_encoder_model = AttentionRNN()
sentence_encoder_model = AttentionRNN()
# Allows you to compose arbitrary word encoders followed by sentence encoder.
model = factory.build_model(word_encoder_model, sentence_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
Currently supported models include YoonKimCNN, AttentionRNN, StackedRNN, and AveragingEncoder.
SentenceModelFactory.build_model creates a tiered model in which the words within each sentence are first encoded using word_encoder_model. The resulting per-sentence encodings are then encoded using sentence_encoder_model.
An averaging encoder can be selected for the word level with token_encoder_model=AveragingEncoder()
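The tiered structure described above (words encoded into sentence vectors, then sentence vectors encoded into a document vector) can be illustrated without any deep learning framework. The sketch below uses plain averaging at both levels; `mean_vector` and `encode_document` are hypothetical names, and the real encoders are trainable Keras models rather than fixed functions.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def encode_document(doc, word_encoder=mean_vector, sentence_encoder=mean_vector):
    """Encode a document given as a list of sentences of word vectors.

    Step 1: each sentence's word vectors are reduced to one sentence vector.
    Step 2: the sentence vectors are reduced to one document vector.
    """
    sentence_vectors = [word_encoder(sentence) for sentence in doc]
    return sentence_encoder(sentence_vectors)

doc = [
    [[1.0, 0.0], [3.0, 2.0]],  # sentence 1: two word vectors
    [[0.0, 4.0]],              # sentence 2: one word vector
]
doc_vector = encode_document(doc)
print(doc_vector)
```

Because the two levels are passed in separately, either one can be swapped for a different encoder, which mirrors how the factory composes a word encoder with a sentence encoder.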
TODO: Update documentation and add notebook examples.
Stay tuned for better documentation and examples. Until then, the API docs are the best resource.
1) Install Keras with the Theano or TensorFlow backend. Note that this library requires Keras > 2.0.
2) Install keras-text
From sources
sudo python setup.py install
PyPI package
sudo pip install keras-text
3) Download the target spaCy model
keras-text uses the excellent spaCy library for tokenization. See the instructions on how to download a model for the target language.
Please cite keras-text in your publications if it helped your research. Here is an example BibTeX entry:
@misc{raghakotkerastext,
title={keras-text},
author={Kotikalapudi, Raghavendra and contributors},
year={2017},
publisher={GitHub},
howpublished={\url{https://github.com/raghakot/keras-text}},
}