rkcosmos / deepcut

A Thai word tokenization library using Deep Neural Network
MIT License

Script for training model, remove scikit-learn label encoder #4

Closed · titipata closed this 7 years ago

titipata commented 7 years ago

[Work in Progress] @rkcosmos, I'm trying to create a reproducible script for training the model (currently a notebook in the Research-Notebook folder).

With this PR, the script takes the BEST corpus path and saves cleaned CSV files to the folder given by output_path. Then train_model can be used to train a model from the cleaned BEST path.

import deepcut

path_to_best = '' # path to unzipped BEST dataset
best_processed_path = '/cleaned_data/'
deepcut.train.generate_best_dataset(path_to_best, output_path=best_processed_path)  # clean BEST corpus and save CSVs
model = deepcut.train.train_model(best_processed_path) # train model

I checked the output char_le; it is slightly different from the current char_le loaded from the pickle file.
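
For reference, the kind of check I did looks roughly like this (assuming object.pk directly pickles the fitted character LabelEncoder; the new character list below is just a placeholder):

import pickle

with open('object.pk', 'rb') as f:  # path to the pickled encoder shipped with the current release
    char_le_old = pickle.load(f)

chars_new = ['\n', ' ', '!', '"']  # placeholder: character vocabulary produced by the new script
print(set(char_le_old.classes_) ^ set(chars_new))  # characters present in one vocabulary but not the other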

It would be great if you could suggest how you want it to be. Happy to chat more later on!

rkcosmos commented 7 years ago

LabelEncoder has issues across different versions of sklearn. It is best to get rid of it and create a manual dictionary for the mapping, like {'a': 0, 'b': 1, ...}, instead of relying on sklearn's function. I will take care of this (+ hopefully improve the network architecture) and rerun all training in a few days.
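
Roughly something like this (just a sketch; CHAR_INDEX and encode_chars are illustrative names, not the final implementation):

# a fixed mapping replaces the fitted LabelEncoder,
# so the encoding no longer depends on the installed sklearn version
CHAR_INDEX = {'\n': 0, ' ': 1, 'a': 2, 'b': 3}  # ... extended with the full character set

def encode_chars(chars, unknown_index=None):
    """Map a sequence of characters to integer indices using the fixed dictionary."""
    if unknown_index is None:
        unknown_index = len(CHAR_INDEX)  # bucket for characters outside the dictionary
    return [CHAR_INDEX.get(ch, unknown_index) for ch in chars]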

titipata commented 7 years ago

@rkcosmos, taken care of in this PR :). I put in a dictionary, as you suggested, instead of the LabelEncoder. I didn't remove object.pk, but you can remove it later on.

titipata commented 7 years ago

I think it's ready to be reviewed and maybe merged later. The current workflow for training the model looks like the following:

import deepcut

# preprocess
best_path = ''
best_processed_path = 'cleaned_data/'
deepcut.train.generate_best_dataset(best_path, output_path=best_processed_path) 

# training: fit with progressively larger batch sizes
x_train_char, x_train_type, y_train = deepcut.train.prepare_feature(best_processed_path, option='train')
model = deepcut.model.get_convo_nn2()
model.fit([x_train_char, x_train_type], y_train, epochs=10, batch_size=256, verbose=1)
model.fit([x_train_char, x_train_type], y_train, epochs=3, batch_size=512, verbose=1)
model.fit([x_train_char, x_train_type], y_train, epochs=3, batch_size=2048, verbose=1)
model.fit([x_train_char, x_train_type], y_train, epochs=3, batch_size=4096, verbose=1)
model.fit([x_train_char, x_train_type], y_train, epochs=3, batch_size=8192, verbose=1)

# evaluating
f1score, precision, recall = deepcut.train.evaluate(best_processed_path, model)
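
The trained weights can then be saved with the standard Keras call so they can be reloaded later (the file name below is only an example):

# persist the trained weights for later use
model.save_weights('cnn_weights.h5')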