Movie Rate Prediction with Tensorflow
MIT License
char2vec doc2vec gensim korean-nlp mecab-ko movie-review-classifier naver soynlp tensorflow text-classification textcnn textrnn word2vec

Movie Rate Prediction

Movie rating prediction with TensorFlow

DataSet | Language | Sentences | Words | Size
NAVER Movie Review | Korean | 8.86M | 391K | About 1GB

Movie Review Data Distribution



1.1 Installing Dependencies

# Necessary
$ sudo python3 -m pip install -r requirements.txt
# Optional
$ sudo python3 -m pip install -r opt_requirements.txt

1.2 Configuration

# `config.py` holds many parameters used by the scripts. Please adjust them for your environment.

2. Parsing the DataSet

$ python3 movie-parser.py

3. Making DataSet DB

$ python3 db.py

4. Making w2v/d2v embeddings (skip this step if you only want to use Char2Vec)

$ python3 preprocessing.py

usage: preprocessing.py [-h] [--load_from {db,csv}] [--vector {d2v,w2v}]
                        [--is_analyzed IS_ANALYZED]

Pre-Processing NAVER Movie Review Comment

optional arguments:
  -h, --help            show this help message and exit
  --load_from {db,csv}  load DataSet from db or csv
  --vector {d2v,w2v}    d2v or w2v
  --is_analyzed IS_ANALYZED
                        already analyzed data
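
For reference, the help text above is consistent with an `argparse` setup along the following lines. This is only a sketch reconstructed from the usage output: the default values and the `str2bool` helper are assumptions, not the script's actual code.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: turn 'True'/'False' style strings into a bool."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    parser = argparse.ArgumentParser(
        prog="preprocessing.py",
        description="Pre-Processing NAVER Movie Review Comment",
    )
    # Flag names and choices follow the usage text; the defaults are assumed.
    parser.add_argument("--load_from", choices=["db", "csv"], default="db",
                        help="load DataSet from db or csv")
    parser.add_argument("--vector", choices=["d2v", "w2v"], default="w2v",
                        help="d2v or w2v")
    parser.add_argument("--is_analyzed", type=str2bool, default=False,
                        help="already analyzed data")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--load_from", "csv", "--vector", "d2v",
                                  "--is_analyzed", "False"])
print(args)
```

An explicit converter is used because a plain `type=bool` would treat the string `False` as true.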

5. Training a Model

$ python3 main.py --refine_data [True or False]

usage: main.py [-h] [--checkpoint CHECKPOINT] [--refine_data REFINE_DATA]

train/test movie review classification model

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        pre-trained model
  --refine_data REFINE_DATA
                        solving data imbalance problem
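
`--refine_data` is described only as "solving data imbalance problem"; how the script does that is not stated here. One common approach is to downsample every rating class to the size of the rarest one. The sketch below illustrates that idea with hypothetical names (`balance_by_downsampling`, `(text, rate)` tuples) and is not necessarily what `main.py` does.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=42):
    """Downsample each rating class to the size of the rarest class.

    `samples` is a list of (text, rate) tuples (an assumed format).
    """
    by_rate = defaultdict(list)
    for text, rate in samples:
        by_rate[rate].append((text, rate))

    smallest = min(len(group) for group in by_rate.values())
    rng = random.Random(seed)

    balanced = []
    for rate in sorted(by_rate):
        # Draw `smallest` items without replacement from each class.
        balanced.extend(rng.sample(by_rate[rate], smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data with a skewed rating distribution (made up for illustration).
reviews = [("review %d" % i, 10) for i in range(8)]
reviews += [("review %d" % i, 1) for i in range(8, 11)]
reviews += [("review %d" % i, 5) for i in range(11, 13)]

balanced = balance_by_downsampling(reviews)
counts = defaultdict(int)
for _, rate in balanced:
    counts[rate] += 1
print(dict(counts))
```

Downsampling discards data from the frequent classes; oversampling the rare classes or weighting the loss are alternatives with different trade-offs.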

Repo Tree

├── comments          (NAVER Movie Review DataSets)
│    ├── 10000.sql
│    ├── ...
│    └── 200000.sql
├── w2v               (Word2Vec)
│    ├── ko_w2v.model (Word2Vec trained gensim model)
│    └── ...
├── d2v               (Doc2Vec)
│    ├── ko_d2v.model (Doc2Vec trained gensim model)
│    └── ...
├── model             (Movie Review Rate ML Models)
│    ├── textcnn.py
│    └── textrnn.py
├── image             (explanation images)
│    └── *.png
├── ml_model          (pre-trained tf models are saved here)
│    ├── checkpoint
│    ├── ...
│    └── charcnn-best_loss.ckpt
├── config.py         (Configuration)
├── tfutil.py         (handy tfutils)
├── dataloader.py     (Doc/Word2Vec model loader)
├── movie-parser.py   (NAVER Movie Review Parser)
├── db.py             (DataBase processing)
├── preprocessing.py  (Korean normalize/tokenize)
├── visualize.py      (for visualizing w2v)
└── main.py           (for easy use of train/test)

Pre-Trained Models

Here's a Google Drive link. You can download the pre-trained models from there!



Credit: the 1st-place solution of the Kaggle Toxic Comment Classification challenge


The DataSet is not good, so the results are not as good as I expected :(
The raw sentences need to be refined and normalized!


Result : train MSE 1.553, val MSE 3.341
Hyper-Parameter : rand, conv kernel size [10,9,7,5,3], conv filters 256, dropout 0.7, fc units 1024, adam, embed size 384


Result : train MSE 3.410
Hyper-Parameter : non-static, conv kernel size [2,3,4,5], conv filters 256, dropout 0.7, fc units 1024, adadelta, embed size 300


Result : train MSE 3.646
Hyper-Parameter : non-static, rnn cells 128, attention 128, dropout 0.7, fc units 1024, adadelta, embed size 300
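
The MSE figures above are mean squared errors between predicted and true ratings, so their square root gives a typical error in rating points (for example, a validation MSE of 3.341 corresponds to an RMSE of about 1.83). Below is a minimal sketch of the metric in plain Python. The ratings are made up and assume a 1 to 10 scale; the function name is only illustrative.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted ratings."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example values, not taken from the dataset.
true_rates = [10, 8, 1, 7]
pred_rates = [9.0, 8.5, 3.0, 6.5]

mse = mean_squared_error(true_rates, pred_rates)
rmse = math.sqrt(mse)
print("MSE = %.3f, RMSE = %.3f" % (mse, rmse))
```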



To view the training logs, run tensorboard --logdir=./ml_model/

Word2Vec Embeddings (t-SNE)


Perplexity : 80
Learning rate : 10
Iteration : 310


  1. Deal with the word spacing problem


Suggestions, PRs, and issues are all WELCOME :)


HyeongChan Kim / @kozistr