Movie Rate Prediction

Movie rating prediction with TensorFlow

Environments

OS : Ubuntu 16.04+ / Windows 10
CPU : any (quad core or better)
GPU : GTX 1060 6GB or better
RAM : 16GB or more
Library : TF 1.x with CUDA 9.0+ and cuDNN 7.0+

Prerequisites

Python
MySQL DB
tensorflow 1.x
numpy
gensim, konlpy, and soynlp
mecab-ko
pymysql
h5py
tqdm
(Optional) java 1.7+
(Optional) PyKoSpacing
(Optional) MultiTSNE (for visualization)
(Optional) matplotlib (for visualization)

DataSet

DataSet	Language	Sentences	Words	Size
NAVER Movie Review	Korean	8.86M	391K	About 1GB

Movie Review Data Distribution

[Figure: movie review data distribution]

Usage

1.1 Installing Dependencies

# Necessary
$ sudo python3 -m pip install -r requirements.txt
# Optional
$ sudo python3 -m pip install -r opt_requirements.txt

1.2 Configuration

# `config.py` holds many parameters used by the scripts. Adjust them to your environment before running.
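As a rough illustration of what adjusting these parameters might look like, here is a minimal sketch of an argparse-based config module. The option names (`--embed_size`, `--kernel_sizes`, and so on) are hypothetical and are not taken from the actual `config.py`. Only the default values come from this README, namely the TextCNN (Char2Vec) hyper-parameters reported under Results.

```python
import argparse


def build_config(argv=None):
    """Hypothetical config sketch; option names are NOT from the real config.py."""
    parser = argparse.ArgumentParser(description="illustrative training config")
    # Defaults mirror the TextCNN (Char2Vec) hyper-parameters listed under Results.
    parser.add_argument("--embed_size", type=int, default=384)
    parser.add_argument("--kernel_sizes", type=int, nargs="+",
                        default=[10, 9, 7, 5, 3])
    parser.add_argument("--n_filters", type=int, default=256)
    parser.add_argument("--drop_out", type=float, default=0.7)
    parser.add_argument("--fc_unit", type=int, default=1024)
    parser.add_argument("--optimizer", choices=["adam", "adadelta"],
                        default="adam")
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of the real command line.
cfg = build_config(["--drop_out", "0.5"])
print(vars(cfg))
```

Keeping every tunable value in one place like this means each experiment can be reproduced from a single command line.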

2. Parsing the DataSet

$ python3 movie-parser.py

3. Making DataSet DB

$ python3 db.py

4. Making w2v/d2v embeddings (skip if you only want to use Char2Vec)

$ python3 preprocessing.py

usage: preprocessing.py [-h] [--load_from {db,csv}] [--vector {d2v,w2v}]
                        [--is_analyzed IS_ANALYZED]

Pre-Processing NAVER Movie Review Comment

optional arguments:
  -h, --help            show this help message and exit
  --load_from {db,csv}  load DataSet from db or csv
  --vector {d2v,w2v}    d2v or w2v
  --is_analyzed IS_ANALYZED
                        already analyzed data
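The usage text above could come from a parser along these lines. This is a sketch reconstructed from the help output, not the script's actual source. The `str2bool` helper and all default values are assumptions. One detail worth knowing: `type=bool` in argparse treats any non-empty string (including "False") as true, so a small converter is used for `--is_analyzed`.

```python
import argparse


def str2bool(value):
    """Convert 'True'/'False'-style strings to bool (hypothetical helper)."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    parser = argparse.ArgumentParser(
        prog="preprocessing.py",
        description="Pre-Processing NAVER Movie Review Comment",
    )
    # Defaults below are assumptions; the help output does not show them.
    parser.add_argument("--load_from", choices=["db", "csv"], default="db",
                        help="load DataSet from db or csv")
    parser.add_argument("--vector", choices=["d2v", "w2v"], default="w2v",
                        help="d2v or w2v")
    parser.add_argument("--is_analyzed", type=str2bool, default=False,
                        help="already analyzed data")
    return parser


args = build_parser().parse_args(
    ["--load_from", "csv", "--vector", "d2v", "--is_analyzed", "False"]
)
print(args.load_from, args.vector, args.is_analyzed)
```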

5. Training a Model

$ python3 main.py --refine_data [True or False]

usage: main.py [-h] [--checkpoint CHECKPOINT] [--refine_data REFINE_DATA]

train/test movie review classification model

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        pre-trained model
  --refine_data REFINE_DATA
                        solving data imbalance problem
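The help text only says that `--refine_data` addresses data imbalance; it does not say how. One common approach is to undersample the over-represented ratings so that every rating keeps the same number of reviews. The sketch below shows that idea in plain Python. The function name `undersample` and the `(comment, rate)` tuple layout are assumptions, and the script itself may use a different strategy.

```python
import random
from collections import defaultdict


def undersample(samples, seed=42):
    """Keep an equal number of reviews per rating (hypothetical helper).

    `samples` is a list of (comment, rate) tuples; this layout is assumed.
    """
    if not samples:
        return []

    by_rate = defaultdict(list)
    for comment, rate in samples:
        by_rate[rate].append((comment, rate))

    # The rarest rating decides how many reviews each rating keeps.
    n_keep = min(len(group) for group in by_rate.values())

    rng = random.Random(seed)
    balanced = []
    for rate in sorted(by_rate):
        balanced.extend(rng.sample(by_rate[rate], n_keep))
    rng.shuffle(balanced)
    return balanced


# Toy data: rating 10 is heavily over-represented.
toy = [("great", 10)] * 8 + [("meh", 5)] * 3 + [("bad", 1)] * 2
balanced = undersample(toy)
print(len(balanced))
```

Undersampling discards data, so on a heavily skewed dataset it trades sample count for a more even rating distribution.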

Repo Tree

│
├── comments          (NAVER Movie Review DataSets)
│    ├── 10000.sql
│    ├── ...
│    └── 200000.sql
├── w2v               (Word2Vec)
│    ├── ko_w2v.model (Word2Vec trained gensim model)
│    └── ...
├── d2v               (Doc2Vec)
│    ├── ko_d2v.model (Doc2Vec trained gensim model)
│    └── ...
├── model             (Movie Review Rate ML Models)
│    ├── textcnn.py
│    └── textrnn.py
├── image             (explanation images)
│    └── *.png
├── ml_model          (TF pre-trained models are saved here)
│    ├── checkpoint
│    ├── ...
│    └── charcnn-best_loss.ckpt
├── config.py         (Configuration)
├── tfutil.py         (handy tfutils)
├── dataloader.py     (Doc/Word2Vec model loader)
├── movie-parser.py   (NAVER Movie Review Parser)
├── db.py             (DataBase processing)
├── preprocessing.py  (Korean normalize/tokenize)
├── visualize.py      (for visualizing w2v)
└── main.py           (for easy use of train/test)

Pre-Trained Models

Pre-trained models are hosted on Google Drive. You can download them from the links below.

Embedding Models
- Word2Vec model : here
M.L Models
- TextCNN model : here
- TextRNN model : here

Models

TextCNN

Credit: the 1st place solution to the Kaggle Toxic Comment Classification challenge

TextRNN

Credit: the 1st place solution to the Kaggle Toxic Comment Classification challenge

Results

The dataset is of poor quality, so the results are not as good as I expected :(
The raw sentences need to be refined and normalized!

TextCNN (Char2Vec)

Result : train MSE 1.553, val MSE 3.341
Hyper-Parameter : rand, conv kernel size [10,9,7,5,3], conv filters 256, drop out 0.7, fc unit 1024, adam, embed size 384
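Assuming Char2Vec here means character-level input fed to a randomly initialized embedding (the `rand` setting above), the input side can be sketched as follows: each character of a review is mapped to an integer index, and the sequence is truncated or padded to a fixed length. The vocabulary construction, the reserved indices and `max_len` are all assumptions for illustration, not the repository's actual preprocessing.

```python
PAD, UNK = 0, 1  # reserved indices (assumed convention)


def build_char_vocab(sentences):
    """Assign an integer index to every character seen in `sentences`."""
    vocab = {}
    for sentence in sentences:
        for ch in sentence:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 2  # 0 and 1 are reserved
    return vocab


def encode(sentence, vocab, max_len):
    """Turn a sentence into a fixed-length list of character indices."""
    ids = [vocab.get(ch, UNK) for ch in sentence[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


reviews = ["재미있어요", "별로예요"]
vocab = build_char_vocab(reviews)
# The character "없" is not in the vocabulary, so it maps to UNK.
print(encode("재미없어요", vocab, max_len=8))  # [2, 3, 1, 5, 6, 0, 0, 0]
```

Working at the character level avoids the need for a morphological analyzer, which may be why the w2v/d2v step can be skipped in this mode.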

TextCNN (Word2Vec)

Result : train MSE 3.410
Hyper-Parameter : non-static, conv kernel size [2,3,4,5], conv filters 256, drop out 0.7, fc unit 1024, adadelta, embed size 300

TextRNN (Word2Vec)

Result : train MSE 3.646
Hyper-Parameter : non-static, rnn cells 128, attention 128, drop out 0.7, fc unit 1024, adadelta, embed size 300
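The results above are reported as mean squared error (MSE) between predicted and actual ratings. For reference, here is a minimal sketch of the metric in plain Python. The sample ratings are made up for illustration and are not taken from the dataset.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of squared differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ratings, for illustration only.
actual = [10, 8, 1, 5]
predicted = [9, 8, 3, 4]
error = mse(actual, predicted)
print(error)             # 1.5
print(math.sqrt(error))  # RMSE, in the same unit as the ratings
```

Taking the square root expresses the error in rating points: the reported val MSE of 3.341 corresponds to an RMSE of roughly 1.83.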

TextRNN (Char2Vec)

Coming soon!

Visualization

To view the training logs, run tensorboard --logdir=./ml_model/

Word2Vec Embeddings (t-SNE)

Perplexity : 80
Learning rate : 10
Iteration : 310

To-Do

Deal with the word spacing problem

ETC

Any suggestions, PRs, and issues are WELCOME :)

Author

HyeongChan Kim / @kozistr
