
Movie Rate Prediction

Movie rate prediction with TensorFlow


Environments

Prerequisites

DataSet

| DataSet | Language | Sentences | Words | Size |
| --- | --- | --- | --- | --- |
| NAVER Movie Review | Korean | 8.86M | 391K | About 1GB |

Movie Review Data Distribution

(rating distribution plot)

Usage

1.1 Installing Dependencies

# Necessary
$ sudo python3 -m pip install -r requirements.txt
# Optional
$ sudo python3 -m pip install -r opt_requirements.txt

1.2 Configuration

In `config.py` there are many parameters used by the scripts; adjust them to your environment before running.
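Purely as an illustration of the kind of values involved (the actual parameter names in `config.py` may differ):

```python
# Hypothetical sketch of the kind of settings config.py centralizes.
# The real parameter names/values in the repository may differ.
batch_size = 128            # training batch size
epochs = 10                 # number of training epochs
embed_size = 300            # embedding dimension (e.g. for w2v/d2v)
data_path = "./comments/"   # where the parsed review data lives
model_path = "./ml_model/"  # where TF checkpoints are written
```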

2. Parsing the DataSet

$ python3 movie-parser.py

3. Making DataSet DB

$ python3 db.py
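
For orientation only, populating a local MySQL database with parsed (rating, comment) rows might look roughly like the sketch below; the database, table, and column names are assumptions, not necessarily the schema `db.py` uses:

```python
# Illustrative sketch only -- the real db.py may work differently.
# Inserts parsed (rating, comment) rows into a local MySQL table;
# the database/table/column names below are assumptions, not the repo's schema.
import pymysql

rows = [(10, "정말 재밌는 영화"), (2, "시간이 아깝다")]  # placeholder parsed data

conn = pymysql.connect(host="localhost", user="root", password="",
                       db="movie", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS movie_review (rate INT, comment TEXT)")
        cur.executemany("INSERT INTO movie_review (rate, comment) VALUES (%s, %s)", rows)
    conn.commit()
finally:
    conn.close()
```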

4. Making w2v/d2v embeddings (skip this step if you only want to use Char2Vec)

$ python3 preprocessing.py

usage: preprocessing.py [-h] [--load_from {db,csv}] [--vector {d2v,w2v}]
                        [--is_analyzed IS_ANALYZED]

Pre-Processing NAVER Movie Review Comment

optional arguments:
  -h, --help            show this help message and exit
  --load_from {db,csv}  load DataSet from db or csv
  --vector {d2v,w2v}    d2v or w2v
  --is_analyzed IS_ANALYZED
                        already analyzed data
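
For reference, training Word2Vec/Doc2Vec embeddings with gensim on tokenized review sentences looks roughly like the sketch below (gensim >= 4.x API assumed; the options `preprocessing.py` actually uses may differ):

```python
# Minimal sketch (not the repo's exact pipeline): training w2v/d2v embeddings
# with gensim on already-tokenized review sentences (gensim >= 4.x API).
from gensim.models import Doc2Vec, Word2Vec
from gensim.models.doc2vec import TaggedDocument

# placeholder data: each review is a list of tokens (e.g. from Mecab/soynlp)
tokenized_reviews = [
    ["이", "영화", "정말", "재밌다"],
    ["스토리", "가", "너무", "지루하다"],
]

# Word2Vec: one vector per token
w2v = Word2Vec(sentences=tokenized_reviews, vector_size=300, window=5,
               min_count=1, workers=4, sg=1)
w2v.save("ko_w2v.model")

# Doc2Vec: one vector per review (document)
tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(tokenized_reviews)]
d2v = Doc2Vec(documents=tagged, vector_size=300, window=5, min_count=1, workers=4)
d2v.save("ko_d2v.model")
```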

5. Training a Model

$ python3 main.py --refine_data [True or False]

usage: main.py [-h] [--checkpoint CHECKPOINT] [--refine_data REFINE_DATA]

train/test movie review classification model

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        pre-trained model
  --refine_data REFINE_DATA
                        solving data imbalance problem
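
For reference, restoring one of the saved checkpoints in `ml_model/` outside of `main.py` could look roughly like this sketch, which uses the standard TensorFlow 1.x `tf.train.Saver` / `import_meta_graph` API; whether a matching `.meta` file is exported alongside `charcnn-best_loss.ckpt` is an assumption:

```python
# Sketch only (not the repo's exact code): restoring a pre-trained TF 1.x
# checkpoint such as ml_model/charcnn-best_loss.ckpt with the standard
# tf.train.Saver API. The presence of a matching .meta graph file is assumed.
import tensorflow as tf

ckpt_path = "./ml_model/charcnn-best_loss.ckpt"

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(ckpt_path + ".meta")
    saver.restore(sess, ckpt_path)
    graph = tf.get_default_graph()
    # tensors are then fetched by name, e.g. graph.get_tensor_by_name("...")
```

To continue training through `main.py` itself, the `--checkpoint` flag listed above is the intended route.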

Repo Tree

│
├── comments          (NAVER Movie Review DataSets)
│    ├── 10000.sql
│    ├── ...
│    └── 200000.sql
├── w2v               (Word2Vec)
│    ├── ko_w2v.model (Word2Vec trained gensim model)
│    └── ...
├── d2v               (Doc2Vec)
│    ├── ko_d2v.model (Doc2Vec trained gensim model)
│    └── ...
├── model             (Movie Review Rate ML Models)
│    ├── textcnn.py
│    └── textrnn.py
├── image             (explanation images)
│    └── *.png
├── ml_model          (tf pre-trained model saved in here)
│    ├── checkpoint
│    ├── ...
│    └── charcnn-best_loss.ckpt
├── config.py         (Configuration)
├── tfutil.py         (handy TensorFlow utilities)
├── dataloader.py     (Doc/Word2Vec model loader)
├── movie-parser.py   (NAVER Movie Review Parser)
├── db.py             (DataBase processing)
├── preprocessing.py  (Korean normalize/tokenize)
├── visualize.py      (for visualizing w2v)
└── main.py           (for easy use of train/test)

Pre-Trained Models

Here is a Google Drive link; you can download the pre-trained models from there!

Models

(Model architecture diagrams, credited to the Toxic Comment Classification Kaggle 1st place solution.)
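
For orientation, the TextCNN idea (parallel convolutions of several widths over the embedded token sequence, max-pooled and concatenated before dense layers) can be sketched in `tf.keras` as below. This is only an illustration under assumed vocabulary/length values, not the repo's `model/textcnn.py`:

```python
# Minimal TextCNN-style regressor in tf.keras -- an illustration of the idea,
# not the repo's model/textcnn.py. Vocabulary size and sequence length are placeholders.
import tensorflow as tf

def build_textcnn(vocab_size=50000, max_len=140, embed_size=300,
                  kernel_sizes=(2, 3, 4, 5), filters=256):
    inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
    x = tf.keras.layers.Embedding(vocab_size, embed_size)(inputs)
    # one convolution + global max-pool branch per kernel width
    pooled = [tf.keras.layers.GlobalMaxPooling1D()(
                  tf.keras.layers.Conv1D(filters, k, activation="relu")(x))
              for k in kernel_sizes]
    h = tf.keras.layers.Concatenate()(pooled)
    h = tf.keras.layers.Dropout(0.3)(h)
    h = tf.keras.layers.Dense(1024, activation="relu")(h)
    outputs = tf.keras.layers.Dense(1)(h)  # scalar rating -> regression with MSE loss
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```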

Results

The dataset is quite noisy, so the results are not as good as expected :(
Further refining/normalizing of the raw sentences is needed!


Result : train MSE 1.553, val MSE 3.341
Hyper-Parameter : rand, conv kernel size [10,9,7,5,3], conv filters 256, drop out 0.7, fc unit 1024, adam, embed size 384


Result : train MSE 3.410
Hyper-Parameter : non-static, conv kernel size [2,3,4,5], conv filters 256, drop out 0.7, fc unit 1024, adadelta, embed size 300


Result : train MSE 3.646
Hyper-Parameter : non-static, rnn cells 128, attention 128, drop out 0.7, fc unit 1024, adadelta, embed size 300

More results coming soon!

Visualization

Simply run `tensorboard --logdir=./ml_model/`.

Word2Vec Embeddings (t-SNE)


Perplexity : 80
Learning rate : 10
Iteration : 310
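
If you want an offline plot instead of the TensorBoard projector, a t-SNE scatter of the Word2Vec vectors can be produced roughly as below; this is a sketch, not the repo's `visualize.py`, and assumes gensim >= 4.x plus scikit-learn/matplotlib:

```python
# Sketch only (not the repo's visualize.py): offline t-SNE plot of Word2Vec vectors.
# Assumes gensim >= 4.x (wv.index_to_key; gensim 3.x uses wv.index2word),
# plus scikit-learn and matplotlib.
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

model = Word2Vec.load("ko_w2v.model")
words = model.wv.index_to_key[:500]   # a few hundred most frequent tokens
vectors = model.wv[words]

# perplexity/learning rate follow the settings listed above; the iteration count
# is left at the scikit-learn default (the keyword differs across versions).
coords = TSNE(n_components=2, perplexity=80, learning_rate=10.0,
              init="pca", random_state=42).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=8)
plt.savefig("w2v_tsne.png")
```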

To-Do

  1. Deal with the Korean word-spacing problem

ETC

Any suggestions, PRs, and issues are welcome :)

Author

HyeongChan Kim / @kozistr