Tutorial for (Korean) text mining

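These are materials for studying text mining. They include language-independent natural language processing and machine learning material, as well as material specific to Korean text analysis.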

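This material is a work in progress and includes slides and Jupyter notebook example code.
The material uses the soynlp package, a natural language processing library for Korean text analysis. soynlp is also a work in progress.
Write-ups related to the slide contents are being posted on the blog.
The practice code is in the code repository.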

Python basic
1. Jupyter tutorial
From text to vector (KoNLPy)
1. [x] n-gram
2. [x] from text to vector using KoNLPy
Word extraction and tokenization (Korean)
1. [x] word extractor
2. [x] unsupervised tokenizer
3. [x] noun extractor
4. [x] dictionary based pos tagger
Document classification
1. [x] Logistic Regression and Lasso regression
2. [x] SVM (linear, RBF)
3. [x] k-nearest neighbors classifier
4. [x] Feed-forward neural network
5. [x] Decision Tree
6. [x] Naive Bayes
Sequential labeling
1. [x] Conditional Random Field
Embedding for representation
1. [x] Word2Vec / Doc2Vec
2. [x] GloVe
3. [x] FastText (word embedding using subword)
4. [x] FastText (supervised word embedding)
5. [x] Sparse Coding
6. [x] Nonnegative Matrix Factorization (NMF) for topic modeling
Embedding for vector visualization
1. [x] MDS, ISOMAP, Locally Linear Embedding, PCA, Kernel PCA
2. [x] t-SNE
3. [ ] t-SNE (detailed)
Keyword / Related words analysis
1. [x] co-occurrence based keyword / related word analysis
Document clustering
1. [x] k-means is good for document clustering
2. [x] DBSCAN, hierarchical, GMM, BGMM are not appropriate for document clustering
Finding similar documents (neighbor search)
1. [x] Random Projection
2. [x] Locality Sensitive Hashing
3. [x] Inverted Index
Graph similarity and ranking (centrality)
1. [x] SimRank & Random Walk with Restart
2. [x] PageRank, HITS, WordRank, TextRank
3. [x] kr-wordrank keyword extraction
String similarity
1. [x] Levenshtein / Cosine / Jaccard distance
Convolutional Neural Network (CNN)
1. [x] Introduction of CNN
2. [x] Word-level CNN for sentence classification (Yoon Kim)
3. [x] Character-level CNN (LeCun)
4. [x] BOW-CNN
Recurrent Neural Network (RNN)
1. [x] Introduction of RNN
2. [x] LSTM, GRU
3. [x] Deep RNN & ELMo
4. [x] Sequence to sequence & seq2seq with attention
5. [x] Skip-thought vector
6. [x] Attention mechanism for sentence classification
7. [x] Hierarchical Attention Network (HAN) for document classification
8. [x] Transformer & BERT
Applications
1. [x] soyspacing: heuristic Korean space correction
2. [x] CRF-based Korean space correction
3. [x] HMM & CRF-based part-of-speech tagger (morphological analyzer)
4. [ ] semantic movie search using IMDB
TBD

Thanks to

Many generous colleagues review this material and discuss it with me. I am especially grateful to Taewook (태욱), who has put a great deal of time and care into helping.

lovit / textmining-tutorial
