lovit / textmining-tutorial

(Korean) Study materials for text mining

(Korean) A tutorial for text mining

These are study materials for text mining. They include natural language processing and machine learning resources that apply regardless of language, as well as materials specific to Korean text analysis.

Contents

  1. Python basic
    1. jupyter tutorial
  2. From text to vector (KoNLPy)
    1. [x] n-gram
    2. [x] from text to vector using KoNLPy
  3. Word extraction and tokenization (Korean)
    1. [x] word extractor
    2. [x] unsupervised tokenizer
    3. [x] noun extractor
    4. [x] dictionary based pos tagger
  4. Document classification
    1. [x] Logistic Regression and Lasso regression
    2. [x] SVM (linear, RBF)
    3. [x] k-nearest neighbors classifier
    4. [x] Feed-forward neural network
    5. [x] Decision Tree
    6. [x] Naive Bayes
  5. Sequential labeling
    1. [x] Conditional Random Field
  6. Embedding for representation
    1. [x] Word2Vec / Doc2Vec
    2. [x] GloVe
    3. [x] FastText (word embedding using subword)
    4. [x] FastText (supervised word embedding)
    5. [x] Sparse Coding
    6. [x] Nonnegative Matrix Factorization (NMF) for topic modeling
  7. Embedding for vector visualization
    1. [x] MDS, ISOMAP, Locally Linear Embedding, PCA, Kernel PCA
    2. [x] t-SNE
    3. [ ] t-SNE (detailed)
  8. Keyword / Related words analysis
    1. [x] co-occurrence based keyword / related word analysis
  9. Document clustering
    1. [x] k-means is good for document clustering
    2. [x] DBSCAN, hierarchical, GMM, BGMM are not appropriate for document clustering
  10. Finding similar documents (neighbor search)
    1. [x] Random Projection
    2. [x] Locality Sensitive Hashing
    3. [x] Inverted Index
  11. Graph similarity and ranking (centrality)
    1. [x] SimRank & Random Walk with Restart
    2. [x] PageRank, HITS, WordRank, TextRank
    3. [x] kr-wordrank keyword extraction
  12. String similarity
    1. [x] Levenshtein / Cosine / Jaccard distance
  13. Convolutional Neural Network (CNN)
    1. [x] Introduction to CNN
    2. [x] Word-level CNN for sentence classification (Yoon Kim)
    3. [x] Character-level CNN (LeCun)
    4. [x] BOW-CNN
  14. Recurrent Neural Network (RNN)
    1. [x] Introduction to RNN
    2. [x] LSTM, GRU
    3. [x] Deep RNN & ELMo
    4. [x] Sequence to sequence & seq2seq with attention
    5. [x] Skip-thought vector
    6. [x] Attention mechanism for sentence classification
    7. [x] Hierarchical Attention Network (HAN) for document classification
    8. [x] Transformer & BERT
  15. Applications
    1. [x] soyspacing: heuristic Korean space correction
    2. [x] CRF-based Korean space correction
    3. [x] HMM & CRF-based part-of-speech tagger (morphological analyzer)
    4. [ ] semantic movie search using IMDB
  16. TBD
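The checklist above points to notebooks rather than prose, but the core idea behind the "from text to vector" and string/document-similarity items fits in a few lines. Below is a minimal sketch with hypothetical sample sentences, using plain character bigrams in place of the KoNLPy-based pipeline covered in the tutorial; character n-grams are a common fallback for Korean, where whitespace does not reliably delimit morphemes.

```python
from collections import Counter
from math import sqrt

def ngram_counts(text, n=2):
    """Count character n-grams; whitespace is kept so word boundaries matter."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus, not taken from the tutorial notebooks.
docs = [
    "text mining tutorial with python",
    "python tutorial for text mining",
    "graph ranking with pagerank",
]
vecs = [ngram_counts(d) for d in docs]

# The two paraphrased sentences share far more bigrams with each other
# than either shares with the unrelated third sentence.
assert cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```

The same counting-then-comparing pattern underlies the TF-IDF, keyword co-occurrence, and clustering sections; the notebooks replace raw counts with weighted, tokenized features.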

Thanks to

I am grateful to the many colleagues who review these materials and discuss them with me. Special thanks to Taewook, who has devoted so much time and care to helping.