Basic NLP Knowledge
https://ift.tt/xzUOENZ
Notes from the Stanford NLP course
Preface
Language Technology
Why is NLP difficult?
Basic skills
Edit Distance
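A minimal sketch of the standard dynamic-programming recurrence for minimum edit distance; the function name and the Levenshtein costs (substitution = 2, as in the course's alignment example) are illustrative choices, not from the notes.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum edit distance via dynamic programming."""
    n, m = len(source), len(target)
    # D[i][j] = cheapest way to turn source[:i] into target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + del_cost,                       # delete
                D[i][j - 1] + ins_cost,                       # insert
                D[i - 1][j - 1] + (0 if same else sub_cost),  # copy / substitute
            )
    return D[n][m]

print(edit_distance("intention", "execution"))  # 8 with substitution cost 2
```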
Language Model
Probabilistic Language Models
Markov Assumption
$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i | \omega_{i-k} \dots \omega_{i-1}) $$
Unigram Model
$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i) $$
Bigram Model
$$ P(\omega_i \mid \omega_1 \omega_2 \dots \omega_{i-1}) \approx P(\omega_i \mid \omega_{i-1}) $$
Add-k Smoothing
$$ P_{\text{Add-}k}(\omega_i\mid\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+k}{c(\omega_{i-1})+kV} $$
Unigram prior smoothing
$$ P_{\text{Add-}k}(\omega_i\mid\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+m(\tfrac{1}{V})}{c(\omega_{i-1})+m} \quad\Rightarrow\quad P_{\text{UnigramPrior}}(\omega_i\mid\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+m\,P(\omega_i)}{c(\omega_{i-1})+m} $$
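A small sketch of a bigram model with add-k smoothing, following the formula above; the corpus format and the `<s>`/`</s>` sentence markers are assumptions for illustration.

```python
from collections import Counter

def bigram_addk(corpus_sentences, k=0.5):
    """Return P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + k) / (c(w_{i-1}) + k*V)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    V = len(unigrams)  # vocabulary size, including the sentence markers
    def prob(prev, word):
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)
    return prob

prob = bigram_addk([["i", "like", "nlp"], ["i", "like", "python"]])
print(prob("i", "like"), prob("i", "unseen"))  # seen vs. unseen bigram
```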
Smoothing Algorithm
Spelling Correction
Noisy channel model: score candidate corrections by the error model P(misspelling|word) combined with the prior P(word).
Text Classification
Used for:
Methods: Supervised Machine Learning
Naive Bayes
$$ C_{MAP}=\arg\max_{c\in C}P(x_1,x_2,\dots,x_n\mid c)P(c) $$
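A sketch of multinomial Naive Bayes with add-1 smoothing that computes the $C_{MAP}$ decision above in log space; the function names and document format (lists of tokens) are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes: log-priors and add-1 smoothed log-likelihoods."""
    vocab = {w for doc in docs for w in doc}
    logprior, loglik = {}, defaultdict(dict)
    for c in set(labels):
        docs_c = [d for d, y in zip(docs, labels) if y == c]
        logprior[c] = math.log(len(docs_c) / len(docs))
        counts = Counter(w for d in docs_c for w in d)
        total = sum(counts.values())
        for w in vocab:
            loglik[c][w] = math.log((counts[w] + 1) / (total + len(vocab)))
    return logprior, loglik, vocab

def classify(doc, logprior, loglik, vocab):
    """c_MAP = argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    scores = {c: logprior[c] + sum(loglik[c][w] for w in doc if w in vocab)
              for c in logprior}
    return max(scores, key=scores.get)
```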
F Measure
Precision: % of selected items that are correct
Recall: % of correct items that are selected
$$ F=\tfrac{1}{\alpha \tfrac{1}{P} +(1-\alpha)\tfrac{1}{R}}=\tfrac{(\beta^2+1)PR}{\beta^2P+R} $$
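The same weighted harmonic mean as a small helper, using the $\beta$-weighted form above.

```python
def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

print(f_measure(0.8, 0.4))           # balanced F1
print(f_measure(0.8, 0.4, beta=2))   # recall-weighted F2
```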
Sentiment Analysis
Sentiment Lexicons
Features
Joint and Discriminative
Discriminative models estimate P(c|d) directly.
Features
Maximum Entropy
$$ \log P(C\mid D,\lambda)=\sum_{(c,d)\in (C,D)}\log P(c\mid d,\lambda)=\sum_{(c,d)\in(C,D)}\log \tfrac{\exp \sum_{i} \lambda_i f_i(c,d)}{\sum_{c'} \exp\sum_i \lambda_i f_i(c',d)} $$
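A sketch of the conditional probability inside the log-likelihood above, i.e. a softmax over weighted feature sums; the `weights` and `features` interfaces are assumptions for illustration.

```python
import math

def maxent_prob(weights, features, classes, doc):
    """P(c | d, lambda) = exp(sum_i lambda_i f_i(c, d)) / sum_c' exp(...).
    `features(c, doc)` returns {feature_name: value} for features that fire;
    `weights` maps feature names to lambda_i."""
    def score(c):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(c, doc).items())
    scores = {c: score(c) for c in classes}
    z = sum(math.exp(s) for s in scores.values())  # normalizer over classes
    return {c: math.exp(s) / z for c, s in scores.items()}
```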
Named Entity Recognition (NER)
POS Tagging
Parsing
Probabilistic Parsing
Lexicalized Parsing
Dependency Parsing
Information Retrieval
Classic search
Initial stages of text processing
Query processing
Ranked Retrieval
tf-idf weighting
$$ W_{t,d}=(1+\log_{10} \mathrm{tf}_{t,d})\times \log_{10}(N/\mathrm{df}_t) $$
Distance: cosine(query, document)
$$ \cos(\vec q,\vec d)=\tfrac{\vec q \cdot \vec d}{|\vec q||\vec d|}=\tfrac{\vec q}{|\vec q|}\cdot \tfrac{\vec d}{|\vec d|}=\tfrac{\sum^{|V|}_{i=1}q_id_i}{\sqrt{\sum^{|V|}_{i=1}q_i^2}\sqrt{\sum^{|V|}_{i=1}d^2_i}} $$
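A compact sketch combining the tf-idf weighting and cosine formulas above; representing vectors as sparse {term: weight} dictionaries is an implementation choice, not from the notes.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, N):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t) for terms in the doc.
    `df` maps term -> document frequency over a collection of N documents."""
    tf = Counter(doc_tokens)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t])
            for t, c in tf.items() if df.get(t)}

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0
```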
Weighting
Evaluation
Semantic
Situation
Applications of Thesauri and Ontologies
Word Similarity
Thesaurus-based similarity
$LCS(c_1,c_2)=$ The most informative (lowest) node in the hierarchy subsuming both $c_1$ and $c_2$
$$ Sim_{path}(c_1,c_2)=\tfrac{1}{pathlen(c_1,c_2)} $$
$$ Sim_{resnik}(c_1,c_2)=-\log P(LCS(c_1,c_2)) $$
$$ Sim_{lin}(c_1,c_2)=\tfrac{2\log P(LCS(c_1,c_2))}{\log P(c_1)+\log P(c_2)} $$
$$ Sim_{jiangconrath}(c_1,c_2)=\tfrac{1}{\log P(c_1)+\log P(c_2)-2\log P(LCS(c_1,c_2))} $$
$$ Sim_{eLesk}(c_1,c_2)=\sum_{r,q\in RELS}overlap(gloss(r(c_1)),gloss(q(c_2))) $$
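A few of the thesaurus-based measures above as one-liners, assuming the hierarchy probabilities P(c) and path lengths have already been computed from a corpus and thesaurus.

```python
import math

def sim_path(pathlen):
    """1 / pathlen(c1, c2), the shortest path between the two concepts."""
    return 1.0 / pathlen

def sim_resnik(p_lcs):
    """-log P(LCS(c1, c2)): information content of the lowest common subsumer."""
    return -math.log(p_lcs)

def sim_lin(p_c1, p_c2, p_lcs):
    """2 log P(LCS) / (log P(c1) + log P(c2))."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))

# e.g. with corpus probabilities P(c1)=0.01, P(c2)=0.02, P(LCS)=0.2:
print(sim_lin(0.01, 0.02, 0.2))
```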
Distributional models of meaning
$$ PMI(w_1,w_2)=\log_2\tfrac{P(w_1,w_2)}{P(w_1)P(w_2)} $$
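PMI computed directly from the joint and marginal probabilities in the formula above.

```python
import math

def pmi(p_w1w2, p_w1, p_w2):
    """Pointwise mutual information: log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    return math.log2(p_w1w2 / (p_w1 * p_w2))

# Words that co-occur more often than chance get positive PMI:
print(pmi(0.001, 0.01, 0.02))  # log2(0.001 / 0.0002) ≈ 2.32
```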
Question Answering
Approaches
Answer type taxonomy
Keyword selection
Passage Retrieval
Features for ranking candidate answers
Common Evaluation Metrics
$$ MRR = \tfrac{\sum_{i=1}^N \tfrac{1}{rank_i}}{N} $$
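MRR over the rank of the first correct answer per question; using `None` when no correct answer is returned is an assumption for illustration.

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/N) * sum_i 1/rank_i, where rank_i is the position of the
    first correct answer for question i (None if none was found)."""
    rr = [1.0 / r if r else 0.0 for r in ranks]
    return sum(rr) / len(rr)

print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4
```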
Summarization
$$ weight(w_i)= \begin{cases} 1, & \text{if } -2\log \lambda(w_i)>10 \\ 0, & \text{otherwise} \end{cases} $$
$$ ROUGE\text{-}2=\tfrac{\sum_{S\in \{RefSummaries\}}\sum_{bigrams\ i\in S}\min(count(i,X),count(i,S))}{\sum_{S\in\{RefSummaries\}}\sum_{bigrams\ i\in S}count(i,S)} $$
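A sketch of ROUGE-2 recall as defined above, with X the candidate summary and S ranging over the reference summaries (both given as token lists for illustration).

```python
from collections import Counter

def rouge_2(candidate, references):
    """Clipped bigram overlap between the candidate summary X and the
    reference summaries, divided by the total reference bigram count."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    cand = bigrams(candidate)
    overlap = total = 0
    for ref in references:
        ref_bg = bigrams(ref)
        overlap += sum(min(cand[bg], c) for bg, c in ref_bg.items())
        total += sum(ref_bg.values())
    return overlap / total if total else 0.0
```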