google-research / uda

Unsupervised Data Augmentation (UDA)
https://arxiv.org/abs/1904.12848
Apache License 2.0

Global word frequency calculation #121

Open ClaudiaShu opened 1 year ago

ClaudiaShu commented 1 year ago

Hi, I have a question about computing the replacement S score.

In your paper, the score is defined as $S(w) = \mathrm{freq}(w)\,\mathrm{IDF}(w)$. In the code, however, it is computed by summing the term's TF-IDF value over every document, as shown below. The corpus-level $\mathrm{freq}(w)$ is not the sum of the word's per-document frequencies, because each per-document frequency is normalized by that document's length. Moreover, the IDF of a term is constant across the corpus, since the number of documents containing $w$ and the total number of documents are both fixed, so it should not need to be re-applied per document.

# Compute TF-IDF
tf_idf = {}
for i in range(len(examples)):
  cur_word_dict = {}
  cur_sent = copy.deepcopy(examples[i].word_list_a)
  if examples[i].text_b:
    cur_sent += examples[i].word_list_b
  for word in cur_sent:
    if word not in tf_idf:
      tf_idf[word] = 0
    tf_idf[word] += 1. / len(cur_sent) * idf[word]
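
To make the discrepancy concrete, here is a minimal sketch on a toy corpus that compares the two quantities. It is not the repository's code. The helper names (`idf_scores`, `summed_tf_idf`, `corpus_freq_idf`), the toy documents, the definition $\mathrm{IDF}(w) = \log(N / \mathrm{df}(w))$, and the reading of $\mathrm{freq}(w)$ as the raw corpus count are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_scores(docs):
    """IDF(w) = log(N / df(w)); an assumed, standard definition."""
    n_docs = len(docs)
    # Count each word at most once per document.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def summed_tf_idf(docs, idf):
    """Sum of per-document TF-IDF, mirroring the loop quoted above."""
    scores = Counter()
    for doc in docs:
        for word in doc:
            # Each occurrence adds (1 / document length) * IDF(w).
            scores[word] += 1.0 / len(doc) * idf[word]
    return dict(scores)


def corpus_freq_idf(docs, idf):
    """S(w) = freq(w) * IDF(w), with freq(w) taken as the raw corpus count
    (an assumption about how the paper's formula should be read)."""
    counts = Counter(word for doc in docs for word in doc)
    return {w: counts[w] * idf[w] for w in counts}


# Toy corpus with documents of different lengths (hypothetical data).
docs = [
    ["good", "movie"],
    ["good", "good", "plot", "twist"],
    ["bad", "movie", "plot", "hole", "ending", "bad"],
]

idf = idf_scores(docs)
summed = summed_tf_idf(docs, idf)
corpus = corpus_freq_idf(docs, idf)

for word in sorted(idf):
    print(f"{word:>7}  summed={summed[word]:.4f}  corpus={corpus[word]:.4f}")
```

Because the IDF is constant per word, the summed version factors into $\mathrm{IDF}(w) \cdot \sum_d \mathrm{tf}_d(w)$, where $\mathrm{tf}_d(w)$ is the length-normalized frequency in document $d$. In this sketch that sum differs from the raw corpus count whenever a word occurs in documents longer than one token, so the two scores can rank words differently when document lengths vary.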