inpho / vsm

Vector Space Model Framework developed for InPhO
http://inpho.github.io/vsm

Chinese corpus model topics showing English words only #137

Closed by colinallen 6 years ago

colinallen commented 8 years ago

Models trained on Xiaohong's papers are picking up only English words, even though the corpus is Chinese.

In [3]:

# print the most frequent terms in the document
tf_v.coll_freqs()
Out[3]:
Collection Frequencies
Word         Counts   Word          Counts
science      70       simon         24
discovery    43       information   21
studies      33       computer      20
social       33       langley       20
vol          29       bacon         19
press        29       oxford        19
scientific   28       artificial    18
falun        27       for           18
philosophy   26       cambridge     18
university   25       ibid          18

JaimieMurdock commented 6 years ago

Fixed. The cause was bad tokenization.
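
The actual fix is not shown in this thread. As a rough illustration of how tokenization alone can produce the symptom reported above, here is a minimal sketch. It is not the vsm tokenizer: `ascii_tokenize`, `cjk_aware_tokenize`, and the sample string are hypothetical names and data made up for this example, and the ASCII-only regex is an assumed stand-in for an English-oriented tokenizer.

```python
import re
from collections import Counter


def ascii_tokenize(text):
    """Hypothetical English-only tokenizer: keeps runs of ASCII letters.

    Any Chinese characters fall outside [a-z] and are silently dropped.
    """
    return re.findall(r"[a-z]+", text.lower())


def cjk_aware_tokenize(text):
    """Hypothetical alternative: keeps ASCII words and also emits each
    character in the CJK Unified Ideographs block as its own token.
    """
    return re.findall(r"[a-z]+|[\u4e00-\u9fff]", text.lower())


# Made-up mixed-language sample, not taken from the corpus in this issue.
sample = "科学发现 scientific discovery 哲学 philosophy"

ascii_counts = Counter(ascii_tokenize(sample))
cjk_counts = Counter(cjk_aware_tokenize(sample))

print(ascii_counts.most_common())  # only English words survive
print(cjk_counts.most_common())    # Chinese characters are counted too
```

Splitting into single characters is only a crude placeholder. A real Chinese corpus would normally go through a dedicated word segmenter before the collection frequencies are computed.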