inpho / vsm

Vector Space Model Framework developed for InPhO
http://inpho.github.io/vsm

unicode flag #123

Closed colinallen closed 6 years ago

colinallen commented 8 years ago

It's easy to forget the --unicode flag when running vsm init.

Suggestion: if the user runs init without that flag and no words are found, either (a) suggest the flag to the user, or (b) rerun init automatically in unicode mode.

JaimieMurdock commented 8 years ago

In May I closed inpho/vsm#113 by adding support for encoding='detect' in the corpusbuilder function signatures. This used chardet, which was anemic: when ingesting a large corpus (such as HT1315, with over 650,000 files), it could parse only 3 files a second.

I decided to search for "chardet speed" and found that beautifulsoup4 had run into this issue as well. They fixed it by dynamically loading cchardet, which has 3000x the performance of the native Python version.
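For readers unfamiliar with the pattern, here is a minimal sketch of dynamic loading, not the beautifulsoup4 or vsm code. The function name `load_detector` is hypothetical; the only library facts assumed are that both cchardet and chardet expose `detect(bytes)`, which returns a dict containing an `encoding` key.

```python
def load_detector():
    """Return a detect(bytes) -> dict callable, or None if no detector is installed.

    Prefers the C-backed cchardet and falls back to the pure-Python chardet.
    """
    try:
        import cchardet  # optional dependency, fast C implementation
        return cchardet.detect
    except ImportError:
        pass
    try:
        import chardet  # optional dependency, pure Python
        return chardet.detect
    except ImportError:
        return None  # caller must handle the "no detector available" case


detect = load_detector()
if detect is not None:
    # Both libraries report their guess under the "encoding" key.
    print(detect("naïve café".encode("utf-8")).get("encoding"))
```

Because the import happens at call time, neither package becomes a hard requirement.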

To resolve this, I will add a cchardet fallback to the topic-explorer that runs when no words are found, so we can auto-detect the encoding and see if that fixes the issue.
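A rough sketch of what such a fallback could look like, under stated assumptions: `tokenize` and `tokenize_with_fallback` are hypothetical names rather than topic-explorer functions, the word pattern is a plain `\w+`, and `detect` is any callable with the chardet-style `detect(bytes) -> dict` shape.

```python
import re

WORD = re.compile(r"\w+")  # in Python 3, \w also matches CJK characters


def tokenize(raw, encoding):
    """Decode bytes, silently dropping undecodable bytes, and split into words."""
    return WORD.findall(raw.decode(encoding, errors="ignore"))


def tokenize_with_fallback(raw, default="ascii", detect=None, candidates=("utf-8",)):
    """Try the default encoding first; retry with other encodings only if no words are found."""
    words = tokenize(raw, default)
    if words:
        return words, default

    guesses = []
    if detect is not None:
        guess = detect(raw).get("encoding")  # chardet-style result dict
        if guess:
            guesses.append(guess)
    guesses.extend(candidates)

    for enc in guesses:
        try:
            words = tokenize(raw, enc)
        except LookupError:  # encoding name unknown to Python
            continue
        if words:
            return words, enc
    return [], default


words, used = tokenize_with_fallback("中文 文本".encode("utf-8"))
print(used, words)
```

Note that this trigger fires only when the first pass yields zero words. A document that mixes English and Chinese would still produce English tokens on the first pass and never reach the retry.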

erickpeirson commented 8 years ago

@JaimieMurdock That's good info. Lurking vsm pays off!

colinallen commented 8 years ago

Testing the new version (1.0b39-0-g16fbed7), and I'm not sure whether this is a new issue or a continuation of the current one, but now, with the XWPapers corpus, running init with or without the --unicode flag produces a non-zero word count but appears to find only the English words in the documents and ignores all the Chinese characters.

e.g.

Enter the maximum word occurence rate: 5
Filter will remove 3085 occurrences of these 149 words:
of the 1 and p 2 a bacon science in 3 discovery 4 j m c6 5 social studies 6 8 press vol falun scientific 19 pp philosophy to s 12 h simon university c 7 floridi 14 information ai 21 t 11 langley computer 10 9 oxford cambridge ibid for st 17 artificial 18 r on li 15 d 13 intelligence 16 20 hongzhi w e denis 22 l bynum zhuan publishing 0 mar ed research ts slezak at logic nature gong technology grim b company luciano cognitive 27 machine 23 qigong by mit k ma eds 26 25 journal is theory human bijker 29 as 24 http moor mind f bradshaw edition china chen wang g.l g foundation ethics evolution nickles new n buddhist problem available yijun xingqiao buddha universe york translation third giere thagard sts taibei system computers shudian i metaphor newyork 28 second school blackwell

JaimieMurdock commented 8 years ago

My hypothesis: the corpusbuilders module is not disabling unidecode when opening unicode codec files. Will investigate tonight.

JaimieMurdock commented 8 years ago

Investigated the unidecode disable. Found that it does not work for file_corpus objects, but the XWPapers corpus is a dir_corpus, which does support it. New guess: either the BaseCorpus object is filtering out unicode, or unicode is failing silently when printing to the console.
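One way to tell those two guesses apart is to inspect the vocabulary without relying on the console's ability to render it. This is only a sketch: `non_ascii_words` is a hypothetical helper, and the sample list stands in for whatever word list the corpus object actually exposes.

```python
def non_ascii_words(words):
    """Return the words that contain at least one non-ASCII character."""
    return [w for w in words if any(ord(ch) > 127 for ch in w)]


# Stand-in vocabulary; in practice this would come from the corpus being debugged.
vocabulary = ["bacon", "science", "科学", "discovery"]

found = non_ascii_words(vocabulary)
print(len(found), "non-ASCII words")
for w in found:
    # ascii() escapes non-ASCII characters, so the output is safe on any console.
    print(ascii(w))
```

If the count is zero, the characters were lost before or inside the corpus object. If it is non-zero, the problem is more likely on the display side.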

JaimieMurdock commented 6 years ago

While it may not resolve the XWPapers issue, we have made --unicode the default behavior, with character removal now handled by the --decode flag.