neelsmith / CitableCorpusAnalysis.jl

Work with multiple models of a text corpus.
https://neelsmith.github.io/CitableCorpusAnalysis.jl/stable/
GNU General Public License v3.0

Underlying bug in `TextAnalysis` in `preparecorpus`? #38

Open neelsmith opened 1 year ago

neelsmith commented 1 year ago

Generates a fatal error on Greek text with multi-byte encoding. Is this due to blind use of byte indexing rather than Unicode-aware processing?
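To make the suspected cause concrete, here is a minimal sketch in plain Julia (no `TextAnalysis` calls) of how byte indexing breaks on Greek text while character-aware processing does not. The sample string and variable names are illustrative only, and this shows the general failure mode, not the confirmed code path inside `TextAnalysis`.

```julia
# Sample Greek text; each Greek letter takes 2 or 3 bytes in UTF-8.
s = "μῆνιν ἄειδε, θεά"

nchars = length(s)      # number of characters
nbytes = ncodeunits(s)  # number of bytes (code units); larger than nchars here

# Byte index 2 falls inside the 2-byte encoding of 'μ', so indexing throws.
byte_index_fails = try
    s[2]
    false
catch err
    err isa StringIndexError
end

# Unicode-aware alternatives: iterate over valid indices or over characters.
valid_indices = collect(eachindex(s))  # only indices that start a character
stripped = filter(!ispunct, s)         # drops punctuation character by character
```

Any loop of the form `for i in 1:ncodeunits(s)` that indexes `s[i]` directly will hit the same `StringIndexError` as soon as the text contains a multi-byte character.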

neelsmith commented 1 year ago

Maybe just wrap this in a `try` and don't sweat it if it fails? What are the consequences of not stripping punctuation?

neelsmith commented 1 year ago

See this: https://discourse.julialang.org/t/stringdocument-of-the-textanalysis-package/62589

neelsmith commented 1 year ago

For the 0.7 release, wrapping in a `try` and issuing a warning. You can always preprocess the `CitableTextCorpus` rather than relying on `TextAnalysis` to remove punctuation.
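A rough sketch of the try-and-warn approach, under stated assumptions: `try_strip` and `naive_strip` are hypothetical names invented for this illustration and are not part of `CitableCorpusAnalysis` or `TextAnalysis`. `naive_strip` is only a stand-in for a punctuation stripper that indexes by byte; the actual 0.7 code may look quite different.

```julia
# Hypothetical helper: apply `stripper` to `text`; on any error, warn and
# return the text unchanged instead of aborting the whole pipeline.
function try_strip(stripper, text::AbstractString)
    try
        stripper(text)
    catch err
        @warn "Punctuation stripping failed; keeping text unchanged" exception = err
        text
    end
end

# Stand-in for a stripper that wrongly indexes by byte position.
# It works on ASCII but throws StringIndexError on multi-byte text.
naive_strip(s) = join(s[i] for i in 1:ncodeunits(s) if !ispunct(s[i]))

ascii_result = try_strip(naive_strip, "abc, def")  # stripping succeeds
greek = "θεά, ἄειδε"
greek_result = try_strip(naive_strip, greek)       # warns, returns input as is
```

The trade-off is that a failed strip leaves punctuation in the text, so punctuation marks can end up counted as tokens downstream. Preprocessing the text yourself with a character-aware filter such as `filter(!ispunct, text)` avoids relying on the fallback.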

neelsmith commented 1 year ago

See also #49 and #50, in preparation for a release supporting reproducible topic modeling.