Stats wrong for Chinese text - Githubissues

jzohrab / lute

DEPRECATED: LUTE (Learning Using Texts) is a self-hosted web app for learning language through reading, based on Learning with Texts (LWT)

The Unlicense

Stats wrong for Chinese text #55

Closed jzohrab closed 1 year ago

jzohrab commented 1 year ago

Stats are currently calculated in a way that doesn't work for character-based languages, such as Chinese. For example, take this single-page text, with completely garbage terms created:

Even though the terms are trash, they cover 100% of the text, so you'd expect the % to be pretty high ... but the index page shows 0% known:

Obviously not right.
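To make the mismatch concrete, here is a minimal sketch and not Lute's actual code. It assumes "coverage" means the share of characters that fall inside some defined term. The function names, the sample sentence, and the idea that a per-token stat only counts exact token-to-term matches are all assumptions for illustration; the thread doesn't say how the existing stats are computed.

```python
def covered_fraction(text: str, terms: list[str]) -> float:
    """Fraction of characters in `text` that lie inside at least one term match."""
    if not text:
        return 0.0
    covered = [False] * len(text)
    for term in terms:
        if not term:
            continue
        start = text.find(term)
        while start != -1:
            for i in range(start, start + len(term)):
                covered[i] = True
            # Step by one so overlapping occurrences are also found.
            start = text.find(term, start + 1)
    return sum(covered) / len(text)


def exact_token_fraction(tokens: list[str], terms: list[str]) -> float:
    """Hypothetical per-token stat: a token is 'known' only if it equals a term."""
    if not tokens:
        return 0.0
    known = set(terms)
    return sum(tok in known for tok in tokens) / len(tokens)


text = "我喜欢学习中文"
terms = ["我喜", "欢学习", "中文"]  # deliberately nonsensical splits

# Treating each character as its own token, no token equals a multi-character term.
print(exact_token_fraction(list(text), terms))  # 0.0
# Measured by characters covered, the same terms span the whole text.
print(covered_fraction(text, terms))  # 1.0
```

Under these assumptions, the same text and terms give 0% by exact token match and 100% by coverage, which is the gap described above.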

jzohrab commented 1 year ago

Tried a fix in [issue_55_fix_chinese_stats](https://github.com/jzohrab/lute/tree/issue_55_fix_chinese_stats) but the code is way too slow for real prod usage. There may be a far better way to do this, not sure what just yet though.

jzohrab commented 1 year ago

Tried another way that didn't rely on a full render calculation; it also failed spectacularly with a timeout. The new class TokenCoverage is in the same branch. Messy code too, which is great.

jzohrab commented 1 year ago

Tried yet another method using regex matches. It still comes nowhere near finishing before the 30s timeout, so it's way too slow for prod.

Have sunk several hours into this, because a) it's interesting, and b) if I could figure this out, I'd be able to drop the TextTokens table, which takes up a lot of space. Currently, I'm really only using the TextTokens table for calculating stats -- ... actually, the current methods would probably suffice for calculating stats, so I may revisit this idea for that.

Regardless, I'm still not sure how to calculate coverage accurately for Chinese at the moment. The first method used (do a fake render) seemed to be the best -- still slow-ish, but maybe there are some good optimizations possible in the rendering calculations, which feel overcomplicated.
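For reference, a regex-based coverage calculation like the one mentioned in this comment might look roughly like the sketch below. This is not the code in the branch; the function name `regex_coverage` and the longest-term-first alternation are assumptions for illustration.

```python
import re


def regex_coverage(text: str, terms: list[str]) -> float:
    """Fraction of characters covered by non-overlapping term matches."""
    terms = [t for t in terms if t]
    if not text or not terms:
        return 0.0
    # Sort longest first: Python's alternation takes the first alternative
    # that matches at a position, so longer terms win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    covered = sum(m.end() - m.start() for m in pattern.finditer(text))
    return covered / len(text)


print(regex_coverage("我喜欢学习中文", ["我喜", "欢学习", "中文"]))  # 1.0
```

A single alternation over every term has to be built and scanned for each text, and that cost grows with the size of the term list. That would be consistent with the timeouts described here, though the thread doesn't say where the time actually went.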

jzohrab commented 1 year ago

Returned to the first method (effectively rendering each page in code) and found some good simplifications to the renderable calculator class, but it's still not good enough. For a book of ~100K Spanish words, the stats calc takes ~20s on my Mac, which isn't usable.

Still, I found some good code optimizations; they're pushed to the branch and can be pulled into the develop branch. Will handle that separately. Leaving this issue open.

jzohrab commented 1 year ago

Reducing the calc size makes it workable. Merged into the dev branch, added a wiki FAQ page about it -- https://github.com/jzohrab/lute/wiki/Stats-calculation -- and will include it in the next launch. Phew.
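The comment doesn't spell out what "reducing the calc size" involves; the linked wiki page is the place to check. One possible reading, shown here purely as an assumption, is to compute the stat over a bounded sample of the text instead of the whole book. The names `sampled_stat`, `known_char_fraction`, and `max_chars` are hypothetical.

```python
from typing import Callable


def known_char_fraction(text: str, known: set[str]) -> float:
    """Toy stat for the demo: share of characters that are in `known`."""
    if not text:
        return 0.0
    return sum(ch in known for ch in text) / len(text)


def sampled_stat(text: str, stat: Callable[[str], float], max_chars: int = 5000) -> float:
    """Apply `stat` to at most the first `max_chars` characters of `text`.

    Assumption: a prefix sample stands in for the whole text. This bounds the
    work per book, at the cost of the result being only an estimate.
    """
    return stat(text[:max_chars])


book = "ab" * 50_000  # stand-in for a long text
estimate = sampled_stat(book, lambda s: known_char_fraction(s, {"a"}), max_chars=1000)
print(estimate)  # 0.5
```

The trade-off in this sketch is that the figure is only as representative as the sample, so a book whose later pages differ a lot from its opening would be misreported.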