Tried a fix in [issue_55_fix_chinese_stats](https://github.com/jzohrab/lute/tree/issue_55_fix_chinese_stats)
but the code is way too slow for real prod usage. There may be a far better way to do this, not sure what just yet though.
Tried another approach that didn't rely on a full render calculation; it also failed spectacularly with a timeout. The new class TokenCoverage is in the same branch. Messy code too, which is great.
Tried yet another method using regex matches; it still comes nowhere near finishing before the 30s timeout, so it's way too slow for prod.
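For the record, the regex idea was roughly the following (a hypothetical Python sketch, not the actual branch code; the sample text and terms are made up):

```python
import re

def coverage_pct(text, known_terms):
    """Rough coverage estimate: the fraction of the text's characters
    that fall inside at least one known term (longest terms first)."""
    covered = [False] * len(text)
    # Longest terms first so longer matches win overlapping regions.
    for term in sorted(known_terms, key=len, reverse=True):
        for m in re.finditer(re.escape(term), text):
            for i in range(m.start(), m.end()):
                covered[i] = True
    return 100.0 * sum(covered) / max(len(text), 1)

print(coverage_pct("这是一个测试", {"这是", "一个", "测试"}))  # 100.0
```

Scanning the full text once per term makes this roughly O(terms x text length), which hints at why it blows past a 30-second timeout on book-sized texts.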
Have sunk several hours into this, because a) it's interesting, and b) if I could figure this out, I'd be able to drop the TextTokens table, which takes up a lot of space. Currently I'm really only using the TextTokens table for calculating stats ... actually, the current methods would probably suffice for calculating stats, so I may revisit dropping the table for that reason.
Regardless, I'm still not sure how to calculate coverage accurately for Chinese. The first method (doing a fake render) seemed to be the best: still slow-ish, but there may be some good optimizations possible in the rendering calculations, which feel overcomplicated.
Returned to the first method (effectively rendering each page in code) and found some good simplifications to the renderable calculator class, but it's still not good enough. For a book of ~100K Spanish words, the stats calc takes ~20s on my Mac, which isn't usable.
Still, some good code optimizations came out of this; they're pushed to the branch and can be pulled into the develop branch. Will handle that separately. Leaving this issue open.
Reducing the calc size makes it workable. Merged into the develop branch, added a wiki FAQ page about it (https://github.com/jzohrab/lute/wiki/Stats-calculation), and will include it in the next release. Phew.
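A minimal sketch of the "reduce the calc size" idea, assuming the reduction is simply capping how much of the book's text the stats calc looks at (the names and the cap below are made up for illustration; see the wiki page above for what actually landed):

```python
def text_sample_for_stats(pages, max_chars=25_000):
    """Take pages from the start of the book until a size cap is hit,
    and run the (slow) coverage calculation on that sample only."""
    sample, total = [], 0
    for page in pages:
        sample.append(page)
        total += len(page)
        if total >= max_chars:
            break
    return "".join(sample)

# e.g. coverage_pct(text_sample_for_stats(book_pages), known_terms)
# instead of running the calculation over every page in the book.
```

The trade-off is that the index-page percentages become an estimate rather than an exact count, which seems acceptable for a progress indicator.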
Stats are currently calculated in a way that doesn't work for character-based languages such as Chinese. For example, take this single-page text, with completely garbage terms created:
Even though the terms are trash, they cover 100% of the text, so you'd expect the % to be pretty high ... but the index page shows 0% known:
Obviously not right.
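To make the failure mode concrete, here's a simplified, hypothetical Python illustration (not Lute's actual code): if stats only count tokens that exactly equal a saved term, and the Chinese parser emits one token per character, the multi-character terms never match anything.

```python
text = "这是一个测试"
terms = {"这是", "一个", "测试"}          # garbage terms covering the whole text

# One token per character, so no token ever equals a saved term:
tokens = list(text)                       # ['这', '是', '一', '个', '测', '试']
known = sum(t in terms for t in tokens)
print(f"{100 * known // len(tokens)}% known")   # 0% known, despite 100% coverage
```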