Closed. isaaclyman closed this issue 9 months ago.
Revision:
There is no standard definition for "symbols" or "special characters."
This is incorrect. The Unicode character classes \p{P} and \p{S} should capture all punctuation and symbols, respectively.
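For illustration, here is a minimal sketch in JavaScript (the plugin's ecosystem) of using those classes; Unicode property escapes require the regex `u` flag, and the function name is mine, not the plugin's:

```javascript
// \p{P} matches any Unicode punctuation; \p{S} matches any Unicode symbol
// (currency signs, math operators, etc.). The /u flag enables property escapes.
const punctuationOrSymbol = /[\p{P}\p{S}]/gu;

// Strip every punctuation and symbol character, regardless of language.
function stripPunctuationAndSymbols(text) {
  return text.replace(punctuationOrSymbol, "");
}
```

This covers characters like the em dash, CJK punctuation, and currency signs that an ASCII-only class would miss.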
Release v3.4.0 uses a new RegEx-based counter that can handle mixed-language notes without a significant performance hit (averaging about 60ms slower than the old one per 10,000 notes).
I've tried several methods of beating RegEx at its own game but have been handily defeated every time.
It may be possible to eke out more performance by skipping any counts the plugin isn't currently displaying (for example, it's possible to go through files twice as quickly if the non-whitespace character count is skipped). But I expect 99% of vaults will count in less than a second, so I'll save that optimization for later.
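A sketch of what that deferred optimization might look like; the function and option names here are hypothetical, not the plugin's actual API:

```javascript
// Hypothetical sketch: compute each count only if the caller displays it.
function analyze(text, wanted) {
  const result = {};
  if (wanted.words) {
    result.words = (text.match(/\S+/g) || []).length;
  }
  if (wanted.characters) {
    result.characters = text.length;
  }
  if (wanted.nonWhitespaceCharacters) {
    // Per the note above, skipping this count roughly doubled throughput.
    result.nonWhitespaceCharacters = text.replace(/\s+/g, "").length;
  }
  return result;
}
```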
This is a record of work for my own notes.
In commit a1487d3, I built a lightweight character-by-character parser to generate word/character counts without any unnecessary memory use. My intent was for it to be both faster and more flexible than Regular Expression parsing, which is what the plugin has used up to this point.
Here's what I learned:
The \w character class, short for "word", only matches plain [a-zA-Z0-9_]. That is, it won't even reliably match native English words, let alone loan words (like résumé) or words in other Latin-based languages (like Spanish and German). And, of course, languages like Arabic and Russian are excluded.

I've built some tests and benchmarks and am going back to using a series of string.replace, string.match, and string.split calls with RegEx. My goal is to stay under 500ms per 10,000 files / 5,000,000 words on my laptop.
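The \w limitation is easy to demonstrate, and a Unicode-aware count can be built from the replace/split strategy described above. This is an illustrative sketch under my own assumptions, not the plugin's actual code:

```javascript
// \w is equivalent to [a-zA-Z0-9_], so accented letters break words apart:
const brokenMatch = "résumé".match(/\w+/g);
// → ["r", "sum"] — the é characters are not matched by \w.

// Unicode-aware word count: strip punctuation and symbols via \p{P}/\p{S},
// then split on whitespace. Works for Latin, Cyrillic, Arabic, etc.
function countWords(text) {
  const stripped = text.replace(/[\p{P}\p{S}]/gu, "");
  return stripped.split(/\s+/).filter((w) => w.length > 0).length;
}
```

Counting languages that don't delimit words with whitespace (like Chinese or Japanese) needs separate handling, which is part of what makes mixed-language notes hard.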