Closed. isaaclyman closed this issue 9 months ago.
Revision:
There is no standard definition for "symbols" or "special characters."
This is incorrect. The Unicode character classes \p{P} and \p{S} should capture all punctuation and symbols, respectively.
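For illustration, here is a minimal sketch in JavaScript (the plugin's ecosystem) of using those classes; Unicode property escapes require the regex `u` flag, and the function name is mine, not the plugin's:

```javascript
// \p{P} matches any Unicode punctuation; \p{S} matches any Unicode symbol
// (currency signs, math operators, etc.). The /u flag enables property escapes.
const punctuationOrSymbol = /[\p{P}\p{S}]/gu;

// Strip every punctuation and symbol character, regardless of language.
function stripPunctuationAndSymbols(text) {
  return text.replace(punctuationOrSymbol, "");
}
```

This covers characters like the em dash, CJK punctuation, and currency signs that an ASCII-only class would miss.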
Release v3.4.0 uses a new RegEx-based counter that can handle mixed-language notes without a significant performance hit (averaging about 60ms slower than the old one per 10,000 notes).
I've tried several methods of beating RegEx at its own game but have been handily defeated every time.
It may be possible to eke out more performance by skipping any counts the plugin isn't currently displaying (for example, it's possible to go through files twice as quickly if the non-whitespace character count is skipped). But I expect 99% of vaults will count in less than a second, so I'll save that optimization for later.
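A sketch of what that deferred optimization might look like; the function and option names here are hypothetical, not the plugin's actual API:

```javascript
// Hypothetical sketch: compute each count only if the caller displays it.
function analyze(text, wanted) {
  const result = {};
  if (wanted.words) {
    result.words = (text.match(/\S+/g) || []).length;
  }
  if (wanted.characters) {
    result.characters = text.length;
  }
  if (wanted.nonWhitespaceCharacters) {
    // Per the note above, skipping this count roughly doubled throughput.
    result.nonWhitespaceCharacters = text.replace(/\s+/g, "").length;
  }
  return result;
}
```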
This is a record of work for my own notes.
In commit a1487d3, I built a lightweight character-by-character parser to generate word/character counts without any unnecessary memory use. My intent was for it to be both faster and more flexible than Regular Expression parsing, which is what the plugin has used up to this point.
Here's what I learned:
The \w character class, short for "word", only matches plain [a-zA-Z0-9_]. That is, it won't even reliably match native English words, let alone loan words (like résumé) or words in other Latin-based languages (like Spanish and German). And, of course, languages like Arabic and Russian are excluded.

I've built some tests and benchmarks and am going back to using a series of string.replace, string.match, and string.split calls with RegEx. My goal is to stay under 500ms per 10,000 files / 5,000,000 words on my laptop.
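The \w limitation is easy to demonstrate, and a Unicode-aware count can be built from the replace/split strategy described above. This is an illustrative sketch under my own assumptions, not the plugin's actual code:

```javascript
// \w is equivalent to [a-zA-Z0-9_], so accented letters break words apart:
const brokenMatch = "résumé".match(/\w+/g);
// → ["r", "sum"] — the é characters are not matched by \w.

// Unicode-aware word count: strip punctuation and symbols via \p{P}/\p{S},
// then split on whitespace. Works for Latin, Cyrillic, Arabic, etc.
function countWords(text) {
  const stripped = text.replace(/[\p{P}\p{S}]/gu, "");
  return stripped.split(/\s+/).filter((w) => w.length > 0).length;
}
```

Counting languages that don't delimit words with whitespace (like Chinese or Japanese) needs separate handling, which is part of what makes mixed-language notes hard.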