isaaclyman / novel-word-count-obsidian

Obsidian plugin. Displays a word count or other statistic for each file, folder and vault in the File Explorer pane.
https://obsidian.md/plugins?id=novel-word-count
MIT License

Try using a manual parser to apply more complex logic and get combined CJK/space-delimited counts #79

Closed. isaaclyman closed this issue 7 months ago.

isaaclyman commented 7 months ago

This is a record of work for my own notes.

In commit a1487d3, I built a lightweight character-by-character parser to generate word/character counts without any unnecessary memory use. My intent was for it to be both faster and more flexible than Regular Expression parsing, which is what the plugin has used up to this point.
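For readers unfamiliar with the approach, here is a minimal sketch of what a character-by-character counter can look like. It is not the code from commit a1487d3; the names `Counts`, `isWhitespace`, and `countByScanning` are hypothetical, and the whitespace list is a deliberately small assumption.

```typescript
// Hypothetical sketch of a single-pass counter; not the parser from commit a1487d3.
interface Counts {
  words: number;
  characters: number;
  nonWhitespaceCharacters: number;
}

function isWhitespace(code: number): boolean {
  // Space, tab, line feed, carriage return, non-breaking space (assumed set).
  return code === 32 || code === 9 || code === 10 || code === 13 || code === 160;
}

function countByScanning(text: string): Counts {
  let words = 0;
  let nonWhitespaceCharacters = 0;
  let inWord = false;

  // Walk the string by index so no intermediate arrays or substrings are allocated.
  for (let i = 0; i < text.length; i++) {
    if (isWhitespace(text.charCodeAt(i))) {
      inWord = false;
    } else {
      nonWhitespaceCharacters++;
      if (!inWord) {
        // A word starts at each transition from whitespace to non-whitespace.
        words++;
        inWord = true;
      }
    }
  }

  // Note: lengths here are UTF-16 code units, not user-perceived characters.
  return { words, characters: text.length, nonWhitespaceCharacters };
}
```

Calling `countByScanning("The quick  brown\nfox")` in this sketch yields 4 words, 20 characters, and 16 non-whitespace characters.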

Here's what I learned:

I've built some tests and benchmarks and am going back to using a series of string.replace, string.match, and string.split calls with RegEx. My goal is to stay under 500ms per 10,000 files/5,000,000 words on my laptop.
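The replace/match/split approach and the timing target can be sketched as follows. This is an illustration only: `countWithRegex`, `timeCounting`, and `runBenchmark` are hypothetical names, the HTML-comment stripping step is an assumed example of preprocessing, and the synthetic "lorem ipsum" data is a stand-in for a real vault.

```typescript
// Hypothetical sketch; names and the comment-stripping step are assumptions.
function countWithRegex(text: string): { words: number; nonWhitespaceCharacters: number } {
  // string.replace: drop HTML comments before counting (assumed preprocessing).
  const cleaned = text.replace(/<!--[\s\S]*?-->/g, "");
  // string.split: whitespace-delimited words; the filter removes empty edge entries.
  const words = cleaned.split(/\s+/).filter((w) => w.length > 0).length;
  // string.match: returns null when nothing matches, so fall back to an empty array.
  const nonWhitespaceCharacters = (cleaned.match(/\S/g) ?? []).length;
  return { words, nonWhitespaceCharacters };
}

// Count every file once and return the elapsed time in milliseconds.
function timeCounting(files: string[]): number {
  const start = Date.now();
  for (const file of files) {
    countWithRegex(file);
  }
  return Date.now() - start;
}

// 10,000 synthetic files of 500 words each, i.e. 5,000,000 words in total.
function runBenchmark(): void {
  const files = Array.from({ length: 10000 }, () => "lorem ipsum ".repeat(250));
  const elapsedMs = timeCounting(files);
  console.log(`${elapsedMs} ms for ${files.length} files (target: under 500 ms)`);
}
```

Calling `runBenchmark()` prints one timing line. Results depend heavily on the machine and on how realistic the sample text is, so a figure from synthetic data is only a rough guide.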

isaaclyman commented 7 months ago

Revision:

Earlier claim: There is no standard definition for "symbols" or "special characters."

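This is incorrect. The Unicode general categories for punctuation and symbols, written \p{P} and \p{S} in a regular expression (JavaScript requires the u flag), should capture all punctuation and symbols, respectively.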
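A short sketch of these two classes in a JavaScript/TypeScript regular expression. The function name `stripPunctuationAndSymbols` is hypothetical, and whether the plugin strips these characters or merely ignores them is an assumption.

```typescript
// \p{P} matches any punctuation, \p{S} any symbol; the "u" flag enables \p{...}.
const punctuationOrSymbol = /[\p{P}\p{S}]/gu;

// Hypothetical helper: remove all punctuation and symbols from a string.
function stripPunctuationAndSymbols(text: string): string {
  return text.replace(punctuationOrSymbol, "");
}

console.log(stripPunctuationAndSymbols("Hello, world! 1 + 1 = 2 €")); // "Hello world 1  1  2 "
console.log(stripPunctuationAndSymbols("你好。")); // "你好"
```

One thing to watch for: the apostrophe and the hyphen-minus both belong to \p{P}, so "don't" becomes "dont" and "well-known" becomes "wellknown" under this pattern.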

isaaclyman commented 7 months ago

Release v3.4.0 uses a new RegEx-based counter that can handle mixed-language notes without a significant performance hit (averaging about 60ms slower than the old one per 10,000 notes).

I've tried several methods of beating RegEx at its own game but have been handily defeated every time.

It may be possible to eke out more performance by skipping any counts the plugin isn't currently displaying (for example, it's possible to go through files twice as quickly if the non-whitespace character count is skipped). But I expect 99% of vaults will count in less than a second, so I'll save that optimization for later.
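The deferred optimization could look roughly like this: each count runs only when the caller asks for it. The names `CountOptions`, `CountResult`, and `countRequested` are hypothetical and do not reflect the plugin's settings or API.

```typescript
// Hypothetical sketch: compute only the counts that are currently displayed.
interface CountOptions {
  words: boolean;
  nonWhitespaceCharacters: boolean;
}

interface CountResult {
  words?: number;
  nonWhitespaceCharacters?: number;
}

function countRequested(text: string, options: CountOptions): CountResult {
  const result: CountResult = {};

  if (options.words) {
    result.words = text.split(/\s+/).filter((w) => w.length > 0).length;
  }

  // Skipped entirely when not requested, which avoids a second pass over the text.
  if (options.nonWhitespaceCharacters) {
    result.nonWhitespaceCharacters = (text.match(/\S/g) ?? []).length;
  }

  return result;
}

console.log(countRequested("a bc", { words: true, nonWhitespaceCharacters: false })); // { words: 2 }
```

Skipped counts are left undefined rather than set to zero, so a caller cannot mistake "not computed" for "empty file".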