Zeoic closed 2 weeks ago
Much easier said than done. HTML famously can't be parsed out with regex, and implementing a full HTML parser in a word count plugin doesn't seem reasonable. Markdown tables should be handled just fine; there's probably a plugin that can handle the styling for you in a separate stylesheet if you need an accurate word count.
Ah, darn. Too bad there is no way to just get the reading view from Obsidian instead of the writing view.
I unfortunately need the actual HTML tables for my use case, so I can't just use markdown tables with external styling. I guess I'll have to look for an external tool to count words instead. Thanks for the response!
Yeah, it's not the first time someone's wanted to exclude content that doesn't appear in reading view (like comments and URLs), and unfortunately I've had to parse them on a case-by-case basis. Maybe someday Obsidian will expose a "rendered plaintext" API; that would make things easier.
Sorry to reply to this closed issue again, just wanted to report that I managed to figure out some regex that strips out the tags I use and then does a rough word count with Templater. It's a little janky, but it works lol.
This gave me a thought, however. I wonder if it would be possible to have an off-by-default custom regex first-pass feature. I imagine that would at least double the lag the plugin causes, though. It would be neat from a tinkerer's point of view, but it's understandable not to want to put effort into something so niche, which is why I didn't make a new feature request. Just an idea!
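For anyone landing here later, the kind of workaround described above might look roughly like the sketch below. It is not the actual Templater script or anything from the plugin. The tag list and the names `stripKnownTags` and `roughWordCount` are made up for illustration, and the pattern only handles the listed tags, not HTML in general.

```typescript
// Hypothetical tag list: only the tags that actually appear in the notes.
const KNOWN_TAGS = ["table", "thead", "tbody", "tr", "th", "td"];

// Matches opening or closing forms of the listed tags, including attributes.
// This is NOT a general HTML parser; a ">" inside an attribute value breaks it.
const TAG_PATTERN = new RegExp(`</?(?:${KNOWN_TAGS.join("|")})\\b[^>]*>`, "gi");

function stripKnownTags(text: string): string {
  // Replace each tag with a space so adjacent cells don't merge into one word.
  return text.replace(TAG_PATTERN, " ");
}

function roughWordCount(text: string): number {
  // Rough count: whitespace-separated chunks left after stripping the tags.
  const words = stripKnownTags(text).trim().split(/\s+/);
  return words[0] === "" ? 0 : words.length;
}

const sample =
  '<table style="width: 100%"><tr><td>Hello</td><td>world</td></tr></table>';
console.log(roughWordCount(sample)); // 2
```

Tags outside the list are left alone, which is the trade-off mentioned above: it only works as long as the note sticks to the tags the pattern knows about.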
No apology necessary. I'll have to give this some thought. Regex is an insufficient tool for stripping out HTML, but if there's a regex input, that would mean it's on the user to determine what is sufficient and what's not...but it also potentially creates patterns of use where people are exchanging regex patterns to enter for various purposes, and I don't want to create a workaround culture that displaces meaningful, performant, and convenient features.
There's also the case where a user accidentally enters, say, a space or period in the regex input and then can't figure out why nothing is being counted. It's a lot of power to give users with a wide range of technical ability levels.
Like I said though, I'll think it over.
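To make the failure mode above concrete, here is a small sketch of what a stray character in a user-supplied pattern could do. The `countWithUserPattern` helper and the way the pattern is applied are assumptions for illustration only, not how the plugin works.

```typescript
// Hypothetical first pass: remove everything the user's pattern matches,
// then count whitespace-separated words in what is left.
function countWithUserPattern(text: string, userPattern: string): number {
  const stripped = text.replace(new RegExp(userPattern, "g"), " ");
  const words = stripped.trim().split(/\s+/);
  return words[0] === "" ? 0 : words.length;
}

const note = "The quick brown fox.";

// Intended use: strip one specific tag.
console.log(countWithUserPattern("<b>The</b> quick brown fox.", "</?b>")); // 4

// A lone period matches every character on the line, so nothing survives.
console.log(countWithUserPattern(note, ".")); // 0
```

In this sketch a stray space happens to be harmless because matches are replaced with a space; with an empty replacement it would glue every word together instead.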
I agree that regex isn't suitable for most HTML. In my case I was able to filter only the specific tags I use, and I just need to keep in mind not to use \<table> or what have you in my story. Definitely not a one-size-fits-all regex string.
Not wanting to foster workaround culture is a very good point; I never thought of that aspect. Even hiding the option at the bottom of the advanced settings, with red text saying it is not recommended, still won't stop some people from breaking things and complaining lol. I guess it could also help point to ideas for baked-in improvements when some regex strings get popular enough.
Problem
My current problem is that while writing a novel, I use some rather large stylized HTML tables. While the actual visible text in reading view of the table is roughly 50 words, all word count plugins seem to count the HTML itself, which makes my word count over 500.
Idea
Add an option to only count the visible words in the document while in reading view.
Example
Shows as 32 words instead of 2: ![image](https://github.com/isaaclyman/novel-word-count-obsidian/assets/12469173/70a7168d-d9ff-449d-a250-74bced440489)