Word Count Excluding Frontmatter, Comments, and Markdown Links

UXPublishing commented 11 months ago

Problem

What problem would your idea solve? How do you currently manage this problem? I'd like to create writing templates that include metadata, guidelines inside markdown comments, and links to other notes. Word count and character count would only be meaningful if I can exclude metadata, comments, and links.

Idea

Describe the feature you want to suggest. Would it be possible to calculate word count and character count for all text EXCEPT metadata, comments, links, or any other non-visible text?

isaaclyman commented 11 months ago

Metadata is already excluded from the counts.

There's no such thing as a "markdown comment." What are you referring to specifically?

As for links to other notes, the square brackets around a [[link]] don't add any extra words.

UXPublishing commented 11 months ago

Ok. Thank you for clarifying.

By comments, I mean in Obsidian when you use %%Comment%% to include text that only appears in Editor mode. I would put writing guidelines inside comments so they would be available during editing but hidden when the final writing is viewed in Preview mode.

isaaclyman commented 11 months ago

Ah, okay. I wasn't familiar with that feature. I'll look into the Obsidian API and see if there's a way to parse that out.

isaaclyman commented 11 months ago

As of v2.20.0, there's a setting in the Advanced section that allows you to exclude comments. Let me know if you have any trouble with it.

isaaclyman commented 11 months ago

Note that while metadata and comments will be excluded from character count, it will still include links. I think most users would want it to. When you're pasting markdown into another program (or even just pasting links into e.g. Twitter), the link will count against your character limit. Unfortunately there's not a way to ask Obsidian for "only text that's visible in Reading mode"; you have to parse each thing out individually.

If there are other things that you think shouldn't be included, feel free to create another issue.

UXPublishing commented 11 months ago

Thank you! I just tested the update and it works as you said it would. I have another feature request so I'll create a new issue.

danieltomasz commented 10 months ago

Would it be possible to exclude HTML comments (both inline and block comments) as well? The Obsidian writing goals plugin implemented comment exclusion via regex: https://github.com/lynchjames/obsidian-writing-goals/blob/main/src/IO/obsidian-file.ts#L21:~:text=import%20type%20%7B%20CachedMetadata,47

The problem with Obsidian comments is that they aren't recognized by other common markdown editors, and when I want to use comments in a document that will be compiled via pandoc or shared with others, I am forced to use HTML-style comments.

isaaclyman commented 10 months ago

That's a fair ask. Thanks for the code reference, I'll see if I can work that in.

danieltomasz commented 10 months ago

I opened a PR to the aforementioned plugin. I think that an even simpler regex, (%%.*?%%|<!--.*?-->) with the gmis flags, might work for both inline and block comments.

isaaclyman commented 10 months ago

@danieltomasz Thinking through a couple things here.

RegExes are known to be slower than other types of string operations. I don't have a several-thousand-note vault to test on, but I want to be thoughtful about people who do, and the added startup cost while the plugin is reanalyzing every note.
What happens if a user uses both types of comments and they overlap? Obviously a terrible idea to you and me, but I do want to think about it so it doesn't get filed as a bug later.

Example:

Day 2 of learning HTML. To make a comment in HTML, you use <!-- these marks -->.

%% This is the only HTML tag that can use an exclamation point as the first character: <!--. That way the browser won't confuse it with other tags. %%

I wonder if --> could show up with Web Components? Is -- even a valid name for a Web Component?

isaaclyman commented 10 months ago

Also, your RegEx (%%.*?%%|<!--.*?-->) doesn't work for multi-line comments. Example:

<!-- an 
HTML block comment
that should not count -->

%% A multiline
comment that should not count %%

A working RegEx would need to include line breaks, i.e. (%%[\s\S]*?%%|<!--[\s\S]*?-->).

isaaclyman commented 10 months ago

Okay, with a non-capturing group and the lazy quantifier the performance should be okay. I've done a bunch of testing on (?:%%[\s\S]+?%%|<!--[\s\S]+?-->) and it works the way you'd expect, even with overlapping comment types. Will go live in the next release.
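A minimal sketch of how a pattern like the one above might be applied, run against the overlapping example from earlier in the thread. The `<!--[\s\S]+?-->` alternative for HTML comments and the helper names `stripComments` and `countWords` are assumptions for illustration, not the plugin's actual code.

```typescript
// Assumed pattern: Obsidian %% comments or HTML comments, lazy and non-capturing.
const COMMENT_PATTERN = /(?:%%[\s\S]+?%%|<!--[\s\S]+?-->)/g;

// Hypothetical helper: removes every comment match from the text.
function stripComments(text: string): string {
  return text.replace(COMMENT_PATTERN, "");
}

// Hypothetical helper: counts whitespace-separated tokens after stripping comments.
function countWords(text: string): number {
  const visible = stripComments(text).trim();
  return visible === "" ? 0 : visible.split(/\s+/).length;
}

const note = [
  "Day 2 of learning HTML. To make a comment in HTML, you use <!-- these marks -->.",
  "",
  "%% This is the only HTML tag that can use an exclamation point as the first character: <!--. That way the browser won't confuse it with other tags. %%",
  "",
  "I wonder if --> could show up with Web Components? Is -- even a valid name for a Web Component?",
].join("\n");

// The inline HTML comment and the whole %% block are removed. The "<!--" inside
// the %% block is consumed with it, so the later stray "-->" stays in the text.
console.log(stripComments(note));
console.log(countWords(note));
```

Because matching proceeds left to right and each alternative is lazy, whichever comment type opens first wins, which is one way to read "works the way you'd expect" for overlapping comments.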

danieltomasz commented 10 months ago

If you use it specifically with the gmis flags, the simple regex I posted works on multiline comments and with overlapping comments, as I tested here https://regex101.com/r/rNlSZq/1 and with my fork of the word goal plugin.

But I am no expert with regex, and I haven't reproducibly tested performance across many files and different possible regexes, so I appreciate you making it more efficient.

The author of another plugin, Better Word Count, uses a similar approach to yours: https://github.com/lukeleppan/better-word-count/blob/d1f84150df12cef8857218022242f70423d8c1a8/src/constants.ts#L12:~:text=export%20const%20VIEW_TYPE_STATS,13

But big thanks for the update, as it works no matter which regex is used :)

danieltomasz commented 10 months ago

If you are interested in optionally excluding markdown headers and other markdown syntax (currently the header text below is counted as 2 words)

## Header

, the author of obsidian-writing-goals uses the remove-markdown library, https://github.com/lynchjames/obsidian-writing-goals/issues/8 but I don't know how this would affect performance for many files in the vault
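To illustrate the 2-word header count, here is a small sketch. It assumes words are counted by splitting on whitespace, which may not match the plugin's real counting logic; `countTokens` and `stripHeadingMarkers` are hypothetical names.

```typescript
// Assumption: a "word" is any whitespace-separated token.
function countTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical helper: drops heading markers (one to six "#" followed by
// spaces or tabs) at the start of any line.
function stripHeadingMarkers(text: string): string {
  return text.replace(/^#{1,6}[ \t]+/gm, "");
}

console.log(countTokens("## Header"));                      // 2: "##" and "Header"
console.log(countTokens(stripHeadingMarkers("## Header"))); // 1: "Header"
```

Note that a line-based pattern like this would also strip leading `#` markers from lines inside code blocks, so it is only a rough approximation.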

With the addition of the option to exclude HTML comments, your plugin and @lynchjames's now give almost the same estimates (the only difference being markdown characters), so I am happy now :)

isaaclyman commented 10 months ago

Ah, I missed the s flag on first pass...didn't even know that flag existed. TIL!
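For anyone else new to the `s` (dotAll) flag, a short sketch of the difference, using only the `%%` half of the pattern for brevity. The sample string is made up for illustration.

```typescript
const sample = "before %% A multiline\ncomment that should not count %% after";

// Without the s flag, "." does not match line breaks, so the block comment survives.
const withoutDotAll = sample.replace(new RegExp("%%.*?%%", "g"), "");

// With the s (dotAll) flag, "." also matches line breaks.
const withDotAll = sample.replace(new RegExp("%%.*?%%", "gs"), "");

// The [\s\S] character class matches any character without needing the flag.
const withCharClass = sample.replace(/%%[\s\S]*?%%/g, "");

console.log(withoutDotAll === sample);     // true: nothing was removed
console.log(withDotAll === withCharClass); // true: both remove the comment
```

In the `gmis` combination, `m` only changes how `^` and `$` behave and `i` only affects letter case, so for a pattern with neither anchors nor letters, `g` and `s` are the flags doing the work.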

remove-markdown is abandonware at the moment and consists of a long list of RegEx .replace() calls. With my reluctance to use one well-optimized RegEx, you can imagine how I feel about tacking on 20 more. Performance is always top-of-mind for me since this plugin scans the entire vault on startup.

Besides, I don't think it's theoretically possible to write a RegEx or even a series of RegExes that fully parses out Markdown. RegExes describe regular languages, whereas Markdown, with its nested structures, isn't regular. To fully meet the spec you'd need a formal compiler, probably? Maybe something could be done with Marked (which would likely outperform RegEx, too) or strip-markdown, which looks like it doesn't use RegEx at all.

In any case, I'm glad to know it's satisfactory for now. If more people start complaining about formatting marks being counted as words, I can look into something more intricate.

isaaclyman / novel-word-count-obsidian

Word Count Excluding Frontmatter, Comments, and Markdown Links #45