newsboat / newsboat

An RSS/Atom feed reader for text terminals
https://newsboat.org/
MIT License

How to use regular expression to highlight/ignore #642

Closed bertocq closed 4 years ago

bertocq commented 5 years ago
Newsboat version

```
System: Darwin 18.7.0 (x86_64)
Compiler: g++ 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)
ncurses: ncurses 5.7.20081102 (compiled with 5.7)
libcurl: libcurl/7.54.0 LibreSSL/2.6.5 zlib/1.2.11 nghttp2/1.24.1 (compiled with 7.54.0)
SQLite: 3.24.0 (compiled with 3.24.0)
libxml2: compiled with 2.9.4
```

I've been trying to configure ignoring/highlighting articles based on regular expressions without success.

For example /\bgit\b/i should match any appearance of the word "git", but not when it's part of a bigger word like digital.

But neither of these works, and I can't figure out why :/

ignore-article "*" "title =~ \"/\bfacebook\b/i\""
highlight-article "title =~ \"/\bgit\b/i\"" white blue bold

Could you either help me with the regular expression or point me to any resource that I can follow to get it right? Thanks!

Minoru commented 5 years ago

Hi! Yeah, regexes in Newsboat are under-documented. These should work:

ignore-article "*" "title =~ \"\\bfacebook\\b\""
highlight-article "title =~ \"\\bgit\\b\"" white blue bold

Specifically:

Note that ignore-article won't do anything to the articles you already fetched unless you change ignore-mode to display.

Does that answer your question?

bertocq commented 5 years ago

Thanks for your quick response 😃!

I'm ashamed I didn't realize the escaping char was \ instead of / 🤦‍♂, great catch!

Still, I tried those, but the \b word boundary matcher is not being interpreted: neither article ignoring nor highlighting works 😞.

How I reproduce it:

Config

I used the config you mentioned:

highlight-article "title =~ \"\\bgit\\b\"" white blue bold

Feed

I wrote a test feed with some positive/negative scenarios. You can use the gist's raw URL in your urls file.

https://gist.githubusercontent.com/bertocq/2829d9ac0c519cfd4fe7ed8a06b60e5b/raw/b619837a65420fbca41a7e70dfc419be4c06448d/hightlight-article_title_words_text.xml
Matching titles (should be highlighted)

```
"git" is cool
Is .git, not .gut
Git.js
.git
Learn git
Some git tricks
Git
Git got digitalized
```
Not matching titles (should not be highlighted)

```
Digit
Digital
Gittern
```
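As a sanity check of the pattern itself, independent of Newsboat, here is a minimal Python sketch that runs `\bgit\b` case-insensitively over the titles above. Python's `re` module is not the regex engine Newsboat uses, so this only confirms what the expression is meant to match; the helper name `is_highlighted` is made up for illustration, and the exact title boundaries are my best reading of the list.

```python
import re

# Word-boundary match for "git", ignoring case (the intent of /\bgit\b/i).
PATTERN = re.compile(r"\bgit\b", re.IGNORECASE)

# Titles from the test feed, split as best as can be read from the list above.
SHOULD_MATCH = [
    '"git" is cool',
    "Is .git, not .gut",
    "Git.js",
    ".git",
    "Learn git",
    "Some git tricks",
    "Git",
    "Git got digitalized",
]
SHOULD_NOT_MATCH = ["Digit", "Digital", "Gittern"]


def is_highlighted(title: str) -> bool:
    """Hypothetical helper: True if the title contains "git" as a whole word."""
    return PATTERN.search(title) is not None


if __name__ == "__main__":
    for title in SHOULD_MATCH + SHOULD_NOT_MATCH:
        print(f"{title!r}: {is_highlighted(title)}")
```

If the expression behaves as intended, every title in the first group is reported as a match and none in the second group is.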

Findings

Thinking that the solution could be again obvious, I've tried to research a bit and learned:

Minoru commented 5 years ago

I ran Newsboat (current master, 2e696502ba5d6109b6257ac33c24cd777eaab7f4) with debug logging (--log-file=newsboat.log --log-level=6) and found that \\bgit\\b is understood as \bgitb (instead of \bgit\b). To make it work properly, I had to double-escape the second backslash:

highlight-article "title =~ \"\\bgit\\\\b\"" white blue bold
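To show why the extra escaping is unexpected, here is a toy Python model of a single unescaping pass, where each `\X` pair becomes `X`. This is only an illustration under that assumption, not Newsboat's actual parser, and `unescape_once` is a hypothetical helper.

```python
def unescape_once(text: str) -> str:
    """Toy model: replace every backslash-escaped character with the character itself."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # keep the escaped character, drop the backslash
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# The filter argument as written between the outer quotes of each config line.
suggested = r'title =~ \"\\bgit\\b\"'
workaround = r'title =~ \"\\bgit\\\\b\"'

if __name__ == "__main__":
    print(unescape_once(suggested))   # title =~ "\bgit\b"
    print(unescape_once(workaround))  # title =~ "\bgit\\b"
```

Under this one-pass model the originally suggested line would already yield `\bgit\b`. The debug log shows `\bgitb` instead, which is why the workaround adds another level of escaping to the second `\b`.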

This reminds me of https://github.com/newsboat/newsboat/issues/536 ; I wonder if those issues are connected.

So you're doing everything right, @bertocq, it's just Newsboat being a bit broken =\ Re-tagging this as a bug. Thank you for the report!