mrusme / reader

reader is for your command line what the “readability” view is for modern browsers: A lightweight tool offering better readability of web pages on the CLI.
https://xn--gckvb8fzb.com/reader-web-page-readability-on-the-cli/
GNU General Public License v3.0

Fails to parse an email's HTML containing French punctuation and a quote. #16

Closed: tkapias closed 1 month ago

tkapias commented 9 months ago

I use reader as the first step in my script to produce output for the Neomutt email client's pager. The script receives the raw HTML, has reader convert it to Markdown, and then pipes that through pandoc, elinks and finally less (to add references and colors).

That's the best solution I found to get something clean, formatted and highlighted for Neomutt's HTML display.

Issue

But a few days ago, I noticed a message where reader was not displaying the sender's text, just the quoted part.

It may be related to Gmail's HTML formatting or to the text itself.

Example

The message is a reply to my previous message and was sent from Gmail. (I replaced private text with X's.)

Xxxx xxxxxxx,

Xxxxx xx xxxxxx. Xxxxxxxx xxxx x'xxx xxxxxx, xxxxx x xxxxxxx.

Xx xxxx x'xx xxxxxx x'xxx xxxxx xxx xx xxx x'xxx xxxx, xx x'xx xxxx xx xxx. xxxx x xxx x'xxx xxxx x'xx xxxx x xxx, xxxxxx, xx x'xxx xxx x'xxxx xxxx. xxx xxx x'xxx xxxx.

Xx x'xxxxxx, bonne soirée.

Tomasz

The part above the quote is not parsed by reader.

mrusme commented 9 months ago

@tkapias first of all: That's a super nifty use case! Do you happen to have dotfiles with the Neomutt config? Would be curious to try it myself. :-)

I'll have a look at the specific issue. My gut feeling is that the culprit is rather github.com/JohannesKaufmann/html-to-markdown, which reader uses for converting the HTML to Markdown.

While this is something I will dig deeper into, I have a different idea to solve this issue more elegantly. It sounds like you're already dealing with Markdown, which you pipe to Pandoc. Would it work for you if reader provided a --markdown-input option, so that the conversion from Markdown to HTML and from HTML to Markdown could be cut out?

tkapias commented 9 months ago

About the second point: a new feature for reader

To find the current pipeline with Reader, I tried maybe 20 other tools and a lot of combinations with iconv, many pagers and highlighters.

The issue was that nothing combined the specific format needed by the Neomutt pager, the display of URLs as references, and good parsing of tables and nested elements.

Email service providers love deeply nested elements and strange tables.

So the only solution I found is to use reader to parse the most important part of the message, then clean it with pandoc to get nice tables that elinks can read; elinks then adds references and some colors. And I wrap it all at 80 columns.

But if you find a way to shorten all that, it would be huge.
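As a side note, a pipeline like the one described above depends on four separate tools being installed (reader, pandoc, elinks, less). A minimal sketch of a preflight check, assuming a POSIX shell; the function name `missing_tools` is made up for illustration and is not part of reader or Neomutt:

```shell
# Hypothetical helper: print every tool from the argument list that is
# not found on PATH, one per line. Prints nothing if all are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: check the four stages of the pipeline before running it.
missing=$(missing_tools reader pandoc elinks less)
if [ -n "$missing" ]; then
  printf 'missing tools:\n%s\n' "$missing" >&2
fi
```

Running this at the top of a conversion script would make a missing stage show up as a clear message rather than as an empty pager.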

My neomutt setup

My Neomutt setup is a huge work-in-progress. I use 'mbsync', 'notmuch' and 'afew' to sync my IMAP accounts and sort the messages, and I use msmtp to send mail. All that is run by a systemd timer.

This is what the latest GitHub notification message looks like.

image

To get that pager display, I customized a lot of Neomutt's settings and colors, and used a script, called from the mailcap file, to convert text/html messages.

# Takes a temporary HTML attachment from Neomutt's autoview and returns a
# cleaned, formatted, colored output, ready for the built-in pager.
# Requires 3 arguments: filename, charset, columns

shopt -s extglob

export LC_ALL="C.UTF-8"
export TZ=:/etc/localtime

# Cap the output width at 80 columns
if [[ $3 -lt 80 ]]; then
  _columns=$3
else
  _columns=80
fi

reader --image-mode none --markdown-output --terminal-width "$_columns" "$1" |
  pandoc -f commonmark+emoji+pipe_tables -t html+empty_paragraphs --wrap auto --columns "$_columns" --preserve-tabs --tab-stop 2 |
  elinks -no-connect 1 -localhost 1 -dump 1 -dump-color-mode 4 --force-html -dump-width "$_columns" |
  LESS_COLUMNS=$_columns less -QRXs
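The width handling in the script caps the column count at 80. A slightly more defensive variant is sketched below, assuming a POSIX shell. The function name `clamp_columns` is hypothetical, and unlike the original test it also falls back to the maximum when the argument is empty, zero, or not a number:

```shell
# Hypothetical helper: print WIDTH if it is a positive integer smaller than
# MAX (default 80), otherwise print MAX.
clamp_columns() {
  width=$1
  max=${2:-80}
  case $width in
    ''|*[!0-9]*) width=$max ;;   # empty or non-numeric: fall back to MAX
  esac
  if [ "$width" -gt 0 ] && [ "$width" -lt "$max" ]; then
    printf '%s\n' "$width"
  else
    printf '%s\n' "$max"
  fi
}

# In the script, the third argument is the column count.
_columns=$(clamp_columns "${3:-}")
```

With this, a narrow terminal keeps its own width while anything wider, or any missing value, is wrapped at 80 columns.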


- My Elinks config is custom too, and it may be important:

## ELinks 0.16.1.1 configuration file

set config.comments = 3
set config.indentation = 2
set config.saving_style = 3
set document.browse.images.display_style = 2
set document.browse.images.image_link_tagging = 1
set document.browse.images.image_link_prefix = "["
set document.browse.images.image_link_suffix = "]"
set document.browse.images.label_maxlen = 0
set document.browse.images.show_as_links = 1
set document.browse.images.show_any_as_links = 1
set document.browse.links.active_link.enable_color = 1
set document.browse.links.color_dirs = 1
set document.browse.links.numbering = 1
set document.browse.links.show_goto = 1
set document.browse.links.label_key = "0123456789"
set document.browse.margin_width = 2
set document.browse.preferred_document_width = 80
set document.browse.use_preferred_document_width = 1
set document.codepage.force_assumed = 0
set document.colors.text = "#c3c3c3"
set document.colors.background = "#011627"
set document.colors.link = "#5555ff"
set document.colors.vlink = "#5555ff"
set document.colors.image = "#ff8888"
set document.colors.bookmark = "#5555ff"
set document.colors.use_link_number_color = 1
set document.colors.link_number = "#21c7a8"
set document.colors.increase_contrast = 0
set document.colors.ensure_contrast = 0
set document.colors.use_document_colors = 0
set document.dump.codepage = "System"
set document.dump.color_mode = 4
set document.dump.numbering = 1
set document.dump.references = 1
set document.dump.terminal_hyperlinks = 0
set document.dump.separator = "

"
set document.dump.width = 80
set document.html.display_frames = 1
set document.html.display_iframes = 0
set document.html.display_tables = 1
set document.html.display_subs = 1
set document.html.display_sups = 1
set document.html.link_display = 2
set document.html.underline_links = 1
set document.html.wrap_nbsp = 1
set document.plain.display_links = 0
set document.plain.compress_empty_lines = 1
set document.plain.fixup_tables = 1
set terminal.rxvt-unicode.charset = "UTF-8"
set terminal.rxvt-unicode.underline = 1
set terminal.rxvt-unicode.italic = 1
set terminal.rxvt-unicode.transparency = 1
set terminal.rxvt-unicode.colors = 4
set terminal.rxvt-unicode.block_cursor = 1
set terminal.rxvt-unicode.restrict_852 = 0
set terminal.rxvt-unicode.combine = 1
set terminal.rxvt-unicode.utf_8_io = 1
set terminal.rxvt-unicode.m11_hack = 1
set terminal.rxvt-unicode.latin1_title = 0
set terminal.rxvt-unicode.type = 2
set terminal.tmux-256color.underline = 1
set terminal.tmux-256color.italic = 1
set terminal.tmux-256color.transparency = 1
set terminal.tmux-256color.colors = 4
set terminal.tmux-256color.block_cursor = 1
set terminal.tmux-256color.restrict_852 = 0
set terminal.tmux-256color.combine = 1
set terminal.tmux-256color.utf_8_io = 1
set terminal.tmux-256color.m11_hack = 0
set terminal.tmux-256color.latin1_title = 0
set terminal.tmux-256color.type = 2
set terminal.tmux-direct.charset = "UTF-8"
set terminal.tmux-direct.underline = 1
set terminal.tmux-direct.italic = 1
set terminal.tmux-direct.transparency = 1
set terminal.tmux-direct.colors = 4
set terminal.tmux-direct.block_cursor = 1
set terminal.tmux-direct.restrict_852 = 0
set terminal.tmux-direct.combine = 1
set terminal.tmux-direct.utf_8_io = 1
set terminal.tmux-direct.m11_hack = 0
set terminal.tmux-direct.latin1_title = 0
set terminal.tmux-direct.type = 2

mrusme commented 2 months ago

Sorry for the long delay. I have found the reason why your mail is being mangled and I have started implementing a fix in Journalist that is needed to implement a fix in reader. However, it turns out that one crucial dependency that reader has been using -- github.com/tinoquang/go-cloudflare-scraper -- has vanished, making it impossible for me to build a new version of reader atm.

I am working on fixing the dependency issue and will then implement the fix for your use case.

mrusme commented 1 month ago

A fix for this issue was implemented. You can now use the -r option of reader for your scripts and it won't mangle your mails.