Currently Coldsweat does very little to format feed entries. It optionally scans entries for images and links pointing to blacklisted domains and removes them, and nothing more.
As a result, entries are mostly rendered as-is with generic CSS styles applied, which mostly works. However, some entries are written like this:
Aenean lacinia bibendum nulla sed consectetur. Sed posuere consectetur est at lobortis. Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Cras justo odio, dapibus ac facilisis in, egestas eget quam.
<br>
<br>
Etiam porta sem malesuada magna mollis euismod. Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Cras mattis consectetur purus sit amet fermentum. Vestibulum id ligula porta felis euismod semper. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
Or worse:
Etiam porta sem malesuada magna mollis euismod. Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Cras mattis consectetur purus sit amet fermentum. Vestibulum id ligula porta felis euismod semper. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
<br>
<br>
<br>
<br>
<br>
<br>
[eof]
This adds a huge amount of padding at the end of the entry, which makes me think there's room for improvement.
One idea is to pass each entry's content through a processor which strips most of the HTML tags, keeping only the necessary formatting hints. Think of something like HTML -> Markdown and then Markdown -> HTML.
Empty elements
Empty elements like <p></p> or <td></td> will be stripped. Multiple consecutive occurrences of <br> will be removed too.
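A minimal sketch of this rule, using regular expressions from the Python standard library. The function name strip_empty and the list of tags checked for emptiness are assumptions made for illustration, not existing Coldsweat code; a real implementation would probably work on a parsed tree rather than on raw text.

```python
import re

# Hypothetical list of tags to test for emptiness; only p and td come from the text above.
EMPTY_ELEMENT = re.compile(r"<(p|td|div|span|li)\b[^>]*>\s*</\1\s*>", re.IGNORECASE)
# Two or more consecutive <br>, <br/> or <br /> tags, with optional whitespace between them.
BR_RUN = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)


def strip_empty(html):
    """Remove empty elements and runs of consecutive <br> tags."""
    previous = None
    # Repeat until nothing changes, so nested empties like <td><p></p></td> vanish too.
    while previous != html:
        previous = html
        html = EMPTY_ELEMENT.sub("", html)
    # Runs of <br> are removed outright, as described above; a single <br> is kept.
    html = BR_RUN.sub("", html)
    return html.strip()
```

Applied to the "Or worse" example, this would drop the six trailing <br> tags and leave only the paragraph text.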
Allowed tags
Non-empty tags left as-is while parsing will be: p, table and all its child elements, ul, ol, dl, li, dt, dd, b, blockquote, strong, i, em, code, var, kbd, img, figure and figcaption, etc.
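The whitelist could be sketched with html.parser from the standard library. The names ALLOWED_TAGS, TagFilter and filter_tags are hypothetical, and the set below spells out the table child elements and uses the standard tag names b and kbd, which is an assumption about what the list intends. Tags outside the set are dropped while their text content is kept.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist built from the list above.
ALLOWED_TAGS = {
    "p", "table", "thead", "tbody", "tfoot", "tr", "th", "td", "caption",
    "ul", "ol", "dl", "li", "dt", "dd", "b", "blockquote", "strong",
    "i", "em", "code", "var", "kbd", "img", "figure", "figcaption",
}


class TagFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Keep the tag exactly as written, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />; emit it once, with no end tag.
        if tag in ALLOWED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # img is a void element and never gets a closing tag.
        if tag in ALLOWED_TAGS and tag != "img":
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so escape the text again.
        self.parts.append(escape(data, quote=False))


def filter_tags(html):
    parser = TagFilter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

For instance, a font or div wrapper around a paragraph would disappear, while the p, em and img tags inside it would survive untouched.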
Script blocks
Script blocks are already removed by Feedparser.
Allowed attributes
Most formatting attributes like "style", "align", etc. will be stripped. This will help us reformat content, especially inline replaced elements like embedded images.
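Attribute stripping could follow the same approach. In this sketch the names STRIPPED_ATTRS, AttributeStripper and strip_attributes are hypothetical, and every attribute in the set other than "style" and "align" is an assumption about what "etc." might cover. Tags are rebuilt from the parsed attributes, so quoting is normalized to double quotes.

```python
from html import escape
from html.parser import HTMLParser

# Only "style" and "align" come from the text above; the rest are assumed.
STRIPPED_ATTRS = {"style", "align", "valign", "bgcolor", "border", "width", "height"}


class AttributeStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _open(self, tag, attrs, closing=">"):
        # Rebuild the tag from the attributes that are not stripped.
        # A valueless attribute (value is None) is written as name="".
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name not in STRIPPED_ATTRS
        )
        self.parts.append("<%s%s%s" % (tag, kept, closing))

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text arrives decoded, so escape it again on the way out.
        self.parts.append(escape(data, quote=False))


def strip_attributes(html):
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

An embedded image keeps its src and alt this way but loses inline floats and alignment, which leaves its layout to the reader's own stylesheet.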