ndmitchell / tagsoup

Haskell library for parsing and extracting information from (possibly malformed) HTML/XML documents

`innerText` alternative with more expected spacing #90

Open tysonzero opened 2 years ago

tysonzero commented 2 years ago

Currently `innerText` converts `<p>foo</p><p>bar</p>` to `foobar`.

It would be nice to have a `toText` that does something like the following:

toText $ parseTags "<p>foo</p><p>bar</p>"
-- "foo\nbar"
toText $ parseTags "foo<br>bar"
-- "foo\nbar"
toText $ parseTags "foo<br><br><br>bar"
-- "foo\n\n\nbar"
toText $ parseTags "click <a>me</a>"
-- "click me"
toText $ parseTags "foo <em>bar</em> baz"
-- "foo bar baz"
toText $ parseTags "foo <div>bar</div> baz"
-- "foo\nbar\nbaz"
toText $ parseTags "<p>   hello      world    </p>"
-- "hello world"
toText $ parseTags "<div>foo</div><div>  </div><div>\n</div><div>bar</div>"
-- "foo\nbar"
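
For illustration, here is a rough sketch of how a function with this behaviour could be built on top of tagsoup's `parseTags`. It is only one possible approach, not the html-to-text implementation and not anything tagsoup ships. The names `Piece`, `blockTags`, `toPiece`, `mergeText` and `render` are hypothetical helpers, and the list of block-level tags is an assumption that would need extending for real documents. The idea is to turn tags into text runs, collapsible block boundaries and hard `<br>` breaks, normalise whitespace inside each text run, and then join the result.

```haskell
module Main where

import Data.Char (toLower)
import Text.HTML.TagSoup (Tag (..), parseTags)

-- Intermediate pieces (hypothetical helper type, not part of tagsoup).
data Piece
  = Txt String -- a run of text
  | Soft       -- block boundary; adjacent ones collapse into one newline
  | Hard       -- <br>; always produces a newline
  deriving (Eq, Show)

-- Assumed set of block-level elements; deliberately incomplete.
blockTags :: [String]
blockTags =
  [ "p", "div", "ul", "ol", "li", "blockquote", "pre", "table", "tr"
  , "h1", "h2", "h3", "h4", "h5", "h6"
  ]

-- Map one tag to zero or one pieces; inline tags such as <a> or <em>
-- fall through to the last equation and vanish.
toPiece :: Tag String -> [Piece]
toPiece (TagText s) = [Txt s]
toPiece (TagOpen name _)
  | n == "br" = [Hard]
  | n `elem` blockTags = [Soft]
  where
    n = map toLower name
toPiece (TagClose name)
  | map toLower name `elem` blockTags = [Soft]
toPiece _ = []

-- Join adjacent text runs, collapse whitespace inside them, and drop
-- runs that were whitespace only.
mergeText :: [Piece] -> [Piece]
mergeText (Txt a : Txt b : rest) = mergeText (Txt (a ++ b) : rest)
mergeText (Txt a : rest) =
  case unwords (words a) of
    "" -> mergeText rest
    t -> Txt t : mergeText rest
mergeText (p : rest) = p : mergeText rest
mergeText [] = []

-- Render pieces: leading and trailing block boundaries are dropped,
-- runs of them give a single newline, and every <br> gives its own.
render :: [Piece] -> String
render = go . dropWhile (== Soft)
  where
    go [] = ""
    go (Txt t : rest) = t ++ go rest
    go (Hard : rest) = '\n' : go (dropWhile (== Soft) rest)
    go (Soft : rest) =
      case dropWhile (== Soft) rest of
        [] -> ""
        rest'@(Hard : _) -> go rest' -- the <br> supplies the newline
        rest' -> '\n' : go rest'

toText :: [Tag String] -> String
toText = render . mergeText . concatMap toPiece

main :: IO ()
main =
  mapM_
    (print . toText . parseTags)
    [ "<p>foo</p><p>bar</p>"
    , "foo <div>bar</div> baz"
    , "<p>   hello      world    </p>"
    ]
```

Treating a block boundary as a collapsible separator rather than a literal newline is what lets the empty `<div>`s in the last example disappear, while `<br>` stays a hard break so that three of them still give three newlines. How a `<br>` directly next to a block boundary should behave is not covered by the examples; the sketch lets the `<br>` win.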

tysonzero commented 2 years ago

https://github.com/polimorphic/html-to-text

Feel free to merge whichever parts are useful into tagsoup.

ndmitchell commented 1 year ago

Thanks for the comment - I'm not really actively maintaining tagsoup, so I will try to find a different maintainer who can decide whether this is worthwhile.