html-to-text / node-html-to-text

Advanced html to text converter

How do you decode & convert in one pass? #307

Closed by nrathi 1 year ago

nrathi commented 1 year ago

The goal

I want to extract the text from a string that looks like this:

"<p><b>This hardcover book titled "Phantom Prey" by John Sandford was published in 2008 by Penguin Publishing Group. It is a thrilling work of fiction that falls under the categories of "Thrillers/General," "Thrillers/Suspense," and "Mystery & Detective/General." The book measures 9.3in x 6.3in x 1.2in and has 384 pages. It is written in English and weighs 20 oz.</b></p><br /><p><b>The story is about a detective named Lucas Davenport who solves a series of murders in Minneapolis. The book is in excellent condition and ready to be enjoyed by someone who loves a good thriller. Get your hands on this exciting read and enjoy the rollercoaster ride of suspense and mystery.</b></p>"

Best attempt

const res = convert(convert(str), { wordwrap: false });

This seems to work, but it seems wrong.

The question

Is there a way to do what I want, without calling the function twice?

KillyMXI commented 1 year ago

You have text with all HTML entities encoded/escaped. You need some other way to decode the HTML entities first, so that you get the actual HTML back. There are many ways to do this, and I won't be choosing a specific one for you.

Calling convert twice works only as a byproduct of convert also decoding HTML entities by default: the first call decodes the entities and returns the HTML as text, and the second call converts that HTML.
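
To illustrate the "decode first" step, here is a minimal sketch. It assumes the input only uses a handful of named entities plus numeric references. `decodeBasicEntities` is a hypothetical helper written for this example, not part of html-to-text; a dedicated entity-decoding library would cover the full entity table and edge cases.

```javascript
// Hypothetical helper for illustration only (not an html-to-text API).
// Covers a small subset of named entities; real input may need many more.
const NAMED_ENTITIES = { lt: '<', gt: '>', amp: '&', quot: '"', apos: "'", nbsp: '\u00a0' };

function decodeBasicEntities(input) {
  // A single pass over the string: "&amp;lt;" becomes "&lt;" and is not decoded again.
  return input.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

const escaped = '&lt;p&gt;&lt;b&gt;Mystery &amp;amp; Detective&lt;/b&gt;&lt;/p&gt;';
const html = decodeBasicEntities(escaped);
console.log(html); // <p><b>Mystery &amp; Detective</b></p>

// The decoded HTML could then go through html-to-text once, as in the question:
// const res = convert(html, { wordwrap: false });
```

With the entities decoded up front, convert only has to run once, and any entities still present in the decoded HTML (such as `&amp;` above) are handled by that single call.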

nrathi commented 1 year ago

Okay thank you, I'll just call it twice then for simplicity.

KillyMXI commented 1 year ago

Yes, this might work fine, as long as you don't care about some performance overhead.