html-to-text / node-html-to-text

Advanced html to text converter

Hope to get contents of <title> and "description" #278

Closed: yuribrown closed this issue 1 year ago

yuribrown commented 1 year ago

I hope to get the contents of <title> and the "description" meta tag of an HTML page. Can this feature be added?

KillyMXI commented 1 year ago

There are other packages that are made to extract various metadata from an HTML page.

`html-to-text` does only one thing: it outputs page content in a readable form. I won't exclude the possibility that extra features might appear at some point, but I'm not sure how to make them fit the project naturally.

That being said, you can have the `<head>` section of the page processed and get some parts of it included in the output text.

```js
const { convert } = require('html-to-text');

const options = {
  // include head
  baseElements: { selectors: ['head', 'body'] },
  // need a custom formatter for meta elements,
  // I also made a custom formatter for title to have an easily identifiable block
  formatters: {
    titleFormatter: function (elem, walk, builder, formatOptions) {
      builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 2 });
      builder.addInline('title: ');
      walk(elem.children, builder);
      builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 2 });
    },
    metaContentFormatter: function (elem, walk, builder, formatOptions) {
      builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 2 });
      builder.addInline(elem.attribs['name'] + ': ' + elem.attribs['content']);
      builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 2 });
    }
  },
  // skip everything in the head except title and description
  selectors: [
    { selector: 'head > *', format: 'skip' },
    { selector: 'head > title', format: 'titleFormatter' },
    { selector: 'head > meta[name="description"]', format: 'metaContentFormatter' },
  ]
};

// `html` is assumed to hold the page source as a string
const text = convert(html, options);
```

yuribrown commented 1 year ago

Thanks!
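To see what the two custom formatters from KillyMXI's reply actually emit, they can be exercised in isolation. The sketch below is only an illustration under stated assumptions: `FakeBuilder` and the `walk` function are hypothetical stand-ins written for this example (they are not the library's real builder or walker), and the element objects are hand-built to mimic the parsed-node shape (`attribs`, `children`, text nodes with `data`). Line-break handling is reduced to one blank line between blocks.

```javascript
// Hypothetical stand-in for the library's builder: it records only the
// three calls the formatters make and keeps each block as one string.
class FakeBuilder {
  constructor() { this.blocks = []; }
  openBlock() { this.blocks.push(''); }            // start a new block
  addInline(text) { this.blocks[this.blocks.length - 1] += text; }
  closeBlock() {}                                  // nothing to do in this sketch
}

// Hypothetical walker: appends the text of plain text nodes only.
function walk(nodes, builder) {
  for (const node of nodes) {
    if (node.type === 'text') builder.addInline(node.data);
  }
}

// The two formatters, copied from the reply above.
const formatters = {
  titleFormatter: function (elem, walk, builder, formatOptions) {
    builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 2 });
    builder.addInline('title: ');
    walk(elem.children, builder);
    builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 2 });
  },
  metaContentFormatter: function (elem, walk, builder, formatOptions) {
    builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 2 });
    builder.addInline(elem.attribs['name'] + ': ' + elem.attribs['content']);
    builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 2 });
  }
};

// Hand-built nodes standing in for <title> and <meta name="description">.
const title = { name: 'title', attribs: {}, children: [{ type: 'text', data: 'Example page' }] };
const meta = {
  name: 'meta',
  attribs: { name: 'description', content: 'A short summary.' },
  children: []
};

const builder = new FakeBuilder();
formatters.titleFormatter(title, walk, builder, {});
formatters.metaContentFormatter(meta, walk, builder, {});

const result = builder.blocks.join('\n\n');
console.log(result);
```

With the real library the spacing between blocks is governed by the `leadingLineBreaks` and `trailingLineBreaks` values, so the exact blank lines in the converted text may differ from this simplified output.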