Closed (yuribrown closed this 1 year ago)
There are other packages that are made to extract various metadata from an HTML page.
html-to-text does only one thing: it outputs page content in a readable form.
I won't rule out extra features appearing at some point, but I'm not sure how to make them fit the project naturally.
That being said, you can have the <head> section of the page processed and get some parts of it included in the output text.
{
  // include head
  baseElements: { selectors: ['head', 'body'] },
  // need a custom formatter for meta elements,
  // I also made a custom formatter for title to have an easily identifiable block
  formatters: {
    titleFormatter: function (elem, walk, builder, formatOptions) {
      builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 2 });
      builder.addInline('title: ');
      walk(elem.children, builder);
      builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 2 });
    },
    metaContentFormatter: function (elem, walk, builder, formatOptions) {
      builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 2 });
      builder.addInline(elem.attribs['name'] + ': ' + elem.attribs['content']);
      builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 2 });
    }
  },
  // skip everything in the head except title and description
  selectors: [
    { selector: 'head > *', format: 'skip' },
    { selector: 'head > title', format: 'titleFormatter' },
    { selector: 'head > meta[name="description"]', format: 'metaContentFormatter' },
  ]
}
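To make it easier to see what the two formatters emit, here is a rough sketch that runs them without the library. StubBuilder, the walk function, the sample elements and renderHead are all hypothetical stand-ins that I made up for illustration. They only mimic the three builder calls used above and are not the real html-to-text builder, so the real output spacing may differ.

```javascript
// HYPOTHETICAL stand-in for the builder passed to formatters.
// It ignores the line-break options and just collects one string per block.
class StubBuilder {
  constructor() {
    this.blocks = [];
    this.current = '';
  }
  openBlock() {
    this.current = '';
  }
  addInline(text) {
    this.current += text;
  }
  closeBlock() {
    this.blocks.push(this.current);
    this.current = '';
  }
  toString() {
    // Assumption: blocks are separated by one blank line.
    return this.blocks.join('\n\n');
  }
}

// Same formatter bodies as in the options above.
const formatters = {
  titleFormatter: function (elem, walk, builder, formatOptions) {
    builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 2 });
    builder.addInline('title: ');
    walk(elem.children, builder);
    builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 2 });
  },
  metaContentFormatter: function (elem, walk, builder, formatOptions) {
    builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 2 });
    builder.addInline(elem.attribs['name'] + ': ' + elem.attribs['content']);
    builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 2 });
  }
};

// HYPOTHETICAL walk: only emits text nodes, nothing else.
function walk(nodes, builder) {
  for (const node of nodes) {
    if (node.type === 'text') {
      builder.addInline(node.data);
    }
  }
}

// Hand-built objects shaped like parsed elements (attribs + children).
const titleElem = {
  name: 'title',
  attribs: {},
  children: [{ type: 'text', data: 'Example page' }]
};
const metaElem = {
  name: 'meta',
  attribs: { name: 'description', content: 'A short summary.' },
  children: []
};

// Run both formatters against the stub and return the collected text.
function renderHead() {
  const builder = new StubBuilder();
  formatters.titleFormatter(titleElem, walk, builder, {});
  formatters.metaContentFormatter(metaElem, walk, builder, {});
  return builder.toString();
}

console.log(renderHead());
```

With the real package you would not need any of the stubs: the options object above goes in as the second argument to convert, as in convert(html, options).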
Thanks!
Hope to get contents of