mojolicious / dom.js

:crystal_ball: A fast and very small HTML/XML DOM parser with CSS selectors
https://www.npmjs.com/package/@mojojs/dom
MIT License

[![](https://github.com/mojolicious/dom.js/workflows/test/badge.svg)](https://github.com/mojolicious/dom.js/actions) [![Coverage Status](https://coveralls.io/repos/github/mojolicious/dom.js/badge.svg?branch=main)](https://coveralls.io/github/mojolicious/dom.js?branch=main) [![npm](https://img.shields.io/npm/v/@mojojs/dom.svg)](https://www.npmjs.com/package/@mojojs/dom)

A fast and very small HTML/XML DOM parser with CSS selectors for Node.js and browsers. Written in TypeScript.

```js
import DOM from '@mojojs/dom';

// Parse
const dom = new DOM('<div><p id="a">Test</p><p id="b">123</p></div>');

// Find
console.log(dom.at('#b').text());
console.log(dom.find('p').map(el => el.text()).join('\n'));
console.log(dom.find('[id]').map(el => el.attr.id).join('\n'));

// Modify
dom.at('div p').append('<p id="c">456</p>');
dom.find(':not(p)').forEach(el => el.strip());

// Render
console.log(dom.toString());
```

#### HTML and XML

Two input formats are currently supported: HTML and XML fragments. By default we use a very relaxed custom parser that will try to make sense of whatever tag soup you hand it. In HTML mode, all tag and attribute names are lowercased, so selectors need to be lowercase as well.

```js
// HTML
const htmlDom = new DOM('<p>Hello World!</p>');

// XML
const xmlDom = new DOM('<link>http://example.com</link>', {xml: true});
```

#### Nodes and Elements

When we parse an HTML/XML document or fragment, it gets turned into a tree of nodes.

```html
<!DOCTYPE html><html><head><title>Hello</title></head><body>World!</body></html>
```

There are currently eight different kinds of nodes: `#cdata`, `#comment`, `#doctype`, `#document`, `#element`, `#fragment`, `#pi`, and `#text`.

```
#fragment
  |- #doctype (html)
  +- #element (html)
     |- #element (head)
     |  +- #element (title)
     |     +- #text (Hello)
     +- #element (body)
        +- #text (World!)
```

While nodes such as `#document` and `#fragment` can be represented by `DOM` objects, features like `dom.attr` and `dom.tag` will not work for them.

#### CSS Selectors

All CSS selectors that make sense for a standalone parser are supported.

| Pattern | Represents |
| --- | --- |
| `*` | any element |
| `E` | an element of type E |
| `E:not(s1, s2, …)` | an E element that does not match either compound selector s1 or compound selector s2 |
| `E:is(s1, s2, …)` | an E element that matches compound selector s1 and/or compound selector s2 |
| `E.warning` | an E element belonging to the class warning |
| `E#myid` | an E element with ID equal to myid |
| `E[foo]` | an E element with a foo attribute |
| `E[foo="bar"]` | an E element whose foo attribute value is exactly equal to bar |
| `E[foo="bar" i]` | an E element whose foo attribute value is exactly equal to any (ASCII-range) case-permutation of bar |
| `E[foo="bar" s]` | an E element whose foo attribute value is exactly and case-sensitively equal to bar |
| `E[foo~="bar"]` | an E element whose foo attribute value is a list of whitespace-separated values, one of which is exactly equal to bar |
| `E[foo^="bar"]` | an E element whose foo attribute value begins exactly with the string bar |
| `E[foo$="bar"]` | an E element whose foo attribute value ends exactly with the string bar |
| `E[foo*="bar"]` | an E element whose foo attribute value contains the substring bar |
| `E:any-link` | an E element being the source anchor of a hyperlink |
| `E:link` | an E element being the source anchor of a hyperlink of which the target is not yet visited |
| `E:visited` | an E element being the source anchor of a hyperlink of which the target is already visited |
| `E:checked` | a user interface element E that is checked/selected (for instance a radio-button or checkbox) |
| `E:root` | an E element, root of the document |
| `E:empty` | an E element that has no children (neither elements nor text) except perhaps white space |
| `E:nth-child(n [of S]?)` | an E element, the n-th child of its parent matching S |
| `E:nth-last-child(n [of S]?)` | an E element, the n-th child of its parent matching S, counting from the last one |
| `E:first-child` | an E element, first child of its parent |
| `E:last-child` | an E element, last child of its parent |
| `E:only-child` | an E element, only child of its parent |
| `E:nth-of-type(n)` | an E element, the n-th sibling of its type |
| `E:nth-last-of-type(n)` | an E element, the n-th sibling of its type, counting from the last one |
| `E:first-of-type` | an E element, first sibling of its type |
| `E:last-of-type` | an E element, last sibling of its type |
| `E:only-of-type` | an E element, only sibling of its type |
| `E:text(string)` | an E element containing text content that substring matches the given string case-insensitively |
| `E:text(/pattern/i)` | an E element containing text content that regex matches the given pattern |
| `E F` | an F element descendant of an E element |
| `E > F` | an F element child of an E element |
| `E + F` | an F element immediately preceded by an E element |
| `E ~ F` | an F element preceded by an E element |

All supported CSS4 selectors are considered experimental and might change as the spec evolves.

### API

Everything you need to extract information from HTML/XML documents and make changes to the DOM tree.

```js
// Parse HTML
const dom = new DOM('<div>Hello World!</div>');

// Render `DOM` object to HTML
const html = dom.toString();

// Create a new `DOM` object with one HTML tag
const div = DOM.newTag('div', {class: 'greeting'}, 'Hello World!');
```

Navigate the DOM tree with and without CSS selectors.

```js
// Find one element matching the CSS selector and return it as `DOM` object
const div = dom.at('div > p');

// Find all elements matching the CSS selector and return them as `DOM` objects
const divs = dom.find('div > p');

// Get root element as `DOM` object (document or fragment node)
const root = dom.root();

// Get parent element as `DOM` object
const parent = dom.parent();

// Get all ancestor elements as `DOM` objects (optionally filtered by a CSS selector)
const ancestors = dom.ancestors();
const matchingAncestors = dom.ancestors('div > p');

// Get all child elements as `DOM` objects (optionally filtered by a CSS selector)
const children = dom.children();
const matchingChildren = dom.children('div > p');

// Get all sibling elements before this element as `DOM` objects
const preceding = dom.preceding();
const matchingPreceding = dom.preceding('div > p');

// Get all sibling elements after this element as `DOM` objects
const following = dom.following();
const matchingFollowing = dom.following('div > p');

// Get sibling element before this element as `DOM` object
const previous = dom.previous();

// Get sibling element after this element as `DOM` object
const next = dom.next();
```

Extract information and manipulate elements.
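Two of the manipulation methods listed below, `remove()` and `strip()`, are easy to mix up. The following is only a toy sketch of the difference, using plain objects in place of real nodes. The helpers `element`, `text`, `render`, `removeNode`, and `stripNode` are hypothetical names invented for this illustration and are not part of `@mojojs/dom`, nor do they reflect how the library stores its tree.

```javascript
// Toy model only: plain objects stand in for nodes (not the library's internals)
const text = value => ({type: 'text', value});
const element = (tag, ...children) => ({type: 'element', tag, children});

// Render a toy node back to markup
function render(node) {
  if (node.type === 'text') return node.value;
  return `<${node.tag}>${node.children.map(render).join('')}</${node.tag}>`;
}

// "remove" semantics: drop the node together with everything below it
function removeNode(parent, node) {
  parent.children = parent.children.filter(child => child !== node);
}

// "strip" semantics: drop the node but splice its children into its place
function stripNode(parent, node) {
  parent.children = parent.children.flatMap(child => (child === node ? node.children : [child]));
}

// Build a fresh <div>Hello <em>World</em>!</div> tree for each experiment
const buildTree = () => {
  const em = element('em', text('World'));
  const root = element('div', text('Hello '), em, text('!'));
  return {root, em};
};

const removed = buildTree();
removeNode(removed.root, removed.em);
const removedHtml = render(removed.root);

const stripped = buildTree();
stripNode(stripped.root, stripped.em);
const strippedHtml = render(stripped.root);

console.log(removedHtml); // <div>Hello !</div>
console.log(strippedHtml); // <div>Hello World!</div>
```

In other words, removing loses the text inside the element, while stripping keeps it and only discards the surrounding tag. The reference snippet that follows lists the actual methods.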
```js
// Check if element matches the given CSS selector
const isMatch = dom.matches('div > p');

// Extract text content from element
const greeting = dom.text();
const recursiveGreeting = dom.text({recursive: true});

// Get element tag
const tag = dom.tag;

// Set element tag
dom.tag = 'div';

// Get element attribute value (`class` is a reserved word, so use another variable name)
const className = dom.attr.class;

// Set element attribute value
dom.attr.class = 'whatever';

// Remove element attribute
delete dom.attr.class;

// Get element attribute names
const names = Object.keys(dom.attr);

// Get element's rendered content
const content = dom.content();

// Get form value
const inputValue = dom.at('input').val();
const optionValue = dom.at('option').val();
const selectValue = dom.at('select').val();
const textareaValue = dom.at('textarea').val();
const buttonValue = dom.at('button').val();

// Find this element's namespace
const namespace = dom.namespace();

// Get a unique CSS selector for this element
const selector = dom.selector();

// Remove element and its children
dom.remove();

// Remove element but preserve its children
dom.strip();

// Replace element and its children
dom.replace('<p>Hello World!</p>');

// Replace this element's content
dom.replaceContent('<p>Hello World!</p>');

// Append HTML/XML fragment after this element
dom.append('<p>Hello World!</p>');

// Append HTML/XML fragment to this element's content
dom.appendContent('<p>Hello World!</p>');

// Prepend HTML/XML fragment before this element
dom.prepend('<p>Hello World!</p>');

// Prepend HTML/XML fragment to this element's content
dom.prependContent('<p>Hello World!</p>');

// Wrap HTML/XML fragment around this element
dom.wrap('<div></div>');

// Wrap HTML/XML fragment around the content of this element
dom.wrapContent('<div></div>');
```

There is also a node-level API that you can use, for example, to extend the `DOM` class. It is still in flux, however, and therefore not fully documented yet.

```js
// Remove comment nodes that are children of this element
dom.currentNode.childNodes
  .filter(node => node.nodeType === '#comment')
  .forEach(node => node.detach());

// Extract text surrounding this element
const text = dom.currentNode.parentNode.childNodes
  .filter(node => node.nodeType === '#text')
  .map(node => node.value)
  .join('');
```

#### Custom Parsers

Additional input formats, such as fully spec-compliant HTML5 documents, can be supported with custom parsers. There is a [parse5](https://www.npmjs.com/package/parse5) based [example included in this distribution](https://github.com/mojolicious/dom.js/blob/main/test/support/parse5.js) that we use for testing.

```js
import DOM, {DocumentNode, FragmentNode} from '@mojojs/dom';

// Minimal custom HTML/XML parser that only creates the document/fragment objects
class Parser {
  parse(text) {
    return new DocumentNode();
  }

  parseFragment(text) {
    return new FragmentNode();
  }
}

// Parse HTML with a custom parser
const dom = new DOM('<p>Hello World!</p>', {parser: new Parser()});
```

## Installation

All you need is Node.js 16.0.0 (or newer).

```
$ npm install @mojojs/dom
```

## Support

If you have any questions the documentation might not yet answer, don't hesitate to ask in the [Forum](https://github.com/mojolicious/mojo.js/discussions), on [Matrix](https://matrix.to/#/#mojo:matrix.org), or [IRC](https://web.libera.chat/#mojo).