Define selector syntax for document subsets

nichtich commented 6 years ago

Selection of document subsets, such as single chapters or paragraphs, should be supported by introducing Pandoc document selectors, similar to XPath/XPointer, Fragment Identifiers, and CSS Selectors. This could simplify the creation of filters and support transclusion (#81) and annotation of documents.

Examples

A simple example is extraction of links - in this case the selector is Link:

$ pandoc --select 'Link' -t markdown input.doc
[foo](http://example.org/)[bar](http://example.com/)
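
Note that --select is only the option proposed here, not an existing pandoc option. The same extraction can be approximated on the JSON form of the document (pandoc -t json). The following is a minimal sketch, assuming the JSON layout in which a Link is encoded as {"t": "Link", "c": [attr, inlines, [url, title]]}. The helper names walk and select_type are hypothetical, and the sample document is hand-written for illustration.

```python
import json

def walk(node):
    """Yield every tagged node (a dict with a "t" key) in document order."""
    if isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)

def select_type(doc, type_name):
    """Return all nodes in the document body whose tag equals type_name."""
    return [el for el in walk(doc["blocks"]) if el["t"] == type_name]

# Hand-written stand-in for the output of `pandoc -t json` (assumed layout).
doc = json.loads("""
{"pandoc-api-version": [1, 23], "meta": {}, "blocks": [
  {"t": "Para", "c": [
    {"t": "Link", "c": [["", [], []], [{"t": "Str", "c": "foo"}],
                        ["http://example.org/", ""]]},
    {"t": "Space"},
    {"t": "Link", "c": [["", [], []], [{"t": "Str", "c": "bar"}],
                        ["http://example.com/", ""]]}
  ]}
]}
""")

links = select_type(doc, "Link")
urls = [link["c"][2][0] for link in links]  # target is [url, title]
```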

Some ideas for selector syntax to select elements by their attributes, element type, and values:

#id
.class
foo=bar
Header{.class}
Link[url^=https://]
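
One way to think about these selectors is as predicates over single elements. The sketch below is an illustration only, not a proposed implementation: it assumes the pandoc JSON encoding in which an attribute triple is [id, classes, key-value pairs], and all function names (attr_of, has_id, has_class, has_attr, is_type, all_of, url_starts_with) as well as the ATTR_INDEX table are made up for this example. The table covers only a few element types.

```python
# Position of the attribute triple inside "c" for a few element types
# (assumed from the pandoc JSON encoding; not exhaustive).
ATTR_INDEX = {"Header": 1, "Div": 0, "Span": 0, "Link": 0,
              "Image": 0, "Code": 0, "CodeBlock": 0}

def attr_of(el):
    """Return (id, classes, key-value dict); empty values if el has no attributes."""
    idx = ATTR_INDEX.get(el["t"])
    if idx is None:
        return "", [], {}
    ident, classes, pairs = el["c"][idx]
    return ident, classes, dict(pairs)

def has_id(ident):            # #id
    return lambda el: attr_of(el)[0] == ident

def has_class(cls):           # .class
    return lambda el: cls in attr_of(el)[1]

def has_attr(key, value):     # foo=bar
    return lambda el: attr_of(el)[2].get(key) == value

def is_type(name):            # Header, Link, ...
    return lambda el: el["t"] == name

def all_of(*preds):           # Header{.class}: every condition must hold
    return lambda el: all(p(el) for p in preds)

def url_starts_with(prefix):  # Link[url^=https://]
    return lambda el: (el["t"] in ("Link", "Image")
                       and el["c"][2][0].startswith(prefix))

# Hand-written sample elements.
header = {"t": "Header", "c": [1, ["intro", ["class"], [["foo", "bar"]]],
                               [{"t": "Str", "c": "Intro"}]]}
link = {"t": "Link", "c": [["", [], []], [{"t": "Str", "c": "x"}],
                           ["https://example.org/", ""]]}

header_with_class = all_of(is_type("Header"), has_class("class"))
secure_link = all_of(is_type("Link"), url_starts_with("https://"))
```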

Selectors might be combined by alternatives:

Subscript | Superscript

To select larger parts of a document, a range operator or function is needed, for instance

Header[level=1][2] ... Header[level=1]

would select the second chapter of a document: everything from the second level-1 header (inclusive) up to, but not including, the next level-1 header. Inclusion of the end of a range could be expressed by a different syntax, e.g. A ...+ B.
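
The range semantics can be sketched over the top-level block list of the JSON form. This is only one possible reading of the proposal: select_range and its parameters (nth, include_end) are hypothetical names, and the assumption that an unterminated range runs to the end of the document is mine, not part of the proposal.

```python
def select_range(blocks, start, end, nth=1, include_end=False):
    """Return blocks from the nth match of `start` up to the next match of `end`.

    The end block is excluded unless include_end is true (the `A ...+ B` idea).
    If no end block follows, the range runs to the end of the list (assumption).
    """
    starts = [i for i, block in enumerate(blocks) if start(block)]
    if len(starts) < nth:
        return []
    first = starts[nth - 1]
    for i in range(first + 1, len(blocks)):
        if end(blocks[i]):
            stop = i + 1 if include_end else i
            return blocks[first:stop]
    return blocks[first:]

def header(level, text):
    """Build a hand-written Header element in the assumed JSON layout."""
    return {"t": "Header", "c": [level, ["", [], []], [{"t": "Str", "c": text}]]}

def para(text):
    return {"t": "Para", "c": [{"t": "Str", "c": text}]}

def is_h1(block):
    return block["t"] == "Header" and block["c"][0] == 1

blocks = [header(1, "One"), para("a"),
          header(1, "Two"), para("b"), header(2, "Sub"), para("c"),
          header(1, "Three"), para("d")]

# Header[level=1][2] ... Header[level=1]
chapter_two = select_range(blocks, is_h1, is_h1, nth=2)
# Header[level=1][2] ...+ Header[level=1]
chapter_two_plus = select_range(blocks, is_h1, is_h1, nth=2, include_end=True)
```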

nichtich commented 6 years ago

Additional ideas, some taken from CSS Selectors:

Attributes

Image{width<192}

Identifier

#myid 
#"my-id" 
{id=~regex}

Class

.class
."class"
{.class}
{class*=foo}

Properties

Not to be confused with attributes!

Header[level=1]

Attributes may be selected with the same syntax, but properties override attributes when selected with [...]. For instance, the Header

# xxx {level=99}

has the attribute level with value 99 but the property level with value 1.
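
In the JSON form of the document the two values live in different places, which makes the distinction concrete. The sketch below assumes the encoding {"t": "Header", "c": [level, [id, classes, pairs], inlines]}; header_property and header_attribute are hypothetical helpers, and the element is written by hand (the identifier "xxx" is illustrative).

```python
# Hand-written encoding of the header `# xxx {level=99}` (assumed layout).
header = {"t": "Header",
          "c": [1,                                # property: heading level
                ["xxx", [], [["level", "99"]]],   # attributes: id, classes, pairs
                [{"t": "Str", "c": "xxx"}]]}

def header_property(el, name):
    """Look up a structural property; only `level` of Header is handled here."""
    if el["t"] == "Header" and name == "level":
        return el["c"][0]
    return None

def header_attribute(el, name):
    """Look up a key-value attribute; values are always strings."""
    _ident, _classes, pairs = el["c"][1]
    return dict(pairs).get(name)

level_property = header_property(header, "level")    # what Header[level=...] would test
level_attribute = header_attribute(header, "level")  # what Header{level=...} would test
```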

Select by type of element

:document
:block
:inline
:meta

Negation

!Header
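
Alternatives, negation, and the :block / :inline type classes can all be expressed as combinations of element predicates. The sketch below is hedged accordingly: any_of, negate, is_type, is_block, and is_inline are hypothetical names, and the two type sets are my assumption about the constructors of a recent pandoc-types release, so they may not match every version.

```python
# Constructor names assumed from a recent pandoc-types release.
BLOCK_TYPES = {"Plain", "Para", "LineBlock", "CodeBlock", "RawBlock",
               "BlockQuote", "OrderedList", "BulletList", "DefinitionList",
               "Header", "HorizontalRule", "Table", "Figure", "Div"}
INLINE_TYPES = {"Str", "Emph", "Underline", "Strong", "Strikeout",
                "Superscript", "Subscript", "SmallCaps", "Quoted", "Cite",
                "Code", "Space", "SoftBreak", "LineBreak", "Math",
                "RawInline", "Link", "Image", "Note", "Span"}

def is_type(name):
    return lambda el: el["t"] == name

def any_of(*preds):
    """A | B: true if at least one alternative matches."""
    return lambda el: any(p(el) for p in preds)

def negate(pred):
    """!A: true if the selector does not match."""
    return lambda el: not pred(el)

def is_block(el):    # :block
    return el["t"] in BLOCK_TYPES

def is_inline(el):   # :inline
    return el["t"] in INLINE_TYPES

sub_or_sup = any_of(is_type("Subscript"), is_type("Superscript"))  # Subscript | Superscript
not_header = negate(is_type("Header"))                             # !Header

# Hand-written sample elements.
sup = {"t": "Superscript", "c": [{"t": "Str", "c": "2"}]}
para = {"t": "Para", "c": [{"t": "Str", "c": "x"}, sup]}
head = {"t": "Header", "c": [1, ["", [], []], [{"t": "Str", "c": "T"}]]}
```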

mb21 commented 6 years ago

From where I'm coming from, this probably doesn't belong in pandoc itself. But you could easily create a library (in your programming language of choice) that could then be used to write filters with the syntax you describe – or indeed a filter that does the extraction you mention.

Also, there are already various ways to do something like you describe:

use Haskell pattern matching in Haskell filters
pipe output of -t json to jq
pipe output of -t html to any DOM processor (e.g. nokogiri in Ruby)
pipe output of some XML-based format (e.g. -t docbook or HTML again) to an XPath implementation (like saxon)
etc.

nichtich commented 6 years ago

Thanks for the feedback and suggestions! XPath could help if there were an official XML serialization of the abstract syntax tree. Without native support in pandoc (more specifically, in pandoc-types), there is a risk of differing implementations. A selector language for the pandoc document model should not depend on a specific programming language or technology. At least a simple selector syntax is needed for #81 anyway, to specify parts of a document. Another use case is converting annotations between formats (e.g. comments in Word documents and annotations from services like hypothes.is): see fragment selectors in Web Annotation.

jgm commented 6 years ago

I wonder if it would be worth exploring adding some kind of select or filter function to the pandoc API (perhaps in Text.Pandoc.Walk)? It could take a function from elements to boolean as an argument, and all it would do is traverse the tree (preserving order) and remove every element where the function returns false.

This function could then be exposed in lua filters and perhaps ultimately in some kind of command line utility or option.
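
The behaviour described above (traverse in order, drop every element for which the function returns false) can be sketched on the JSON form of the AST. This is not the pandoc API: filter_elements is a hypothetical name, and the sketch reads the idea literally, so a dropped element takes its children with it. The predicate is also called on tag-only nodes such as quote or math types, so it should return true for anything it does not recognise.

```python
def filter_elements(node, keep):
    """Return a copy of node without tagged list members where keep() is false.

    Order is preserved; a removed element is dropped together with its children.
    """
    if isinstance(node, list):
        kept = []
        for item in node:
            if isinstance(item, dict) and "t" in item and not keep(item):
                continue  # predicate returned false: drop this element
            kept.append(filter_elements(item, keep))
        return kept
    if isinstance(node, dict):
        return {key: filter_elements(value, keep) for key, value in node.items()}
    return node  # strings, numbers, etc. are returned unchanged

# Hand-written sample in the assumed JSON layout.
image = {"t": "Image", "c": [["", [], []], [{"t": "Str", "c": "alt"}],
                             ["pic.png", ""]]}
blocks = [
    {"t": "Para", "c": [{"t": "Str", "c": "a"}, {"t": "Space"}, image,
                        {"t": "Str", "c": "b"}]},
    {"t": "Para", "c": [image]},
]

def not_image(el):
    return el["t"] != "Image"

without_images = filter_elements(blocks, not_image)
```

Whether an element that becomes empty (like the second paragraph here) should also be removed is left open in this sketch.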

jgm commented 5 years ago

@tarleb I was thinking about how this could be done with lua filters. Do we have anything corresponding to query (from Text.Pandoc.Walk) in the lua API?

tarleb commented 5 years ago

> @tarleb I was thinking about how this could be done with lua filters. Do we have anything corresponding to query (from Text.Pandoc.Walk) in the lua API?

Not yet, no. Adding a generic query function shouldn't be too difficult though.
