inconsistency for sections within fenced divs

jgm / pandoc

Universal markup converter

https://pandoc.org

inconsistency for sections within fenced divs #5761

Closed brainchild0 closed 4 years ago

brainchild0 commented 5 years ago

Fenced divs are documented in a way that suggests that, like their counterparts, fenced code blocks and fenced spans, they may occur only inside sections. But fenced divs are different, because it is easy to conceive of a document that violates this constraint.

# Beginning

::: exterior

At first...

:::

In the beginning...

::: interior

# Middle

So it continued...

:::

# Ending

::: exterior

And finally...

:::

In fact, for at least some output formats, the results are neither unsound nor unexpected.

$ pandoc doc.md -o -
<h1 id="beginning">Beginning</h1>
<div class="exterior">
<p>At first…</p>
</div>
<p>In the beginning…</p>
<div class="interior">
<h1 id="middle">Middle</h1>
<p>So it continued…</p>
</div>
<h1 id="ending">Ending</h1>
<div class="exterior">
<p>And finally…</p>
</div>

In other cases, though, the effects are more unsettling.

$ pandoc doc.md -o doc.epub --metadata title=Title

$ unzip -l doc.epub
Archive:  doc.epub
  Length      Date    Time    Name
---------  ---------- -----   ----
       20  2019-09-19 22:51   mimetype
      251  2019-09-19 22:51   META-INF/container.xml
      160  2019-09-19 22:51   META-INF/com.apple.ibooks.display-options.xml
     1439  2019-09-19 22:51   EPUB/content.opf
      909  2019-09-19 22:51   EPUB/toc.ncx
      576  2019-09-19 22:51   EPUB/nav.xhtml
      468  2019-09-19 22:51   EPUB/text/title_page.xhtml
      853  2019-09-19 22:51   EPUB/styles/stylesheet1.css
      560  2019-09-19 22:51   EPUB/text/ch001.xhtml
      462  2019-09-19 22:51   EPUB/text/ch002.xhtml
---------                     -------
     5698                     10 files

The fenced chapter was dropped.

The fenced regions within chapters create no problems:

$ unzip -c doc.epub EPUB/text/ch001.xhtml
Archive:  doc.epub
  inflating: EPUB/text/ch001.xhtml   
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <title>Beginning</title>
  <link rel="stylesheet" type="text/css" href="../styles/stylesheet1.css" />
</head>
<body epub:type="bodymatter">
<section id="beginning" class="level1">
<h1>Beginning</h1>
<p>At first…</p>
<div class="interior">
<h1 id="middle">Middle</h1>
<p>So it continued…</p>
</div>
</section>
</body>
</html>

While some may question whether the earlier result of wrapping the chapter inside a div is desirable, dropping chapters with no warning plainly is undesirable. This raises the question of whether to allow this case and, if so, how to treat it consistently across output types.

Special cases might also lead to unpredictable effects:

# One
## One A
::: fence
## One B
# Two
:::

If the output were to wrap each top-level section in a separate XHTML block, as might be a future enhancement, then the solution for constructing a document tree is ambiguous.

The utility of fencing sections or groups of sections may seem dubious. Header attributes are already supported and can, it might seem, be used for the same effect. Header attributes further avoid the problems described above.

But allowing fenced sections or groups of sections and standardizing their handling may be preferred to prohibiting them. And such may prove convenient in some cases.

Suppose the fenced divs simply cause enclosed sections to inherit their attributes. For contexts in which position is relevant, fenced divs may be easier to manage:

::: group1
# One
# Two
:::
::: group2
# Three 
# Four
# Five
:::
# Six {.group3}
# Seven {.group3}
# Eight {.group3}

Enclosed divs would create the same result for One through Five as header attributes do for Six through Eight.
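
To make the proposed inheritance rule concrete, here is a minimal sketch in Python. It is not pandoc code: the `inherit_classes` function and the `HEADER` pattern are hypothetical, and the sketch assumes that fences are opened with bare class names (`::: group1`) and closed with a bare `:::`.

```python
import re

# Hypothetical pattern: "# Title" with an optional "{.a .b}" attribute block.
HEADER = re.compile(r"^(#+)\s+(.*?)\s*(?:\{([^}]*)\})?\s*$")

def inherit_classes(lines):
    """Return (title, classes) pairs, giving each header the classes of
    every fenced div that is open where the header appears."""
    open_divs = []   # stack of class lists, one entry per open fence
    result = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(":::"):
            names = line[3:].split()
            if names:
                open_divs.append(names)   # "::: group1" opens a div
            elif open_divs:
                open_divs.pop()           # bare ":::" closes the innermost div
            continue
        match = HEADER.match(line)
        if match:
            own = [c.lstrip(".") for c in (match.group(3) or "").split()]
            inherited = [c for div in open_divs for c in div]
            result.append((match.group(2), inherited + own))
    return result
```

Under this rule a header inside `::: group1` ends up with the same class list as a header written with `{.group1}`, which is the equivalence claimed for One through Five versus Six through Eight.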

If this handling occurs before the special handling of the unnumbered class, some may find it especially convenient:

::: unnumbered
# Preface
# Introduction
:::
# Beginning 
# Middle
# Ending

Or even:

::: {-}
# Preface
# Introduction
:::
# Beginning 
# Middle
# Ending

Practically, such use may not be preferable for large documents. But in principle it opens compelling possibilities while eliminating unwanted ones.

Of course, some cases emerge in which enclosed divs facilitate effects not readily achieved using header attributes.

A novel organized in three parts might be expressed as:

::: part
# One
# Two
:::
::: part
# Three 
# Four
# Five
:::
::: part
# Six
# Seven
# Eight
:::

Unlike groups above, the numbering of each part is not static in the document, but resolved by position, just like section numbering. The nearest equivalent with header attributes would be:

# One {.newpart}
# Two
# Three {.newpart}
# Four
# Five
# Six {.newpart}
# Seven
# Eight

I would request that this issue be designated a discussion for feature enhancements, but at a minimum, the EPUB handling appears to be a bug.

brainchild0 commented 5 years ago

The updates definitely change the behavior.

It might be useful to understand which behavior has been chosen for adoption.

It looks as though the intention may be to adopt the suggestion that attributes from a fenced div propagate to the enclosed sections. But it looks as though the changes primarily target EPUB. It also looks as though the effect is slightly different depending on whether the attribute comes from the fence or the header.

I haven't tested with other writers. At this point, I don't quite know what to test.

Would it be helpful first to resolve, or simply to explain, the intended function of fenced divs that surround sections? At least that was my intention.

jgm commented 5 years ago

Previously the epub writer used some custom code to split into chapters. I've changed it so it reuses, as much as possible, the makeSections code that is also used with --section-divs in HTML and in many other writers (e.g. docbook). So this should make things more consistent.

Attributes on divs are retained in the chapters as attributes on <section>.

Attributes on headings are unchanged, except that (a) if --number-sections is added, a "number" attribute is added, and (b) the id attribute can in some circumstances jump from the heading to an enclosing div/section.
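
For readers unfamiliar with section splitting, the general idea of grouping a flat block list into nested sections can be sketched as follows. This is a simplified Python model, not pandoc's makeSections: the `group_sections` name and the tuple representation of blocks are assumptions, and divs, attributes and numbering are ignored entirely.

```python
def group_sections(blocks):
    """Group a flat list of blocks into nested sections.

    Blocks are tuples: ("Header", level, title) or ("Para", text).
    A header of level n opens a section that runs until the next header
    of level <= n, or until the end of the list.
    """
    root = {"title": None, "level": 0, "content": []}
    stack = [root]   # sections currently open, outermost first
    for block in blocks:
        if block[0] == "Header":
            _, level, title = block
            # Close every open section at the same or a deeper level.
            while stack[-1]["level"] >= level:
                stack.pop()
            section = {"title": title, "level": level, "content": []}
            stack[-1]["content"].append(section)
            stack.append(section)
        else:
            stack[-1]["content"].append(block)
    return root["content"]
```

Blocks that precede the first header stay at the top level, outside any section.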

brainchild0 commented 5 years ago

I understand this explanation. Reusing the logic for EPUB and HTML writers seems like a solid improvement.

The more natural location for the attribute classes seems to be the <section> elements, compared with the header, in every case. Also, using the <section> wrapper regardless of whether any attributes are given would seem to improve consistency, reliability, and usability.

Such considerations may be separate issues. But in keeping with the theme of reuse as a means of consistency, would it make sense for now simply to introduce one further change: first build a set of attributes from the union of the enclosing fenced attribute context and the section's own attributes, and then apply all of those attributes in some uniform way to the contents of the section? At the moment, as we see, attributes are applied differently depending on their source. With such a change, any revisions to the downstream logic that constructs the output would necessarily apply to all attributes equivalently, regardless of their source.

The question of where the attributes are placed, such as in the header versus section, can later freely be changed, without unexpected effects, or need of revisiting this issue.

Relating to subsections, a few issues could be discussed:

Do subsections inherit attributes from a fenced context, the same as top-level sections?
If a section inherits an attribute from the fenced context, do subsections inherit the attribute such that it is explicitly applied to the subsection separate from the parent section?
Are key/value attributes handled similarly? If a section or nested fenced context assigns a different value to a key already in use from an outer fenced context, is the earlier value restored to the previous value after the inner context concludes?

jgm commented 5 years ago

Nothing "inherits" attributes here. Rather, when there are fenced divs in the input, they become the chapter's content. (The only change made is the possible addition of an id which moves from the header.) Divs that constitute subsections are not affected.

I think there will always be a section wrapped around the chapter contents, except in one case: when the document does not begin with a header. (In that case there will be no section wrapper around the initial content, which goes in a chapter by itself.) Not sure what the best approach is here.

brainchild0 commented 5 years ago

My main intention was to express the idea that attributes from a fenced div surrounding a section would have the effect of augmenting the attributes associated with that section, as though they appeared in the list of header attributes. As such, one could get the effect of attaching the same attributes to a group of headers, but without explicitly writing the attributes in each header. The approach also solves the problem of what to do with fenced divs that are not contained inside the body text of a single section.

I thought we were on the same page. Or have we diverged?

jgm commented 5 years ago

I am describing the way it currently works. It does not "inherit." You, it seems, would like it to inherit. I disagree that this is desirable.

As for text not contained inside a section, this will only occur for text at the beginning of a document before the first header.

brainchild0 commented 5 years ago

> I am describing the way it currently works. It does not "inherit." You, it seems, would like it to inherit. I disagree that this is desirable.
>
> As for text not contained inside a section, this will only occur for text at the beginning of a document before the first header.

These questions may be more specific than the main question I intended to raise.

The core of the idea is that

::: attr1
# Head A { .attr2 }
# Head B { .attr3 }
:::

might be interpreted similar to the way that

# Head A { .attr1 .attr2 }
# Head B { .attr1 .attr3 }

is interpreted.

But it seems like the change you made is not quite following this path.

I think at the moment your idea is that, at least in the case of XHTML-family outputs, the div fence causes the construction of a new document block (i.e. element node) that encloses a section. I think this approach may not work well for a variety of reasons. And one is that it does not generalize well to EPUB because of the impossibility of wrapping multiple XHTML files in a single element block. It also may not generalize as readily to document types outside the XHTML family.

So I suggested moving away from the idea that the effect of a div fence enclosing a section or group of sections is that the section or group is wrapped by an enclosing element, and toward the idea that the sections themselves are modified instead of wrapped.

The intention was to avoid a variety of unpleasant possibilities while at the same time creating new ones that might be pleasant.

You may not like this idea, but I don't particularly see the way it currently works as solving the problems in a graceful way. Conditionally wrapping groups of sections in auxiliary blocks breaks many of the core assumptions that would be made about the document structure, and becomes especially unwieldy in the nested case. More generally, document structure within body text is different than document structure at the higher levels of the document tree, so generalizing the former case to the latter might demand caution at least.

jgm commented 5 years ago

I agree that things aren't yet optimal with

::: foo
# Head A
# Head B
:::

Currently this creates three chapters: the first, empty, the second with a section with class foo, the third with a section without class foo. Certainly the empty chapter should not be created, and I need to fix that. I am also positive about the idea of propagating the div attributes over chapters in this case; there's no good reason why only the 'Head A' chapter should have foo.

brainchild0 commented 5 years ago

For me the single most compelling reason for giving the attribute of the enclosing fenced div to the section/chapter is that currently the most likely alternative is to wrap the sections in a new block created specifically for the fenced div. This alternative seems to be aligned to your current thinking. Yet, such ad-hoc creation of blocks that enclose sections, I think, is problematic. Giving the attribute to the section makes it possible for the attribute to affect the section without the section being enclosed by a special block created specifically for the attribute.

Of course, such reasoning does not preclude any additional possibility that might be presented.

But between the two approaches, I think, the former is less likely to cause headache.

The output for EPUB, with your commit applied, appears to operate to some degree following the principle you might be dismissing. So perhaps you would grant that it at least carries favor in the special case of EPUB output?

jgm commented 5 years ago

Re-opening while I fine-tune this.

brainchild0 commented 5 years ago

From the commit:

Thus
::: foo
:::
puts the class foo on the section divs for both A and B.

Was this intended to read:

::: foo
# A
# B
:::

Or am I missing something?

jgm commented 5 years ago

Yes, left that out somehow.

brainchild0 commented 5 years ago

OK, getting close.

There are a variety of special cases that will probably need to be considered eventually. Examples would be split sections, and subsections. You might not want to do so in this issue particularly.

But running a quick test:

# One

Apple

::: { .a x=1 }

::: { .b x=0 }

::: c

# Two { .d }

Banana

:::

:::

:::

# Three

Orange

With default settings goes to:

<h1 id="one">One</h1>
<p>Apple</p>
<div class="a" data-x="1">
<div class="b" data-x="0">
<section id="two" class="c">
<h1 class="d">Two</h1>
<p>Banana</p>
</section>
</div>
</div>
<h1 id="three">Three</h1>
<p>Orange</p>

Two observations:

Can a, b, and c all get collected into the <section> classes? Similarly, can the x values get consolidated so that the 0 value takes precedence?
Can this set get consolidated with d? I don't see why these sets should be distinct. Seems like they should get merged, for example, in the AST?
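
The consolidation asked about here can be sketched in a few lines of Python. The `merge_attrs` function is hypothetical and only illustrates one possible rule: classes are collected in order from the outermost layer inward, and for key/value attributes the innermost value wins.

```python
def merge_attrs(*layers):
    """Merge (classes, key_values) layers, listed from outermost to innermost."""
    classes, key_values = [], {}
    for layer_classes, layer_kv in layers:
        for name in layer_classes:
            if name not in classes:   # keep first occurrence, preserve order
                classes.append(name)
        key_values.update(layer_kv)   # an inner value replaces an outer one
    return classes, key_values

# The three nested fences and the header from the test document above.
merged = merge_attrs(
    (["a"], {"x": "1"}),
    (["b"], {"x": "0"}),
    (["c"], {}),
    (["d"], {}),
)
```

For the test document this yields the classes a, b, c and d together with x set to 0.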

jgm commented 5 years ago

I'm still not convinced that this merging (even the limited version in the branch commit above) is a good idea. I can imagine all kinds of unexpected consequences. E.g.

::: #part1 part
::: #sect1 section
# A
:::
::: #sect2 section
# B
:::
:::

With merging, the two section divs would end up having the class part (and section). Surely not intended.

I'm somewhat inclined to keep things simple and treat Divs that don't have the internal structure (Header + list of non-header blocks) as indivisible units for purposes of section splitting.

brainchild0 commented 5 years ago

Yes, I understand the reluctance. I admit it seems strange, especially at first.

Perhaps it is helpful to enumerate as fully as possible the scenarios in which a block enclosing multiple output sections is necessary or even desired, considering the semantic meaning of the input.

How best to represent parts of a book in a source document has been a subject I have considered. I have come to find doubtful the feasibility of enclosing an entire book part in a block in the source document, despite the conceptual simplicity of doing so. Unless one's editor is syntax aware, it seems unwieldy and error-prone to manage block header and footer pairs at opposite ends of a length of text equivalent to tens or hundreds of pages of printed matter. If I consider a means of representing part boundaries that is optimal from an author's standpoint, I am more inclined to think about tagging inline a chapter (as with a header tag) as the first or last in the part, or about including in the metadata a list of chapter identifiers representing part boundaries.

For example:

---
parts: 
  -
    title: Beginning
    id: beg
    start: one
    end: three
  -
    title: Middle
    id: mid
    start: four
    end: six
  -
    title: End
    id: end
    start: seven
    end: nine
---

# Introduction {-}
# One
# Two
# Three
# Four
# Five
# Six
# Seven
# Eight
# Nine
# Appendix {-}

Unlike blocks of quoted text in a section body, book parts are not formatted separately before being merged into surrounding elements. Aside from organizational effects in the table of contents, a book part is largely just a sequence of chapters, the first of which is preceded by a page showing a prominently formatted header.
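
Resolving such metadata into a chapter-to-part mapping is straightforward. The Python sketch below is only an illustration of the idea: the `assign_parts` function is hypothetical, and it assumes that chapter identifiers match the `start` and `end` values in the metadata and that parts do not overlap.

```python
def assign_parts(chapter_ids, parts):
    """Map each chapter id to the id of the part containing it, or None."""
    starts = {part["start"]: part for part in parts}
    assignment = {}
    current = None   # the part currently open, if any
    for chapter in chapter_ids:
        if chapter in starts:
            current = starts[chapter]
        assignment[chapter] = current["id"] if current else None
        if current and chapter == current["end"]:
            current = None   # the part ends with this chapter
    return assignment

parts = [
    {"title": "Beginning", "id": "beg", "start": "one", "end": "three"},
    {"title": "Middle", "id": "mid", "start": "four", "end": "six"},
    {"title": "End", "id": "end", "start": "seven", "end": "nine"},
]
chapters = ["introduction", "one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine", "appendix"]
mapping = assign_parts(chapters, parts)
```

Chapters outside every part, such as the introduction and the appendix, map to None.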

Much of the difficulty, I think, with the topic of enclosing sections in fenced divs arises because fenced divs follow a different design, compared to that of the elements in basic Markdown. As sections and subsections in Markdown have no explicit terminator, the parser is never able to complain that the input is unusable or even ambiguous because of a block terminator being mismatched to the existing nested context.

If we tried to conceive of a design for fenced divs that follows this pattern, then we might consider one in which each line beginning with ::: simply resets the active set of classes to whichever are immediately specified on that line.

For example, if "bananas" are enclosed by the "tropical" fencing, nested in the larger "fruit" fencing, we could use:

# Food

::: fruit 

apples

::: fruit tropical

bananas

::: fruit

pears

# Clothes

etc.

While the above is not explicitly showing two levels of nested blocks, the class tags provide enough information to resolve that "bananas" is enclosed in a region with the "tropical" class, nested within a larger region with the "fruit" class.
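
A minimal Python sketch of this "reset" reading follows. It models the hypothetical syntax proposed here, not anything pandoc implements; the `tag_paragraphs_reset` name is invented, and the sketch assumes that a header ends any active fence.

```python
def tag_paragraphs_reset(lines):
    """Pair each paragraph line with the classes active where it appears.

    Every line starting with ":::" replaces the active class set with the
    classes named on that line; a header clears the set.
    """
    active = []
    tagged = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":::"):
            active = line[3:].split()   # reset; a bare ":::" clears the set
        elif line.startswith("#"):
            active = []                 # a new section ends any fence
        else:
            tagged.append((line, list(active)))
    return tagged
```

Applied to the example, "bananas" carries both fruit and tropical, while "apples" and "pears" carry only fruit.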

Or, if the above lacks an intuitive feel, we could consider a slightly different approach:

# Food

::: +fruit

apples

::: +tropical

bananas

::: -tropical

pears

# Clothes

etc.

In both cases, the explicit terminator is considered to be required only when the termination is not occurring at the end of the section, which is otherwise the assumption. No ambiguity arises about whether the intention is for the fenced region to contain, or rather be contained by, a section, because only the latter possibility is allowed.
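
The second variant can be sketched the same way. Again this is only a Python model of the hypothetical `+name` / `-name` syntax proposed here; the `tag_paragraphs_delta` name is invented, and tokens without a `+` or `-` prefix are simply ignored.

```python
def tag_paragraphs_delta(lines):
    """Pair each paragraph line with the classes active where it appears.

    "::: +name" adds a class, "::: -name" removes it, and a header
    clears the whole set.
    """
    active = []
    tagged = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":::"):
            for token in line[3:].split():
                name = token[1:]
                if token.startswith("+") and name not in active:
                    active.append(name)
                elif token.startswith("-") and name in active:
                    active.remove(name)
        elif line.startswith("#"):
            active = []   # termination at the end of the section is implicit
        else:
            tagged.append((line, list(active)))
    return tagged
```

Because the class set is cleared at each header, a fenced region in this model can never contain a section, only be contained by one.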

Of course the feasibility and popularity of replacing an entrenched feature seems doubtful.

But perhaps such thinking helps to resolve how best to handle the various cases relating to the existing approach.

brainchild0 commented 5 years ago

How does the reader currently translate fenced divs into the AST?

I notice that the AST is defined without any node type for section, leading one to consider the possibility that divs can meaningfully surround sections as well as text blocks within sections. I'm wondering whether thinking in terms of sections leads to a clearer abstraction about how to distinguish between the way that text blocks are grouped in a document, compared to how sections are grouped.

brainchild0 commented 4 years ago

Lacking an understanding of the Haskell language and of the design of the codebase, I am extrapolating from the unit tests that the recent commit changes the behavior as follows: in the special case that a fenced div in the input document encloses a single section, the div in the output document encloses only the body of that section; otherwise, the div in the output document encloses the multiple sections enclosed by the div in the source.

Is this representation accurate?

I have been studying and considering the Pandoc AST in the time since the original submission, and have come to wonder whether the use of divs to enclose sections, rather than simply to enclose blocks within a section, is an optimal approach to capturing the abstract structure of documents.

More generally, a node type representing a section strikes me as a conspicuous omission, and I am further wondering whether its inclusion might resolve a variety of related questions with respect to handling attributes that apply to an entire section or group of sections.

Notice that one possibility afforded by the introduction of section nodes is to enforce a distinction between which node types might descend from versus be descendants of a section node. As the current structure provides no native representation of a section, it seems possible to imagine absurd cases being represented within a valid AST as currently defined, such as headers appearing in list entries.

A section node type also would make certain kinds of document transformations easier to implement. Currently a section is merely a sequence of contiguous blocks, and that grouping must be resolved by the application.

Writing a simple filter that, for instance, drops all sections below level two, requires a greater degree of explicit logic currently than would be necessary if the relevant grouping of blocks already appeared in the tree hierarchy.
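
To illustrate the kind of explicit logic meant here, the following Python sketch works on a flat block list of the sort the current AST provides. The tuple representation and the `drop_deep_sections` name are assumptions made for the example; the point is that the extent of each section has to be inferred by scanning with state.

```python
def drop_deep_sections(blocks, max_level=2):
    """Remove every section whose header level exceeds max_level.

    Blocks are tuples: ("Header", level, title) or ("Para", text).
    The extent of a section is not stored anywhere, so the scan must
    remember whether it is currently inside a dropped section.
    """
    kept = []
    skipping = False
    for block in blocks:
        if block[0] == "Header":
            # A header either starts a dropped section or ends one.
            skipping = block[1] > max_level
        if not skipping:
            kept.append(block)
    return kept
```

With an explicit section node, the same transformation would reduce to discarding subtrees by depth.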

Is the structure of the Pandoc AST taken from earlier precedents in other work, or is it a novel invention?

jgm commented 4 years ago

> such as headers appearing in list entries.

Yes, they are allowed. The structure of the Pandoc AST is based around Markdown, which has allowed stuff like this. It would have been different if it had been based around, say, DocBook, but then certain Markdown documents would have been unrepresentable.

brainchild0 commented 4 years ago

I see.

I can't understand why Markdown allows this case, other than being designed to be permissive and optimistic.

This case appears to be special, and to conflict with abstract semantics and with other formats, and as such, I would not consider handling it to be obligatory.

Regardless of this detail, I find that as I review the comments above and consider the cases of various formats, I wonder whether the AST could serve its function more optimally if it more closely captured the abstract semantics of documents, rather than seeking to capture serialization strategies, particularly that of Markdown.

Markdown has no explicit section terminator, pursuant to its design goal of minimal user overhead. Yet Markdown is foremost a document serialization strategy, not an abstract document structure. Sections are in fact terminated, though not explicitly, by the end of the file or by the beginning of a new section. While a sequence of header blocks interleaved with text blocks accurately represents the serialized format of Markdown documents, it does not represent their abstract document structure. For the sake of contrast, consider HTML5, which features a more strictly hierarchical form.

I know that the AST was probably carefully considered earlier and I accept that updating it is likely not viewed as practical or appealing, but it might be well to consider that the MD-centric view of documents may introduce limitations both in conversion and transformations.

In the current case, notably, sections in documents might typically be considered having an ancestry path to the document root that has a relatively fixed definition, comprising its supersections, but not other document elements that are normally found inside body text.

kysko commented 4 years ago

@brainchild0 I'm curious about what you mean exactly by "a node type representing a section", or "section node", at the level of the AST. What would it look like, say, as a native output?

Would it be something like:

Sec(<attr>) [ < list of Headers and other Blocks > ]

where <attr> would be some list of "attributes" (not necessarily like current Attr) describing options, states, etc.?

If so, how would this be different from:

Div(<attr>) [ Header <level> (<hd attr>) [Str <title>],
   <rest of list of Blocks>
 ]

I'm not saying Div's can't be abused and misused. But can they be used effectively just like what you'd consider a "section node", when properly done?

Doesn't the Div/Head coupling made by makeSection (or make_sections in Lua) "enforce a distinction between which node types might descend from versus be descendants of a section node" like you said, when "numbering" is enabled?

If not, can you give an example where this wouldn't work the same way?

brainchild0 commented 4 years ago

> Would it be something like:
> Sec(<attr>) [ < list of Headers and other Blocks > ]
> where <attr> would be some list of "attributes" (not necessarily like current Attr) describing options, states, etc.?

Something like that.

I had in mind that a section might have 1) an attribute set, 2) up to one header, 3) zero or more paragraph blocks, and 4) zero or more sections (recursive type).

It could be made more formal if the paragraph blocks and sections (subsections) were captured by another type, a section body. The advantage of this approach is that the document type can then be redefined as metadata and a section body, while representing analogous structure to the old type definition. A section is then a header, an attribute set, and section body. This abstraction facilitates transformations and processing, as the document and sections share in common that they are parents of a section body.

You could tag a section by its depth, but really this value is just derived from its path to the root.
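
As a rough illustration of this shape, here is a sketch using Python dataclasses. None of these types exist in pandoc: `SectionBody`, `Section`, `Document` and `depths` are hypothetical names, and blocks are reduced to plain strings to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SectionBody:
    blocks: List[str] = field(default_factory=list)          # paragraph-level blocks
    sections: List["Section"] = field(default_factory=list)  # subsections

@dataclass
class Section:
    header: Optional[str] = None                 # up to one header
    attrs: dict = field(default_factory=dict)    # attribute set
    body: SectionBody = field(default_factory=SectionBody)

@dataclass
class Document:
    meta: dict = field(default_factory=dict)
    body: SectionBody = field(default_factory=SectionBody)

def depths(body, depth=1):
    """Yield (header, depth) pairs; depth is derived from the path to the root."""
    for section in body.sections:
        yield section.header, depth
        yield from depths(section.body, depth + 1)
```

Both `Document` and `Section` own a `SectionBody`, so a single traversal such as `depths` serves the document and any subsection alike.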

> If so, how would this be different from:
> Div(<attr>) [ Header <level> (<hd attr>) [Str <title>],
>    <rest of list of Blocks>
>  ]

Similar, I suppose, but realizing the objective that divs (and other types representing text blocks) occur only within sections, particularly within the body text of sections, and also giving sections their own type, rather than leaving to the application to infer that a div is effectively a section because of having certain properties, like a leading header in its body.

Divs have no particular significance, but for how their attributes are used by the target format. All of the significance relating to sectioning currently rests in the position of the headers, which means even in the above structure, a section is a div that happens to lead with a header.

A clearer representation of a document structure might rather hold that a section is not a div that contains a header, but is rather its own type, and the only type that can contain a header.

> I'm not saying Div's can't be abused and misused. But can they be used effectively just like what you'd consider a "section node", when properly done?

Maybe, but the stricter form would be designed to enforce "properly done", rather than admitting as valid the cases you would characterize as misuses.

> Doesn't the Div/Head coupling made by makeSection (or make_sections in Lua) "enforce a distinction between which node types might descend from versus be descendants of a section node" like you said, when "numbering" is enabled?

I haven't investigated, but considerable value rests in enforcing the distinction by the data definition, not simply a particular implementation of the application logic.

Don't forget that the core architectural feature of Pandoc is the AST, with not only filters but readers and writers being modular and in principle used as custom or third-party add-ons. As such, the strength and robustness of the entire system stands or falls on whether the AST design makes it easy for these components to behave well, with respect to processing documents with desirable form, and difficult for them to behave badly.

> If not, can you give an example where this wouldn't work the same way?

Not sure whether I already addressed this question to your satisfaction, but further to the above, certain processing and transformation operations are considerably simplified if sections are provided as an explicit type, not merely a region that is inferred from rules about the relations among other types, such as body blocks and headers.

Imagine the earlier example of a filter that drops all sections lower than level 2. In fact this filter should be trivial to write, but the current rules relating to resolving the extent of a section make it needlessly difficult. Many problems harder than this one might become clearer if the hierarchy of the types represented directly the section-level relationship between the parts of the document.

kysko commented 4 years ago

I don't doubt that a dedicated element would be "cleaner", be somewhat "self-sufficient" rather than relying on pandoc's internal logic.

The Div/Header combination isn't as pure, but it's a solution that uses what we're already used to, familiar elements, without having to create yet another element. Like you, I'm not into Haskell, so I don't know what such a change in the AST would involve for Readers and Writers, or whether the effort would really be beneficial in comparison to what we have now.

I don't think we see improper use of Div the same way. I have no problem with divs enclosing whole sections; sure, that could be done through adding classes to the sections, but it's convenient to have a quicker way. Misuse would be like one of your example in your first post (with the div.fence), but that's the problem of the user.

I haven't given as much thought to it as you did though, I'm not trying to minimize the subject. I'm not an expert on AST's or extensive technical details about document structures. But some concrete example might convince better the developers who have to decide whether it's worth the time that they'll need to spend on this. 😉

Anyway... I delayed this answer to see about your example to drop secs lower than level 2 (and also because you were editing!), I opted for lower than level 3 because I had a document handy, and it would show better with this level:

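-- using a recent nightly pandoc, with make_sections
local droplevel4 = {
  Div = function(d)
    -- if div has class "section", we know d.content[1] is the section header
    if d.classes:includes("section") and d.content[1].level > 3 then
      return {}  -- returning an empty list deletes the whole section
    end
  end
}

-- a Pandoc filter function receives only the document; metadata is doc.meta
function Pandoc(doc)
  local blocks = pandoc.utils.make_sections(true, 1, doc.blocks)
  -- wrap the blocks in a temporary Div so walk_block can traverse them
  local wrapped = pandoc.walk_block(pandoc.Div(blocks), droplevel4)
  -- unwrap again so the output is not enclosed in an extra Div
  return pandoc.Pandoc(wrapped.content, doc.meta)
end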

Is this what you had in mind? ok, it's not "trivial", not a one liner, but not overly complicated. In principle, in droplevel4, one should check if d.content[1] is really a header though, in case of a more complicated script that may modify this, which wouldn't have to be done for your "section nodes" I guess... but I don't mind the price right now.

brainchild0 commented 4 years ago

> I don't think we see improper use of Div the same way.

No we don't, and I realize I am in the minority, indeed in this discussion, alone.

> I have no problem with divs enclosing whole sections; sure, that could be done through adding classes to the sections, but it's convenient to have a quicker way. Misuse would be like one of your example in your first post (with the div.fence), but that's the problem of the user.

In your above suggestion I see the pattern that I identified earlier, which is the conflation of how items are given in a source document versus the constraints provided for the AST. Enclosing sections in divs might be convenient in the source document, which is a context in which convenience may be appropriate. The AST and target document, however, rather ought to favor clarity and consistency of structure. Moreover, while any source document, especially in a permissive language like Markdown, might not unambiguously resolve an input set to a meaningful document structure, the purpose of the AST ideally is to embody the set of constraints that separates data sets that capture meaningful documents from those that do not.

Like you, I am willing to include certain features in the input document because they are convenient, but I prefer to avoid carrying the effects of those conveniences into the generated output, and feel that the AST functions best when rigidly constrained. And like you, I am willing to see certain input documents as a "problem for the user", but I would wonder whether faulty AST instances might sooner be considered a problem, or at least a design shortcoming, of the AST definition.

Is this what you had in mind? ok, it's not "trivial", not a one liner, but not overly complicated. In principle, in droplevel4, one should check if d.content[1] is really a header though, in case of a more complicated script that may modify this, which wouldn't have to be done for your "section nodes" I guess... but I don't mind the price right now.

The thought experiment I was giving was intended to be different from the problem you solved, though the distinction was not given explicitly. I was considering the ease with which a code snippet can produce this transformation without reliance on external functions. The challenge is to consider whether the relevance of a node in relation to the conceptual features of a document can be resolved easily by its physical position in the tree, particularly its ancestry path. If so, then the metaphors of a tree structure are used to good effect in the AST definition. If not, then applications become bloated with dependencies or extra logic to complete this computation, as when your solution offloads this work to functions provided by the pandoc package.

If you consider a tree expression language, like the many inspired by XPath or XQuery, you can understand that it is easy to write a filter rule that keeps a node only if its type is not a section or it has fewer than three ancestors of type section. The close mapping of the physical position of a node in the tree to its semantic function is the source of the rule's graceful clarity. Such is essentially the logic of your droplevel4 function, with some notable differences, but with the glaring dependency on the make_sections and walk_blocks functions in order to have any value.
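A minimal Python sketch of that rule, under the assumption that sections are real container nodes. The `Node` class and the `drop_deep_sections` function are hypothetical names invented for this illustration and are not part of pandoc. Because a section's depth is simply its count of section ancestors, the rule needs no helper that first reconstructs the hierarchy from a flat header list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; kind is e.g. "doc", "section" or "para"."""
    kind: str
    label: str = ""
    children: List["Node"] = field(default_factory=list)


def drop_deep_sections(node: Node, section_ancestors: int = 0) -> Node:
    """Keep a child unless it is a section with three or more section ancestors."""
    # Number of section ancestors that each child of this node has.
    below = section_ancestors + (1 if node.kind == "section" else 0)
    kept = [
        drop_deep_sections(child, below)
        for child in node.children
        if child.kind != "section" or below < 3
    ]
    return Node(node.kind, node.label, kept)


def section_labels(node: Node) -> List[str]:
    """Collect the labels of all section nodes, in document order."""
    own = [node.label] if node.kind == "section" else []
    return own + [lbl for child in node.children for lbl in section_labels(child)]


doc = Node("doc", children=[
    Node("section", "1", [
        Node("para", "intro"),
        Node("section", "1.1", [
            Node("section", "1.1.1", [
                Node("para", "deep text"),
                Node("section", "1.1.1.1", [Node("para", "dropped")]),
            ]),
        ]),
    ]),
])

pruned = drop_deep_sections(doc)
print(section_labels(pruned))
```

In this toy tree only the fourth-level section is removed; the paragraph that sits directly inside the third-level section survives, since the rule filters sections and leaves body text alone.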

kysko commented 4 years ago

First, I'll try again (imperfectly) to understand what you mean for the AST. You'd like the Readers to parse the texts so that the headers be mapped to something symbolically like

Sec(<Header block>, <attr>)[ <list of non-Sec Blocks >, < list of Sec Blocks >]

or more simply

H(<attr>) [ <list of non-H Blocks >, < list of H-Blocks >]

so that we immediately have a hierarchy of levels corresponding to the original headers. If not, I'd like to see what you'd expect.

As for the divs...

Imagine such a situation:

::: Intro
# Intro blurb
etc...
:::

::: LeftLatin
# Some Latin header
## Some other Latin header
etc...
etc...
etc...
:::

::: RightEnglish
# Some English header
## Some other English header
etc...
etc...
etc...
:::

where, after the intro, there would be left and right side texts for comparison. Sure, there are other ways to achieve this, but I think it's more than simply convenient, and I don't see (as I've not researched the subject profoundly) how this wouldn't map correctly, elegantly, with clarity and consistency in the AST. The divs might not be (or not seem to be) of the same kind of hierarchical nature as the headers, but they do act as "parent nodes" internally, and I don't see where the problem is. Again, I might have misunderstood what you meant, I'm sorry.

As for the code example...

Ok, I don't know XPath or XQuery, and a few glances here and there do show some simplicity and elegance in the query language used, but I wouldn't be surprised if the underlying implementations have their own kind of walk_block! Ok, lua and pandoc's exported functions might not be as graceful, but we might see them as more low-level, hence more work to get where we want. Perhaps some "library"/lua helper functions might eventually give something similar.

I want to emphasize here that I don't have some deep knowledge in those fields, so I'm not presuming that I'm right.

jgm commented 4 years ago

For what it's worth, I agree that it would be better for some purposes to have a Section container in the AST. I didn't do that originally because I was just trying to represent the structure of Markdown. The current use of Divs for this is somewhat of a hack. At some point in the future we might consider adding a Section container (and this could be proposed on jgm/pandoc-types). But AST changes are very painful breaking changes, which require modifications all through the code base and all through the pandoc ecosystem, so I have been very conservative about them.

brainchild0 commented 4 years ago

@jgm I fully agree that a new AST design is not a casual matter, and it was not my purpose here to prompt a sudden change. I mentioned the possibility that section types are introduced into the AST for a few reasons:

To assess how the AST design might evolve in the long term.
To determine how closely the current design represents a completely abstract document structure.
To consider how the Markdown and other format-specific representations of a document might best reflect either the current AST design or another possible abstract representation of a document.

This issue was originally opened to discuss the concrete, format-specific representations, but I began to wonder whether the conversation would be clearer if a better separation could be made of a single abstract structure from the various concrete representations of supported formats.

When I consider an abstract document structure, I think that a very different set of node types feature above a section level versus below a section level, in the body text, as sections and text are entirely different categories of items. The block and inline node types seem necessary and correct to feature within body text, as such is the usual way, and the best one I know, to present a sequence of text such that varying subsequences have specific formatting demarcations in a nested form. These metaphors hardly seem natural when applied outside of body text, as in the regions above or between sections, and use of any of them in those contexts appears forced and ad hoc.

I would not of course argue against employing existing structures as a hack in cases where desired structures are unavailable, but first I like to ask what structures might be considered desired. At the moment, the idea of grouping adjacent sections seems to be popular. The closest I can get to this idea seeming appropriate is a hypothetical suggestion for a section group type that sits beneath a section and contains some but not all of its subsections. If I could be persuaded of this idea, then I would be likely to be also persuaded that a div type might be used for this purpose until such time that it might become viable to introduce a dedicated type, along with a section type, into the AST design. However, I personally have yet to be persuaded of the premise that grouping sections together, other than by their supersection, is a natural, useful, or productive metaphor, and this state of doubtfulness is where I find myself caught.

In either case, I am glad that you are open to considering how the AST might be revised in the future if revision appears feasible and beneficial. Considering the possible variations for features of a completely abstract document, I find new features that could be easily supported by a section-based tree that are not easily added to the current design. Examples include metadata attached to sections and headerless sections. Neither will be useful in every document, and I am not currently intending to pitch the reasons why they might be useful at all, only to open them as possibilities.

brainchild0 commented 4 years ago

First, I'll try again (imperfectly) to understand what you mean for the AST. You'd like the Readers to parse the texts so that the headers be mapped to something symbolically like
Sec(<Header block>, <attr>)[ <list of non-Sec Blocks >, < list of Sec Blocks >]
or more simply
H(<attr>) [ <list of non-H Blocks >, < list of H-Blocks >]

The first one, yes, or something very close. I'm not sure that I would understand the second one as a useful simplification, as it appears to me to omit the very features being considered as enhancements.

One quibble is that in the new scheme a header is not a block, as a block by definition is a member of a block sequence, which is body text.

Sure, there are other ways to achieve this, but I think it's more than simply convenient, and I don't see (as I've not researched the subject profoundly) how this wouldn't map correctly, elegantly, with clarity and consistency in the AST.

The general form you are considering is a document with front matter in a modern language, followed by sections of a classical source and modern translation, presented side-by-side? Essentially you are considering a scholarly publication, with translation and commentary, of a classical work?

I can consider some representations of this abstraction, but my impulse, if I understand the object correctly, is that the one you propose would not be a suitable way to capture the particular abstraction, because you use section demarcations to separate text that belongs together in the same section.

Note also that considering how such a work is composed, the purest form of source representation might be one that spans multiple documents. The classical source material is likely to consist of a relatively static transcription of some physical manuscript, whereas the translation and commentary is an evolving work. Effectively, the object is to compile two sources into a single publication, with formatting appropriate for the target environment.

What you might really be asking is how to build a published document from multiple sources representing equivalent passages in different languages.

Edit: Reflecting on your proposed input format, I understand how it or a similar format may be useful to the authors or compilers of the source document, while at the same time not being particularly accurate as an abstraction in an AST tree, and generally not reflecting the intention for the target format, applying the common understanding of the meaning of sections.

As such, your example might be a very germane one to illustrate how the document structures of the source, abstract, and target forms must in general be considered separately. In case the convenient representation above is employed to facilitate production of material by human writers, I would be led to conclude that an intermediary filter is one of the appropriate devices to achieve the target formatting objectives, to interweave each translated version of a section into a single section. As before, a filter of this type is easier to consider if sections are organized into nodes of their own type.
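A rough Python sketch of such an interweaving step, assuming each language version has already been split into an ordered list of sections. The `Section` tuple, the `interleave` function, and the `left`/`right` keys are hypothetical stand-ins invented for this example, not pandoc types.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a section node: (heading, paragraphs).
Section = Tuple[str, List[str]]


def interleave(latin: List[Section], english: List[Section]) -> List[Dict]:
    """Pair the i-th section of each version into one combined section."""
    if len(latin) != len(english):
        raise ValueError("both versions must have the same number of sections")
    return [
        {"heading": f"{lat_head} / {eng_head}", "left": lat_body, "right": eng_body}
        for (lat_head, lat_body), (eng_head, eng_body) in zip(latin, english)
    ]


latin = [
    ("Some Latin header", ["etc..."]),
    ("Some other Latin header", ["etc...", "etc..."]),
]
english = [
    ("Some English header", ["etc..."]),
    ("Some other English header", ["etc...", "etc..."]),
]

merged = interleave(latin, english)
print(merged[0]["heading"])
```

The pairing is purely positional here; a real filter would need some rule for versions whose section counts differ, which this sketch simply rejects.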

Ok, I don't know XPath or XQuery, and a few glances here and there do show some simplicity and elegance in the query language used, but I wouldn't be surprised if the underlying implementations have their own kind of walk_block!

I think there is a mistake. The application-specific logic captured largely in make_sections and also in walk_block, which you have both invoked, would not be native to abstract query languages.

kysko commented 4 years ago

One quibble is that in the new scheme a header is not a block

Ok, at least the list of Inlines of the Header then. How would you like it to look, symbolically in native form, or in json form?

As for my example for divs, I should have avoided details about "classical sources" and just give a more abstract one.

You said above that you "had in mind that a section might have 1) an attribute set, 2) up to one header (...)" (my emphasis). So you consider the possibility of headless sections, and this looks a bit like what I use the above divs for, perhaps not the way you want, but for now as good as is allowed.

The application-specific logic captured largely in make_sections and also in walk_block, which you have both invoked, would not be native to abstract query languages.

No, they wouldn't. But isn't that the case also for the logic used with the languages (c, js, .Net, java, etc.) that implement the query languages? Or maybe you're saying something completely different, so I'll leave it at that.

@jgm: The current use of Divs for this is somewhat of a hack.

How far is the hack away from an acceptable form, without considerable changes? Let's say we have something like:

Div(<attr>)[<Header list of inlines>][ <list of Blocks >]

with an additional optional argument that is a list of Inlines. That kind of Div would only be created internally by a reader, doing something like an automatic makeSections, or created by a filter, but not directly in the source document by any notation on a div. That optional argument would be filled with the list of Inlines from the original Header; the Header block itself would disappear, all attributes and Inlines now in that special Div.

It's not the specific Sec node @brainchild0 is asking for, but by the mere presence of that optional argument, this would distinguish that Div from the other Divs for which that opt. arg. is empty, treated differently by Writers, who would recognize and use the data to write what corresponds to the headers referenced.

There could be a transitional pandoc option with which the internal makeSection would not be applied and make pandoc work as it is currently.

Would that minimize the changes? Would this be an acceptable option for Section? Don't know.

In any case, I'm glad for what we have right now (and what's coming with 2.8), just throwing some ideas...

brainchild0 commented 4 years ago

Ok, at least the list of Inlines of the Header then. How would you like it to look, symbolically in native form, or in json form?

I would start by considering the following:

Pandoc := Meta Body
Section := Attr Header? Body
Body := Block* Section*
Header := Inline*

Naturally Header would be no longer a type of Block.

Other than the obvious introduction of the types Section and Body, an important difference from the current model is that all Block nodes descend from a Body node, and no Section, Header, or Body nodes descend from a Block (or Inline) node.
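To make the shape of that grammar concrete, here is a minimal Python sketch using dataclasses. It is only an illustration and not pandoc's actual type definitions; the class and field names are invented, `Block` is a placeholder for any body-text block, and `Body` is read as zero or more blocks followed by zero or more sections, which is an assumption about the intended grammar.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Placeholder for any body-text block (paragraph, list, ...)."""
    text: str


@dataclass
class Section:
    """Section := Attr Header? Body"""
    attr: dict
    header: Optional[List[str]]  # Header := Inline*; None for a headerless section
    body: "Body"


@dataclass
class Body:
    """Body: leading blocks, then subsections (assumed reading of the grammar)."""
    blocks: List[Block] = field(default_factory=list)
    sections: List[Section] = field(default_factory=list)


@dataclass
class Pandoc:
    """Pandoc := Meta Body"""
    meta: dict
    body: Body


def max_depth(body: Body) -> int:
    """Section nesting depth, read directly off the tree shape."""
    return max((1 + max_depth(s.body) for s in body.sections), default=0)


doc = Pandoc(
    meta={"title": "Title"},
    body=Body(
        blocks=[Block("Preamble before any section.")],
        sections=[
            Section({"id": "beginning"}, ["Beginning"], Body(
                blocks=[Block("At first...")],
                sections=[Section({}, None, Body([Block("A headerless aside.")]))],
            )),
        ],
    ),
)

print(max_depth(doc.body))
```

Since `Block` has no field that can hold a `Section`, `Header`, or `Body`, the constraint that none of these descend from a block is enforced by the types themselves.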

brainchild0 commented 4 years ago

How far is the hack away from an acceptable form, without considerable changes?

@kysko I think the issue is that changing readers or writers to utilize the AST types differently is far less disruptive than changing the AST types, currently the central anchor for all major internal (and external) components.

The recent change that is being called the hack is categorically different from potential changes to the AST, making it much more appealing especially in the short term.

But it should be agreed (I think in many ways it is agreed) that ideally the AST types are construed so that they are not subject to any interpretation by the readers and writers except for how best to mediate between the abstract and a particular concrete form of the document.

What makes me nervous about the divs being able to wrap sections, and especially about their being actively used for that purpose, is that it tends to hold the concept of the abstract document hostage to the realities of concrete formats, so as to weaken the ability of the entire system to produce reliably optimal transformations in the greatest number of possible cases.

I come to suggest then, that whatever the AST is for the time being, if at any time the view of what it should be departs from what it is, then that ideal view is the best place to start in considering the questions of this topic.

Again, nothing above represents a call imminently to change the actual AST definition.

brainchild0 commented 4 years ago

the Header block itself would disappear

@kysko The purpose of not changing an AST design would be to keep backward compatibility and to limit or to eliminate the effect on other components. This advantage is completely lost if the header disappearing is proposed as part of a transitional state.

kysko commented 4 years ago

Maybe I wasn't clear, I'll try to make more sense... If we have a header of level k, attribute a, inlines ht for its content/title, and if the Bi's are blocks following the header until the next header of level k, then make_sections presently transforms (the "hack")

Header k (a) [ht], B1, B2, ...

into

Div(a, .section)[ Header k (n) [ht],
 B1, B2, ...
]

where n is the kv numbering for this header (like number="1.2.3.4"). There can be a bijection (if I can call it that) between this and:

Sec(a, k, n)[ht][
 B1, B2, ...
]

as long as the form of the former doesn't change, e.g. no transformation changes the header into another type of Block, or puts another block between it and the Div. (Native notation above is not exact, trying to be concise)
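The correspondence can be sketched in Python as a pair of inverse functions. This is a toy model with invented dataclasses (`Header`, `Para`, `Div`, `Sec`), not pandoc's types or API, and it assumes the section Div keeps its Header as the first child. The attribute dictionary is carried over unchanged, including the "section" class, so that the round trip is exact.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Header:
    level: int            # k
    number: str           # n, e.g. "1.2.3.4"
    inlines: List[str]    # ht


@dataclass
class Para:
    text: str


@dataclass
class Div:
    attr: dict            # e.g. {"id": "a", "classes": ["section"]}
    content: list


@dataclass
class Sec:
    attr: dict
    level: int
    number: str
    title: List[str]
    content: list


def div_to_sec(block):
    """Div(a, .section)[Header k (n) [ht], B1, ...] -> Sec(a, k, n)[ht][B1, ...]."""
    if isinstance(block, Div):
        inner = block.content
        if ("section" in block.attr.get("classes", [])
                and inner and isinstance(inner[0], Header)):
            head = inner[0]
            return Sec(block.attr, head.level, head.number, head.inlines,
                       [div_to_sec(b) for b in inner[1:]])
        # Ordinary Div: keep it, but convert any section Divs nested inside.
        return Div(block.attr, [div_to_sec(b) for b in inner])
    return block


def sec_to_div(block):
    """Inverse direction: rebuild the Div-plus-Header form from a Sec."""
    if isinstance(block, Sec):
        head = Header(block.level, block.number, block.title)
        return Div(block.attr, [head] + [sec_to_div(b) for b in block.content])
    if isinstance(block, Div):
        return Div(block.attr, [sec_to_div(b) for b in block.content])
    return block


flat = Div({"id": "a", "classes": ["section"]}, [
    Header(1, "1", ["Beginning"]),
    Para("At first..."),
    Div({"id": "", "classes": ["exterior"]}, [Para("fenced text")]),
    Div({"id": "b", "classes": ["section"]}, [
        Header(2, "1.1", ["Middle"]),
        Para("So it continued..."),
    ]),
])

sec = div_to_sec(flat)
print(sec_to_div(sec) == flat)
```

If some transformation moves the Header away from the first position, `div_to_sec` no longer recognizes the Div as a section and leaves it as an ordinary Div, which is the condition under which the correspondence breaks down.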

Rather than introduce a new Sec, I had proposed the use of a changed Div, a supplemental argument, like:

Div(a, k, n)[ht][
 B1, B2, ...
]

because I thought it might be an easier change internally.

So a writer that sees Sec(a, k, n)[ht][B1, B2, ...] (or Div(a, k, n)[ht][B1, B2, ...]) can be told to see the equivalent of Div(a, .section)[Header k (n) [ht], B1, B2, ...], which might help reduce changes. Because of that bijection, I thought this would reduce the pains in a transition... but I don't write Haskell, so I have no idea how naïve the idea is!

(I probably think too much in terms of functions, and that might not be appropriate in this case about types...)

What would not be compatible are scripts; hence a pandoc option to prevent automatic sectioning for the old format. (So in that sense, Header still exists, but only in "legacy" mode; if a script tries to create a Sec in legacy mode, an error is given; if it tries to create a Header in non-legacy mode, an error is given. I have no idea if this dual mode is precluded in ASTs though, so this might be infeasible.)

brainchild0 commented 4 years ago

@kysko While not understanding a few minor details you present, such as how it would be possible or why it would be desirable to use an integer type representing section depth to instead represent a compound iterator for a section number, I would agree that a subset of the currently possible ASTs maps one-to-one with those following a potential design that uses sections as discussed previously. As such, a reversible transformation, a bijection, exists for those two spaces. I would not hold controversial this bit of theoretical musing.

But if a proposed new use of the existing AST types requires changes to the internal and external readers and writers and the filter handlers, in what sense does this use ease the pains of transition from what would be required through a change to the AST types?

I doubt this issue is one of Haskell versus other languages, or of functions versus other constructs. Perhaps you are looking for ways to minimize the complexity of the changes to each component, but as noted, the AST has a central location in relation to other design components, and the overarching concern is to contain the number of components that are changed and the number of times they are changed, other concerns being secondary. Consider if the transitional changes you suggest were adopted, as well as a legacy mode being supported as you suggest. Would there not then be an implication that after the transitional phase were moved to a final one, then a second legacy mode would also be added, representing the transitional phase, and be maintained indefinitely? Wouldn't a plan that minimizes the total accumulation of such legacy modes be far preferred?

kysko commented 4 years ago

such as how it would be possible or why it would be desirable to use an integer type representing section depth to instead represent a compound iterator for a section number

Are you talking about k or n? As I stated, perhaps too quietly: "(Native notation above is not exact, trying to be concise)". If you meant n, I stated: "where n is the kv numbering for this header (like number="1.2.3.4")". That is, I wrote "n" just to quickly express something like 'number="1.2.3.4"' (or rather {number="1.2.3.4"}), which is the current (in the nightlies) key-value created by make_sections to represent the depth (if that option is used).

But (...) Consider (...)

The Div/Header made by make_sections was described as a "hack" to emulate a "Section container" (which implies that it's good enough for this purpose presently), and I was asking how far this "hack" was really from a proper solution, considering the correspondence I tried to illustrate. It was not a rhetorical question.

Sometimes, through shims, one can reduce the "pains of transition"... Is it possible here? Don't know. To quote myself again from above: "I have no idea how naïve the idea is".

Not knowing much about Haskell, nor having a deep knowledge of AST's, I use the native representation to express myself symbolically. I recognize that this may not translate naïvely to the core of the problem. If what I said is senseless, then so be it. But if some tidbits of what I wrote can help, then I'm glad I could contribute something, albeit minuscule.

brainchild0 commented 4 years ago

The Div/Header made by make_sections was described as a "hack" to emulate a "Section container" (which implies that it's good enough for this purpose presently), and I was asking how far this "hack" was really from a proper solution, considering the correspondence I tried to illustrate. It was not a rhetorical question.

My personal position is neutral on the changes related to make_sections, but let's take the premise that the new functionality is in some sense "good enough" for some purpose. Is it good enough for all possible purposes? As you appear to be suggesting its use for a new purpose, the conclusion of it being good enough for that purpose depends on a separate demonstration.

Sometimes, through shims, one can reduce the "pains of transition"... Is it possible here? Don't know. To quote myself again from above: "I have no idea how naïve the idea is".

I think the broad issue is that the semantic changes are disruptive system-wide, not simply in measures of code complexity and development effort, but in human interactions relating to preserving operational consistency in user deployments generally. A two-phase approach is not easier, in this case, because it doubles the number of transitions without proportionate reduction in pain per transition.

If that explanation is not helping, think about the considerations municipal bodies review when planning road and infrastructure repairs. The cost of a construction project far exceeds funding the workers and equipment, extending also to the disruption in traffic patterns and regular activities of those traveling in the city. Planners will consolidate related repairs into a single project even if the effect is to make each project more involved.

The general intuition you are applying is not absurd, but the details of the case represent a relevant distinction.

connorp commented 3 years ago

For what it's worth, I agree that it would be better for some purposes to have a Section container in the AST. I didn't do that originally because I was just trying to represent the structure of Markdown. The current use of Divs for this is somewhat of a hack. At some point in the future we might consider adding a Section container (and this could be proposed on jgm/pandoc-types). But AST changes are very painful breaking changes, which require modifications all through the code base and all through the pandoc ecosystem, so I have been very conservative about them.

If/when this change to the AST does get made, I'd encourage it to be capable of representing <article> tags as well as <section> tags, and other ways to differently specify sections that are sections in nature (from the AST's perspective) but have some nuanced differences in the output format.

brainchild0 commented 3 years ago

If/when this change to the AST does get made, I'd encourage it to be capable of representing <article> tags as well as <section> tags

Isn't <article> used to demarcate the main document content from other text that is more toward the intention of decoration, embellishment, or metadata?

connorp commented 2 years ago

Isn't <article> used to demarcate the main document content from other text that is more toward the intention of decoration, embellishment, or metadata?

Not exclusively. In a blog-style page for instance, there could be multiple <article> tags. My understanding is that <section> and <article> are structurally identical, but have different meanings for things like accessibility. So it wouldn't require its own type in the AST, but just have the ability to specify the tag name e.g. via a specified class.

brainchild0 commented 2 years ago

It may be relevant to consider how to handle multiple <article> elements in the same source document; I was not aware that this was allowed as a general case. Even interpreting my original language under the constraint of at most one such element in any document, though, the immediate issue would be handling content within an <article> element, versus content not within any such element even though one occurs elsewhere in the document.

My general sense would be that if a document has at least one such element, then content outside of any such element is not strictly part of the abstract document.