Open mfenner opened 10 years ago
This is probably desirable but technically difficult to get right because there isn't a specific place in the AST for insertions and deletions.
On Sun, Aug 24, 2014 at 2:17 PM, Martin Fenner notifications@github.com wrote:
The new docx reader converts docx track changes into HTML. I suggest the following changes:
- use and
instead of and `", as they are standard HTML tags and will render in most browsers.- don't use a wrapper including author and date for combined insertions/deletions, rather use author and date attributes on both the and
tag.— Reply to this email directly or view it on GitHub https://github.com/jgm/pandoc/issues/1560.
We have the Spans in the AST. I think it would just be a matter of changing
the HTML writer so that it renders these spans with <del>
and <ins>
rather than the spans. Probably the author and date attributes should
be included on whatever tags are used.
+++ mpickering [Aug 24 14 06:39 ]:
This is probably desirable but technically difficult to get right because there isn't a specific place in the AST for insertions and deletions. On Sun, Aug 24, 2014 at 2:17 PM, Martin Fenner notifications@github.com wrote:
The new docx reader converts docx track changes into HTML. I suggest the following changes:
- use and
instead of and `", as they are standard HTML tags and will render in most browsers.- don't use a wrapper including author and date for combined insertions/deletions, rather use author and date attributes on both the and
tag.— Reply to this email directly or view it on GitHub https://github.com/jgm/pandoc/issues/1560.
— Reply to this email directly or [1]view it on GitHub.
References
I get a bit worried about the sustainability of this approach in general especially as it causes unexpected results in a user has "insertion" class but also because you lose type guarantees and it makes code more difficult to understand in someone is unfamiliar.
It does add some complexity. @mfenner, why not just add some CSS rules for these span elements, so they're displayed as you like?
+++ mpickering [Aug 24 14 08:40 ]:
I get a bit worried about the sustainability of this approach in general especially as it causes unexpected results in a user has "insertion" class but also because you lose type guarantees and it makes code more difficult to understand in someone is unfamiliar.
— Reply to this email directly or [1]view it on GitHub.
References
One way to do this without an AST change would be to special-case these spans in the HTML writer.
One way to do this without an AST change would be to special-case these spans in the HTML writer.
I think a dedicated element would work better in the long run.. considering for example how the image/figure hack turned out, I think it's better to include more AST elements.
I've broadened the scope of this issue. If we decide to handle insertions/deletions better across pandoc, there are a few formats to consider.
In HTML and markdown, pandoc 2.4 currently does:
$ echo '<ins>foo</ins>' | pandoc -f html -t native
[Plain [Span ("",["underline"],[]) [Str "foo"]]]
$ echo '<del>foo</del>' | pandoc -f html -t native
[Plain [Strikeout [Str "foo"]]]
$ echo '~~foo~~' | pandoc -f markdown -t native
[Para [Strikeout [Str "foo"]]]
Looking at CriticMarkup, from this closed issue:
critic markup HTML LaTeX {--[text]--}
<del>[text]</del>
\st{[text]}
{++[text]++}
<ins>[text]</ins>
\underline{[text]}
{~~[text1]~>[text2]~~}
<del>[text1]</del><ins>[text2]</ins>
\st{[text1]}\underline{[text2]}
{==[text]==}
<mark>[text]</mark>
\hl{[text]}
{>>[text]<<}
<aside>[text]</aside>
\marginpar{[text]}
See also the LaTeX changes package.
And of course the existing docx --track-changes
option. (In the code, look for AcceptChanges
in the Docx reader.)
If we go with an AST change, it might seem like we simply could:
Strikeout
element with a Deleted
elementInserted
elementHowever, Strikeout
is an Inline
element, so it's not clear how to handle changes that span multiple blocks. From pandoc-discuss:
It's not that easy. What kind of native output would this produce?
- My first list item. - My second {-- list item. - My third --} combined list item.
Presumably a list with three items, the second and third of which contain these special StartDelete and StopDelete markers. But that doesn't tell you the structure you want after applying the edits -- which is a list with two items. Special logic for doing this sort of transformation would need to be included somewhere. And there are much more complex cases I could come up with.
It seems the fundamental question is whether this should be modelled in the pandoc AST like the HTML <ins>
and <del>
elements (which are block-level elements, but nonetheless force you to serialize the changes to a tree), or whether it should be more like a plain-text diff kind of thing, where we have starting and ending markers that can cross nodes in the pandoc AST (that's more what CiriticMarkup does). But if it's the second, you can just use an external preprocessor to diff your changes and handle the markers (like pancritic), and pandoc wouldn't have to have anything to do with it.
So, input welcome on how different formats handle this. Especially, I'm unclear on how docx --track-changes and LaTeX handle:
It's not that easy. What kind of native output would this produce?
- My first list item. - My second {-- list item. - My third --} combined list item.
As mentioned in that thread, the CriticMarkup spec actually recommends against these kinds of AST-breaking changes:
Avoid Newlines in CriticMarkup
Wrap Markdown Tags Completely
While it may support incomplete Markdown tags in the future, the CriticMarkup processor currently chokes on them. Avoid this:
I really love *italic {~~fonts*~>font-styles*~~}.
Instead, wrap the asterisks completely:
I really love {~~*italic fonts*~>*italic font-styles*~~}.
The one exception to this seems to be for insertion/deletion of paragraph breaks, I'm not sure if there's a clean way to handle that case.
Don’t read too much into the current CriticMarkup “spec” though. It is not a proper parser but just a bunch of regex. (FYI I’m maintaining a fork of the reference implementation of CriticMarkup in the form of pancritic which provides Python 3 support among other things.)
Multiline (ie Newlines crossing) shouldn’t be a big problem but “boundary crossing” between markdown markups and CriticMarkup is.
If it is decided to support CriticMarkup (as a syntax, not its spec) in pandoc then we got to eventually decided on how it should behaves (ie a spec) which will most likely different from current one.
Is it much work to spec out these issues? CriticMarkup is becoming a standard among Markdown editors, but it is difficult to pass it through standard pandoc
.
I would argue for an AST based pandoc intrinsic implementation. Thanks to ickc the preporcessor variant has been around for some while, but it could gain wider audience if it was just a pandoc filter like the docx
->criticmarkup
lua filter [1]. The multi author writing process could benefit a lot from round trips like docx
->criticmarkup
->docx
[1] https://gist.github.com/noamross/12e67a8d8d1fb71c4669cd1ceb9bbcf9
Also criticmarkup
's comments seem more like an <aside>
, while docx
comments can refer to a whole body of text and are translated to [This is a Comment]{.start-comment}highlighted text[]{.end-comment}
by --track-changes
. Someone has proposed something like ?[highlighted text](This is a comment)
instead of {>>This is a comment<<}
. No idea if that is really necessary.
The new docx reader converts docx track changes into HTML. I suggest the following changes:
<ins>
and<del>
instead of<span class="insertion">
and `", as they are standard HTML tags and will render in most browsers.<span>
including author and date for combined insertions/deletions, rather use author and date attributes on both the<ins>
and<del>
tag.