jgm / pandoc

Universal markup converter
https://pandoc.org
Other
34.55k stars 3.38k forks source link

Handle insertions and deletions (<ins> and <del>) consistently across readers/writers #1560

Open mfenner opened 10 years ago

mfenner commented 10 years ago

The new docx reader converts docx track changes into HTML. I suggest the following changes:

mpickering commented 10 years ago

This is probably desirable but technically difficult to get right because there isn't a specific place in the AST for insertions and deletions.

On Sun, Aug 24, 2014 at 2:17 PM, Martin Fenner notifications@github.com wrote:

The new docx reader converts docx track changes into HTML. I suggest the following changes:

  • use and instead of and `", as they are standard HTML tags and will render in most browsers.
  • don't use a wrapper including author and date for combined insertions/deletions, rather use author and date attributes on both the and tag.

— Reply to this email directly or view it on GitHub https://github.com/jgm/pandoc/issues/1560.

jgm commented 10 years ago

We have the Spans in the AST. I think it would just be a matter of changing the HTML writer so that it renders these spans with <del> and <ins> rather than the spans. Probably the author and date attributes should be included on whatever tags are used.

+++ mpickering [Aug 24 14 06:39 ]:

This is probably desirable but technically difficult to get right because there isn't a specific place in the AST for insertions and deletions. On Sun, Aug 24, 2014 at 2:17 PM, Martin Fenner notifications@github.com wrote:

The new docx reader converts docx track changes into HTML. I suggest the following changes:

  • use and instead of and `", as they are standard HTML tags and will render in most browsers.
  • don't use a wrapper including author and date for combined insertions/deletions, rather use author and date attributes on both the and tag.

— Reply to this email directly or view it on GitHub https://github.com/jgm/pandoc/issues/1560.

— Reply to this email directly or [1]view it on GitHub.

References

  1. https://github.com/jgm/pandoc/issues/1560#issuecomment-53193321
mpickering commented 10 years ago

I get a bit worried about the sustainability of this approach in general especially as it causes unexpected results in a user has "insertion" class but also because you lose type guarantees and it makes code more difficult to understand in someone is unfamiliar.

jgm commented 10 years ago

It does add some complexity. @mfenner, why not just add some CSS rules for these span elements, so they're displayed as you like?

+++ mpickering [Aug 24 14 08:40 ]:

I get a bit worried about the sustainability of this approach in general especially as it causes unexpected results in a user has "insertion" class but also because you lose type guarantees and it makes code more difficult to understand in someone is unfamiliar.

— Reply to this email directly or [1]view it on GitHub.

References

  1. https://github.com/jgm/pandoc/issues/1560#issuecomment-53197299
jgm commented 7 years ago

One way to do this without an AST change would be to special-case these spans in the HTML writer.

mb21 commented 7 years ago

One way to do this without an AST change would be to special-case these spans in the HTML writer.

I think a dedicated element would work better in the long run.. considering for example how the image/figure hack turned out, I think it's better to include more AST elements.

mb21 commented 6 years ago

I've broadened the scope of this issue. If we decide to handle insertions/deletions better across pandoc, there are a few formats to consider.

In HTML and markdown, pandoc 2.4 currently does:

$ echo '<ins>foo</ins>' | pandoc -f html -t native
[Plain [Span ("",["underline"],[]) [Str "foo"]]]

$ echo '<del>foo</del>' | pandoc -f html -t native
[Plain [Strikeout [Str "foo"]]]

$ echo '~~foo~~' | pandoc -f markdown -t native
[Para [Strikeout [Str "foo"]]]

Looking at CriticMarkup, from this closed issue:

critic markup HTML LaTeX
{--[text]--} <del>[text]</del> \st{[text]}
{++[text]++} <ins>[text]</ins> \underline{[text]}
{~~[text1]~>[text2]~~} <del>[text1]</del><ins>[text2]</ins> \st{[text1]}\underline{[text2]}
{==[text]==} <mark>[text]</mark> \hl{[text]}
{>>[text]<<} <aside>[text]</aside> \marginpar{[text]}

See also the LaTeX changes package.

And of course the existing docx --track-changes option. (In the code, look for AcceptChanges in the Docx reader.)


If we go with an AST change, it might seem like we simply could:

However, Strikeout is an Inline element, so it's not clear how to handle changes that span multiple blocks. From pandoc-discuss:

It's not that easy. What kind of native output would this produce?

- My first list item. 
- My second {-- list item. 
- My third --} combined list item. 

Presumably a list with three items, the second and third of which contain these special StartDelete and StopDelete markers. But that doesn't tell you the structure you want after applying the edits -- which is a list with two items. Special logic for doing this sort of transformation would need to be included somewhere. And there are much more complex cases I could come up with.

It seems the fundamental question is whether this should be modelled in the pandoc AST like the HTML <ins> and <del> elements (which are block-level elements, but nonetheless force you to serialize the changes to a tree), or whether it should be more like a plain-text diff kind of thing, where we have starting and ending markers that can cross nodes in the pandoc AST (that's more what CiriticMarkup does). But if it's the second, you can just use an external preprocessor to diff your changes and handle the markers (like pancritic), and pandoc wouldn't have to have anything to do with it.

So, input welcome on how different formats handle this. Especially, I'm unclear on how docx --track-changes and LaTeX handle:

davidar commented 5 years ago

It's not that easy. What kind of native output would this produce?

- My first list item. 
- My second {-- list item. 
- My third --} combined list item. 

As mentioned in that thread, the CriticMarkup spec actually recommends against these kinds of AST-breaking changes:

Avoid Newlines in CriticMarkup

Wrap Markdown Tags Completely

While it may support incomplete Markdown tags in the future, the CriticMarkup processor currently chokes on them. Avoid this:

I really love *italic {~~fonts*~>font-styles*~~}.

Instead, wrap the asterisks completely:

I really love {~~*italic fonts*~>*italic font-styles*~~}.

The one exception to this seems to be for insertion/deletion of paragraph breaks, I'm not sure if there's a clean way to handle that case.

ickc commented 5 years ago

Don’t read too much into the current CriticMarkup “spec” though. It is not a proper parser but just a bunch of regex. (FYI I’m maintaining a fork of the reference implementation of CriticMarkup in the form of pancritic which provides Python 3 support among other things.)

Multiline (ie Newlines crossing) shouldn’t be a big problem but “boundary crossing” between markdown markups and CriticMarkup is.

If it is decided to support CriticMarkup (as a syntax, not its spec) in pandoc then we got to eventually decided on how it should behaves (ie a spec) which will most likely different from current one.

mgajda commented 5 years ago

Is it much work to spec out these issues? CriticMarkup is becoming a standard among Markdown editors, but it is difficult to pass it through standard pandoc.

ttxtea commented 3 years ago

I would argue for an AST based pandoc intrinsic implementation. Thanks to ickc the preporcessor variant has been around for some while, but it could gain wider audience if it was just a pandoc filter like the docx->criticmarkup lua filter [1]. The multi author writing process could benefit a lot from round trips like docx->criticmarkup->docx

[1] https://gist.github.com/noamross/12e67a8d8d1fb71c4669cd1ceb9bbcf9

Also criticmarkup's comments seem more like an <aside>, while docx comments can refer to a whole body of text and are translated to [This is a Comment]{.start-comment}highlighted text[]{.end-comment} by --track-changes. Someone has proposed something like ?[highlighted text](This is a comment) instead of {>>This is a comment<<}. No idea if that is really necessary.