Closed naught101 closed 8 years ago
I like the idea.
In principle, it would be possible to create something that did a diff at the level of the pandoc AST. The tool could mark up insertions and deletions in inline contexts by putting them in special Span elements, and in block contexts by putting them in special Div elements. The result could be viewed in HTML using some appropriate CSS, or in other formats. The tool could be made to work on documents in any format pandoc can read.
It would differ from other diff tools in that it only recognizes differences in the basic structural elements recognized by pandoc. So, for example, the two markdown strings *hi* and _hi_ would not have differences according to this tool.
It would not be too hard to implement this as a separate program using the pandoc library. You'd probably want to use the Diff library on Hackage for the diff'ing part.
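To make the Span idea concrete, here is a minimal sketch in Python that works on the JSON form of the AST rather than on the Haskell types. It uses the standard difflib module in place of the Diff library, and the class names "deletion" and "insertion" are placeholders chosen for illustration, not anything pandoc defines.

```python
import json
from difflib import SequenceMatcher

def span(cls, inlines):
    """Wrap inlines in a Span node whose only attribute is one class."""
    return {"t": "Span", "c": [["", [cls], []], inlines]}

def diff_inlines(old, new):
    """Diff two lists of inline nodes; changed runs are wrapped in Spans."""
    # dicts are unhashable, so compare nodes via a canonical JSON string
    def keys(nodes):
        return [json.dumps(n, sort_keys=True) for n in nodes]
    sm = SequenceMatcher(a=keys(old), b=keys(new), autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            out.extend(old[i1:i2])
            continue
        if i1 < i2:
            out.append(span("deletion", old[i1:i2]))
        if j1 < j2:
            out.append(span("insertion", new[j1:j2]))
    return out

# Both *hi* and _hi_ parse to the same Emph node, so only the last word differs.
emph_hi = {"t": "Emph", "c": [{"t": "Str", "c": "hi"}]}
old = [emph_hi, {"t": "Space"}, {"t": "Str", "c": "there"}]
new = [emph_hi, {"t": "Space"}, {"t": "Str", "c": "world"}]
result = diff_inlines(old, new)
```

The result could be written back out as JSON and rendered by pandoc to HTML, where CSS rules for the two classes would supply the colours.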
+++ naught101 [Aug 27 15 22:03 ]:
There is a tool called latexdiff that takes two LaTeX files and creates a PDF that is a track-changes style diff of the two (e.g. deletions in red, additions in blue or similar).
It would be really nice to have such a thing with pandoc. In particular, because you can make latexdiff play nicely with git, and make nice "this is what I've done since you last saw it" pdf documents that are really useful for showing to paper reviewers.
Mostly I'm only interested in this for markup formats (markdown, rst), at least initially.
I don't know if it makes sense to do something like this within pandoc, or as a separate script, but even if you're not interested in adding it, I'd be interested to get your thoughts on how best to go about it as a separate script.
I would be really interested in seeing this happen, too. Like the OP I use pandoc to write academic papers, and it is often necessary to show coauthors or reviewers what changes have been done. I experimented with doing the diff on generated LaTeX files using latexdiff, or doing the diff on the Markdown files having wdiff insert some CriticMarkup-inspired marks, but almost always the diff files prove to be not compilable, and repairing them by hand is extremely tedious. Especially wdiff tends not to play well with embedded LaTeX math in the Markdown code. I therefore believe that implementing something like this on the internal AST representation is the way to go. @naught101, have you made any progress on this?
One tricky thing is that the AST is not just a list; it's a tree-like structure that includes list-like structures. The Diff library provides functions to give you diffs of list-like structures, but it's not entirely obvious how to do diffs on trees. There must be prior art on this somewhere.
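One naive way to lift a list diff to a tree is to diff the children at each level and recurse wherever a run of nodes was replaced by a run of equal length. The Python sketch below does this for trees written as nested lists with string leaves. It is a heuristic for illustration only, not an established tree-diff algorithm, and the tags "same", "del", "ins" and "changed" are made up for this example.

```python
from difflib import SequenceMatcher

def tree_diff(old, new):
    """Diff two nested lists; recurse into sublists that were replaced 1:1."""
    def keys(xs):
        # repr() gives a hashable stand-in for possibly nested items
        return [repr(x) for x in xs]
    sm = SequenceMatcher(a=keys(old), b=keys(new), autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        a, b = old[i1:i2], new[j1:j2]
        if tag == "equal":
            out += [("same", x) for x in a]
        elif tag == "replace" and len(a) == len(b):
            for x, y in zip(a, b):
                if isinstance(x, list) and isinstance(y, list):
                    out.append(("changed", tree_diff(x, y)))
                else:
                    out += [("del", x), ("ins", y)]
        else:
            out += [("del", x) for x in a] + [("ins", y) for y in b]
    return out

old_doc = [["Hi", "there"], ["World"]]
new_doc = [["Hi", "you"], ["World"]]
diff = tree_diff(old_doc, new_doc)
```

Here the first sublist is reported as "changed" with a nested diff, while the second is reported as "same". A real tool would also need to compare node types before recursing.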
@jgm, is there documentation for the AST structure, i.e. what kind of nodes can be encountered? I couldn't find any.
https://hackage.haskell.org/package/pandoc-types-1.12.4.7/docs/Text-Pandoc-Definition.html
@murfit you can see what AST pandoc will make of your source by using the -t native flag:
$ pandoc -f latex -t native yourdoc.tex > yourdoc.ast
@jgm, thanks, but that doesn't really explain the meaning and usage of these structure elements. For instance, can I always assume that the first element of the primary list in the JSON form is of type 'unMeta'? And why doesn't it appear at the beginning of the native form? What's the meaning of the extra parameters of a node of type 'Header'? etc.
@technocrat, yes, I've started to look at both the native and the JSON form, but figuring out things from there is reverse engineering, and I can never be sure that any code that I produce will work on arbitrary Markdown documents. And metadata doesn't appear to be even included in the native form, which means I can never reconstruct the full document from that.
@murfit I'm struggling with the same problem; it's all on hackage, but my Haskell reading skills aren't there quite yet, which is why I'm going through the exercise of picking out pieces of the AST and writing filters for them. Right now I'm stuck on RawBlock, but I'm patient.
@murfit, to get a feel for the AST, I'd recommend using pandoc -t native to convert some Markdown samples.
The JSON is automatically converted from the native Haskell structure; it's best to get familiar with the latter, as it would probably be easiest to write a pandoc-diff tool in Haskell and have it operate directly on the AST, rather than going through the JSON representation.
When you do pandoc -t native -s, you'll get the metadata (the complete Pandoc structure). Without -s you'll just get a list of blocks.
@jgm, I understand that Haskell would be the optimal choice, but I don't want to learn a new language just for a single project. I've seen that there is Python support for filters, but for a diff one would need to operate on two files at once. Is there a good Python way to access JSON-formatted AST, preferably with a simple API to traverse the tree, get the JSON representation of subtrees, etc.? (I'm not a Python expert either, but I know the basics and would probably be able to manage.)
An alternative would be to operate on some kind of normalized Markdown, e.g. a form where each sentence or partial sentence is on a single line. The diff then wouldn't go down to the word level, but it would be easier to implement (only two levels: paragraphs and sentences), and the result should still be useful.
There is my jgm/pandocfilters library, which is designed mainly for writing filters. The walk function it provides is for convenient tree walking. The source for toJSONFilter gives a simple example of its use.
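For a diff, where two documents have to be loaded side by side, the JSON AST can also be traversed with nothing but the standard json module. The following is a hand-written, read-only walker in Python. It is only similar in spirit to the walk function and is not the pandocfilters API. It assumes the JSON layout of recent pandoc versions (an object with "meta" and "blocks") and falls back to the older two-element list whose first element holds "unMeta".

```python
import json

def blocks_of(doc):
    """Return the top-level block list for either JSON layout."""
    if isinstance(doc, dict):      # newer layout: {"meta": ..., "blocks": [...]}
        return doc["blocks"]
    return doc[1]                  # older layout: [{"unMeta": ...}, [blocks]]

def iter_nodes(x):
    """Yield every node that carries a "t" (type) key, depth first."""
    if isinstance(x, list):
        for item in x:
            yield from iter_nodes(item)
    elif isinstance(x, dict):
        if "t" in x:
            yield x
        for value in x.values():
            yield from iter_nodes(value)

# Same document as "Hi!\n\n* World", written out by hand for illustration.
text = """{"pandoc-api-version": [1, 23], "meta": {}, "blocks": [
  {"t": "Para", "c": [{"t": "Str", "c": "Hi!"}]},
  {"t": "BulletList", "c": [[{"t": "Plain", "c": [{"t": "Str", "c": "World"}]}]]}
]}"""
types = [node["t"] for node in iter_nodes(blocks_of(json.loads(text)))]
```

In practice the text would come from running pandoc -t json on each of the two input files.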
However, I think for this project it would end up being far easier to use Haskell, even if you currently know python better.
Although Haskell has lots of complexities, this sort of project would use only fairly basic features. The hardest part would be algorithmic, figuring out how to do a diff on a Pandoc structure, given a way to do a diff on arbitrary lists (which the Diff library provides).
I just found this Haskell library, which might go almost all the way: https://hackage.haskell.org/package/gdiff-1.1/docs/Data-Generic-Diff.html
It does use some advanced features. I'll look into what would be required.
@murfit Actually your idea of using pandoc (+ maybe some other processing) to produce a canonical document in, say, Markdown, and comparing these makes quite a bit of sense. You could even use a word-level diff algorithm and skip the step of putting each sentence on a line.
I got good results just now comparing two versions of the pandoc README using dwdiff. You can specify the string you want to use as start and end markers for deleted and inserted text. So, for example, you could use <ins> and <del> tags if you were targeting HTML.
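For readers without dwdiff at hand, the same idea can be sketched in a few lines of Python with the standard difflib module. This is not how dwdiff works internally; it only illustrates a word-level diff with configurable start and end markers. The function name and parameter names are invented for this example.

```python
from difflib import SequenceMatcher

def word_diff(old, new, ins=("<ins>", "</ins>"), dele=("<del>", "</del>")):
    """Word-level diff of two strings; whitespace is normalised to single spaces."""
    a, b = old.split(), new.split()
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            out.append(" ".join(a[i1:i2]))
            continue
        if i1 < i2:   # words present only in the old text
            out.append(dele[0] + " ".join(a[i1:i2]) + dele[1])
        if j1 < j2:   # words present only in the new text
            out.append(ins[0] + " ".join(b[j1:j2]) + ins[1])
    return " ".join(out)

marked = word_diff("pandoc is a converter", "pandoc is a universal converter")
```

Because the input is split on whitespace, changes that consist only of whitespace (such as a paragraph break) are invisible to this sketch.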
@jgm, a word diff gives good results only if the changes are to single words or small phrases within a paragraph. Problems occur when changes cross structural borders. Here are a few examples that I found in diffing two revisions of a paper I'm currently working on, using the default change indicators of dwdiff ([- -] and {+ +}); just imagine them being replaced by e.g. <del> and <ins> elements.
A change at the end of a footnote which is at the end of a paragraph, and after that a heading is inserted:
^[Footnote text with [-change.]-] {+change; see below.]
## Another Section+}
A change at the end of embedded LaTeX math and in the immediately following text:
$a_0 = [-50\,\%$). If-] {+50\,\%$.+}
A change at the end of a reference:
[-@RefA].-] {+@RefA; @RefB].+}
Problems occur also when changes are localized within elements that don't support change markup:
$[-a-] {+b+}$
Another drawback of a pure word-based diff is that it will ignore changes in whitespace, which might be relevant e.g. when one paragraph is being split into two, or two are joined.
And a whole other world of trouble is implied by more complex structures. Imagine a list item is being split into two list items: word diff will mark "-" as an extra word.
I have come up with a strategy to deal with at least most of these problems:
1) Do a diff on the paragraph (block) level. Since the diff algorithm won't be able to detect that a paragraph has been changed, this will always show up as the whole old paragraph being deleted and the whole new one being inserted.
2) Use a string distance to find candidate pairs of deleted and added paragraphs that are actually the same paragraph being changed.
3) For each changed paragraph, do a word-based diff within. However, for that the diff algorithm has to be taught to treat embedded math, references, footnotes... as "words", i.e. as units that can only be changed as a whole.
4) Some changed blocks can't be treated like this, e.g. metadata blocks, tables, maybe even lists, so it is better to always show them as deleted+added as a whole. (But metadata won't allow even that.)
It would be even better to do this in a recursive way, e.g. compare list blocks by treating them as sequences of list items that have to be matched the same way as paragraphs are on the top level. But I think I'll be glad if I manage to implement 1-4.
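Steps 1 and 2 of the strategy above can be sketched as follows in Python. The sketch assumes paragraphs are plain strings, uses difflib's ratio as the string distance, and pairs paragraphs greedily in document order. The threshold of 0.6 and the status names are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity of two strings in [0, 1], 1 meaning identical."""
    return SequenceMatcher(a=a, b=b, autojunk=False).ratio()

def diff_blocks(old, new, threshold=0.6):
    """Return (status, old_par, new_par) triples for two paragraph lists."""
    sm = SequenceMatcher(a=old, b=new, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            out += [("same", p, p) for p in old[i1:i2]]
            continue
        inserted = list(new[j1:j2])
        for d in old[i1:i2]:
            # step 2: look for the most similar inserted paragraph
            best = max(inserted, key=lambda p: similarity(d, p), default=None)
            if best is not None and similarity(d, best) >= threshold:
                k = inserted.index(best)
                out += [("inserted", None, p) for p in inserted[:k]]
                out.append(("changed", d, best))
                inserted = inserted[k + 1:]
            else:
                out.append(("deleted", d, None))
        out += [("inserted", None, p) for p in inserted]
    return out

old_pars = ["Intro.", "The quick brown fox jumps over the dog.", "Removed paragraph here."]
new_pars = ["Intro.", "The quick red fox jumps over the dog.", "Zzz."]
blocks = diff_blocks(old_pars, new_pars)
```

Each pair reported as "changed" is then a candidate for the word-based diff of step 3, while "deleted" and "inserted" paragraphs are shown whole.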
Do you have tips on which rules to follow for the "treat some elements as words" part of step 3?
I'm currently prototyping in a language that shall remain unnamed (because it's embarrassing ;), operating on the level of reformatted Markdown source, calling diff and wdiff as external helpers. This might eventually lead to an implementation in bash & friends. If I encounter problems that can't be solved on this level, I may get back to the idea of doing this on the AST representation. How long do you think it would take me to learn the basics of Haskell necessary for this project?
Do you have tips on which rules to follow for the "treat some elements as words" part of step3?
In the AST, it's easy. Not sure how you'd do it in the Markdown representation.
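On the Markdown side, one rough possibility is a tokenizer that matches the special constructs before falling back to ordinary words. The Python sketch below is a heuristic, not a Markdown parser: the patterns cover only single-line inline math, inline footnotes without nested brackets, and bracketed citations, and they are assumptions made for illustration.

```python
import re

# Alternatives are tried in order, so special constructs win over plain words.
TOKEN = re.compile(r"""
    \$[^$\n]+\$        # inline LaTeX math between dollar signs
  | \^\[[^\]]*\]       # inline footnote, no nested brackets
  | \[@[^\]]*\]        # bracketed citation, possibly with several keys
  | \S+                # any other run of non-whitespace characters
""", re.VERBOSE)

def tokenize(text):
    """Split text into diff units, keeping math, footnotes and citations whole."""
    return TOKEN.findall(text)

tokens = tokenize(r"so $a_0 = 50\,\%$ holds ^[see below] [@RefA; @RefB].")
```

A word-level diff run on such token lists would replace a formula or citation as a whole instead of marking changes inside it. Constructs that are glued to a preceding word without whitespace are not recognised by this sketch.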
I'm currently prototyping in a language that shall remain unnamed (because it's embarrassing ;), operating on the level of reformatted Markdown source, calling diff and wdiff as external helpers. This might eventually lead to an implementation in bash & friends. If I encounter problems that can't be solved on this level, I may get back to the idea of doing this on the AST representation. How long do you think it would take me to learn the basics of Haskell necessary for this project?
Learning how to manipulate lists and the sort of algebraic data types you have in the Pandoc AST is not very hard. (In fact, it's SO much easier to do this in Haskell than in many other languages, because of great pattern matching etc.)
That's most of what you'd be doing. The Diff library will take two lists (of any type) as input, and give you as output a list of Diff objects (each marked First, Second, or Both, and wrapping the corresponding element).
Or, it could be done whole hog with cmp if it's necessary to pick up white space:
$ echo 'foo bar' > f1
$ echo 'foobar' > f2
$ cmp -bl f1 f2
4 40 142 b
5 142 b 141 a
6 141 a 162 r
7 162 r 12 ^J
cmp: EOF on f2
but that is going to pick up absolutely everything, which may result in more noise than signal.
I implemented a version of a Markdown diff in Matlab; with a few changes, it should also run under Octave. The diff is text-based and therefore limited; it produces usable output for my specific use case, but is far from being general (which would need the AST representation). It is not ready for publication, but I can provide the code to any individual who is interested in trying it out. Moreover, I believe the basic logic is sound and can form the basis for a more general approach.
@jgm, I'm willing to give a reimplementation in Haskell a shot, and I read some introduction to it. Can you give a little starter's guide? Let's say I have converted a document into native form, which as far as I understand is Haskell code. How do I read it into ghci, and which libraries do I have to install / load to work with it?
Have you gotten this far?
% cabal update && cabal install pandoc Diff
% ghci
GHCi, version 7.10.1: http://www.haskell.org/ghc/ :? for help
Prelude> :m + Text.Pandoc
Prelude Text.Pandoc> readMarkdown def "Hi!\n\n* World"
Right (Pandoc (Meta {unMeta = fromList []}) [Para [Str "Hi!"],BulletList [[Plain [Str "World"]]]])
Prelude Text.Pandoc> let Right doc = readMarkdown def "Hi!\n\n* World"
Prelude Text.Pandoc> doc
Pandoc (Meta {unMeta = fromList []}) [Para [Str "Hi!"],BulletList [[Plain [Str "World"]]]]
Prelude Text.Pandoc> :m + Data.Algorithm.Diff
Prelude Text.Pandoc Data.Algorithm.Diff> :browse
data Diff a = First a | Second a | Both a a
getDiff :: Eq t => [t] -> [t] -> [Diff t]
getDiffBy :: (t -> t -> Bool) -> [t] -> [t] -> [Diff t]
getGroupedDiff :: Eq t => [t] -> [t] -> [Diff [t]]
getGroupedDiffBy :: (t -> t -> Bool) -> [t] -> [t] -> [Diff [t]]
Prelude Text.Pandoc Data.Algorithm.Diff> getDiff [1,2,3,5,6] [1,3,6]
[Both 1 1,First 2,Both 3 3,First 5,Both 6 6]
Prelude Text.Pandoc Data.Algorithm.Diff> getDiff [Str "Hi", Space, Str "there"] [Str "Hi", Str "there"]
[Both (Str "Hi") (Str "Hi"),First Space,Both (Str "there") (Str "there")]
@jgm, thanks, that's a good start. However, I quickly got stuck somewhere else. I managed to read a markdown file using System.IO.readFile, and now I'd like to pass its string contents to Text.Pandoc.readMarkdown, but the former gives me an IO String while the latter wants a String. Googling about it I found some explanations that converting from an IO String to a String would violate functional purity, and I have some faint idea what that might be about, but the fact remains that I can't even figure out how to parse a markdown text file.
I'm beginning to feel that implementing this in Haskell amounts to more than just putting together some snippets gathered from other people's code, and that I'd need to get into this whole Haskell programming philosophy thing. Which I'm sure is great, but seriously, it's not what I'm interested in right now. So I'd like to ask you again, do you think it makes sense for me to pursue this? Maybe dealing with the JSON representation in some plain old imperative language is the lesser evil here...
I'm learning Haskell the hard way, too, by working on parsing markdown. But it's much easier if you take advantage of the toJSONFilter function (from Text.Pandoc.JSON), which takes care of reading and writing the serialized AST so that you can walk through the structure of a doc making changes as you go. Not saying it's easy (still trying to suss out how to get from a Div to a RawBlock to its Str).
import Text.Pandoc.JSON
import Data.Char (toTitle)

main :: IO ()
main = toJSONFilter capitalizeStrings

capitalizeStrings :: Inline -> Inline
capitalizeStrings (Str s) = Str (map toTitle s)
capitalizeStrings x = x
@technocrat, thanks, but in my understanding a filter can only operate on one file. For a diff I need two input files.
I can't answer the question whether it's worth your time to get into Haskell. But here's how you can do what you were trying to do:
<$> will combine an f a with an a -> b to yield an f b, for any functor f. IO is a functor, as well as a monad. So, since readFile "my.md" is IO String and readMarkdown def is String -> Either PandocError Pandoc, readMarkdown def <$> readFile "my.md" is IO (Either PandocError Pandoc).
Note: you've still got a value in the IO monad: once you're in, you can't get out. But <$> allows you to apply a function that takes a plain String as argument across the monad boundary, on an IO String.
Closing this. It's a good idea for a separate project but doesn't belong on this tracker.
Actually your idea of using pandoc (+ maybe some other processing) to produce a canonical document in, say, Markdown, and comparing these makes quite a bit of sense. You could even use a word-level diff algorithm and skip the step of putting each sentence on a line.
I've put together a script that does this (using HTML as the canonical intermediate format), if anyone is still interested in this issue: https://github.com/davidar/pandiff
You are a benefactor to text analysts everywhere. Thank you.
Nice, you might add a link to this to the pandoc-extras wiki page.