manubot / rootstock

Clone me to create your Manubot manuscript
https://manubot.github.io/rootstock/

Creating a diff between two manuscript versions #54

Open dhimmel opened 7 years ago

dhimmel commented 7 years ago

Oftentimes, it's important (and required in scholarly publishing) to show the changes between two versions of a manuscript. It would be ideal if Manubot users could "track changes" between two manuscript versions.

Pandoc doesn't have built-in support for diffs: https://github.com/jgm/pandoc/issues/2374. Other options would be:

  1. Exporting to LaTeX and using latexdiff
  2. Exporting to DOCX and using LibreOffice's Compare Document feature (currently not accessible via the command line)
  3. Exporting to ODT and using oodiff
  4. Diffing manuscript.md as a text file (perhaps using diff, prettydiff, or rich-text-diff)
  5. Using GitHub's rich diff view preview or react-rich-diff

dhimmel commented 7 years ago

For the Project Rephetio manuscript, now published in eLife, I had to create diffs to show changes in response to reviewers. I ended up enabling DOCX export (https://github.com/dhimmel/rephetio-manuscript/commit/b7b8bd3a7e0b3c5b3ad4f4f59feb28813126c756), and then using Microsoft Word to compare the documents. While manual and thus sub-optimal, this worked. We may want to consider setting BUILD_DOCX=true by default, so these past DOCX versions are automatically created.

agitter commented 7 years ago

It's good to know that you were able to satisfy the journal. Did you not encounter the image embedding problems I did in #40?

I'm okay defaulting to BUILD_DOCX=true if the DOCX versions are not too broken. I think diffing manuscript.md is also appealing in the long term. I haven't tested git diff variants to know how hard it would be to diff the Markdown and color the modified text with a post-diff script.

dhimmel commented 7 years ago

> Did you not encounter the image embedding problems I did in #40?

Well, we used PNG rather than SVG images, so they exported to DOCX fine. But in this case the export failure would have been a feature, since the journal required images to be uploaded separately!

vsmalladi commented 7 years ago

Should we resurrect #40 to merge?

dhimmel commented 7 years ago

> Should we resurrect #40 to merge?

@vsmalladi I'm still leaning against any heavyweight SVG export solution, as these are things that really make the most sense to fix upstream. We don't want to put ourselves in a position where we have to maintain this heavy machinery.

vsmalladi commented 7 years ago

@dhimmel that makes sense.

rgieseke commented 7 years ago

Here is another approach used with a GitHub-based project, the COP21 project:

https://github.com/okfn/cop21

https://github.com/okfn/cop21/blob/gh-pages/scripts/diff.sh https://github.com/okfn/cop21/blob/gh-pages/scripts/diff2html.py

Example output http://cop21.okfnlabs.org/diff/4-dec-vs-9-dec/

Still a bit manual for specific document versions but could likely be automated more.

agitter commented 7 years ago

Thanks, the output looks great.

dhimmel commented 7 years ago

> Here is another approach used with a GitHub-based project, the COP21 project:

Thanks @rgieseke. To summarize, this method pipes the output of diff --unified=99999 to a Python 2 script to create an HTML view.
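
The same idea can be sketched with Python 3's standard library alone. This is not the COP21 script: the function name `full_context_diff_to_html` and the choice of `<del>`/`<ins>` tags inside a `<pre>` block are assumptions made for illustration. It uses `difflib.unified_diff` with a context size as large as the document, which plays the role of `--unified=99999`.

```python
import difflib
import html


def full_context_diff_to_html(old_text: str, new_text: str) -> str:
    """Render a whole-document line diff as HTML (hypothetical helper)."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # A context at least as long as the document keeps every unchanged line,
    # so the result is a single hunk covering the full text.
    context = max(len(old_lines), len(new_lines))
    diff = list(
        difflib.unified_diff(old_lines, new_lines, n=context, lineterm="")
    )
    if not diff:
        # unified_diff yields nothing when the inputs are identical.
        body = [html.escape(line) for line in new_lines]
    else:
        body = []
        # The first two lines are the "---" and "+++" file headers.
        for line in diff[2:]:
            if line.startswith("@@"):
                continue  # hunk header, not document content
            marker, content = line[:1], html.escape(line[1:])
            if marker == "-":
                body.append(f"<del>{content}</del>")
            elif marker == "+":
                body.append(f"<ins>{content}</ins>")
            else:
                body.append(content)
    return "<pre>\n" + "\n".join(body) + "\n</pre>"


if __name__ == "__main__":
    old = "Title\n\nOld sentence.\nShared."
    new = "Title\n\nNew sentence.\nShared."
    print(full_context_diff_to_html(old, new))
```

Like the original approach, this marks whole lines, so a one-word change in a long Markdown paragraph highlights the entire paragraph.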

dhimmel commented 7 years ago

Draftable

I came across the Draftable webapp for creating diffs of PDF and DOCX files. Their example showed that it worked well for diffing two arXiv PDFs. They have an API and a Python package for using the API. The free tier of the API is limited to 200 requests per month. API calls return a URL for viewing the diff.

We could potentially use this tool for creating diffs. The URLs could even be embedded into the CI logs, so you could see the changes a PR would create to the PDF output. Obviously, the whole registration / API key / quota / third-party dependency thing kind of sucks.

There may be an open-source PDF diff solution that works just as well, such as https://vslavik.github.io/diff-pdf/. We could even create a probot to comment on GitHub PRs with the PDF diff uploaded as an attachment.

slochower commented 6 years ago

I wrote a little notebook that will highlight the differences between two manuscript versions in the HTML and PDF. It is not pretty, but in my limited testing it seems to do an okay job, and I personally like it better than using the external tools listed above. The notebook is here, with the limitations listed at the bottom.

For example, I compared manuscript versions b8eeea542ce238bbcaf2023add2aecb86ef726bd and 5bb8dd518c1f744bbb679d76456d285058bf6b8f of meta-review.

Here is the PDF as of b8eeea542ce238bbcaf2023add2aecb86ef726bd: [screenshot]

Here is the PDF as of 5bb8dd518c1f744bbb679d76456d285058bf6b8f: [screenshot]

And here is manuscript_diff.pdf: [screenshot]

Which should match git diff: [screenshot]

dhimmel commented 6 years ago

@slochower nice approach.

I agree that using HTML tags to color portions of the text in the source markdown document may be the right solution. I don't think it's inelegant to put HTML in the markdown (we already do that for manuscripts in places).

However, as you note, tables and figures and some other more complex constructs might be problematic. Also I find the whole line highlighting problematic. It would be much better to get behavior along the lines of git diff --color-words.

I think your approach of using HTML to demarcate markdown source based on git diff output is a promising direction. Were we to refine it a bit more, I think it could be appropriate for Manubot.
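
As a rough illustration of word-level marking (as opposed to whole-line highlighting), here is a minimal sketch that compares two versions of a paragraph token by token and wraps the changes in HTML tags. It is not the notebook's code and does not call git: the function name `mark_word_changes` and the use of `<del>`/`<ins>` are assumptions, and the comparison uses Python's `difflib.SequenceMatcher`, which may group changes differently from git diff --color-words.

```python
import difflib
import re


def mark_word_changes(old: str, new: str) -> str:
    """Wrap removed and added words in <del>/<ins> tags (hypothetical helper)."""
    # Tokenize into runs of whitespace and runs of non-whitespace so the
    # original spacing survives when the tokens are joined back together.
    old_tokens = re.findall(r"\s+|\S+", old)
    new_tokens = re.findall(r"\s+|\S+", new)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = "".join(old_tokens[i1:i2])
        added = "".join(new_tokens[j1:j2])
        if op == "equal":
            out.append(added)
            continue
        if op in ("delete", "replace"):
            out.append(f"<del>{removed}</del>")
        if op in ("insert", "replace"):
            out.append(f"<ins>{added}</ins>")
    return "".join(out)


if __name__ == "__main__":
    print(mark_word_changes("The quick brown fox", "The quick red fox"))
    # The quick <del>brown</del><ins>red</ins> fox
```

The output is still Markdown with inline HTML, so it can be passed through the usual build. Changes that span tables, figures, or other block constructs would likely need special handling, as noted above.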

slochower commented 6 years ago

> Also I find the whole line highlighting problematic. It would be much better to get behavior along the lines of git diff --color-words.

I agree. The issue is getting either vanilla diff or git diff to give us what we need: line numbers and (even better) character-level changes in a machine-parsable format. I went with regular diff for the proof of concept, because I could easily get it to print which lines have changed. Working with git diff required grepping for hunk headers with the regular expression @@.*@@ and was more challenging. I suppose I could always use the patch command from diff to print the original and changed lines, then manually find the differences and print both the old and new versions. I can imagine exactly how this would work if someone changes a word in a sentence, but I can also imagine that large changes could get out of control. What do you think?
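
For the line-number part, the hunk headers in unified diff output follow the pattern `@@ -old_start,old_count +new_start,new_count @@`, where a count of 1 may be omitted. A minimal sketch of extracting them in Python follows; the names `HUNK_HEADER`, `Hunk`, and `parse_hunks` are made up for this example, and combined diffs from merges (which use `@@@`) are not handled.

```python
import re
from typing import List, NamedTuple

# Matches unified-diff hunk headers such as "@@ -3 +3 @@" or "@@ -10,2 +10,3 @@".
HUNK_HEADER = re.compile(
    r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@", re.MULTILINE
)


class Hunk(NamedTuple):
    old_start: int
    old_count: int
    new_start: int
    new_count: int


def parse_hunks(diff_text: str) -> List[Hunk]:
    """Return the line ranges touched by each hunk (hypothetical helper)."""
    hunks = []
    for match in HUNK_HEADER.finditer(diff_text):
        old_start, old_count, new_start, new_count = match.groups()
        hunks.append(
            Hunk(
                int(old_start),
                # An omitted count means the hunk covers exactly one line.
                int(old_count) if old_count is not None else 1,
                int(new_start),
                int(new_count) if new_count is not None else 1,
            )
        )
    return hunks


if __name__ == "__main__":
    sample = (
        "--- a/manuscript.md\n"
        "+++ b/manuscript.md\n"
        "@@ -3 +3 @@ heading\n"
        "-old\n"
        "+new\n"
        "@@ -10,2 +10,3 @@\n"
        " a\n"
        "+b\n"
        " c\n"
    )
    for hunk in parse_hunks(sample):
        print(hunk)
```

Running git diff with --unified=0 should make each hunk cover only changed lines, which keeps the reported ranges tight.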

It would probably be pretty fragile, but I suppose we could simply parse the ANSI codes that do the coloring in the output of git diff --color-words.
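
A sketch of what that parsing might look like, assuming git's default colors (red, SGR code 31, for removed words and green, code 32, for added words) and treating any other code, including the reset, as the end of a colored span. Users can reconfigure these colors, which is part of why this is fragile. The function name `ansi_words_to_html` is hypothetical, and the input is assumed to be a content line from the diff rather than a header line.

```python
import re

# ANSI "select graphic rendition" sequences, e.g. "\x1b[31m" or the reset "\x1b[m".
ANSI_SGR = re.compile(r"\x1b\[([0-9;]*)m")

# Assumption: git's default palette, red for old text and green for new text.
TAGS = {"31": "del", "32": "ins"}


def ansi_words_to_html(colored: str) -> str:
    """Convert red/green ANSI spans to <del>/<ins> tags (hypothetical helper)."""
    out = []
    tag = None  # tag for the text currently being read, if any
    pos = 0
    for match in ANSI_SGR.finditer(colored):
        text = colored[pos:match.start()]
        if text:
            out.append(f"<{tag}>{text}</{tag}>" if tag else text)
        # Codes other than plain red or green (including reset) clear the tag.
        tag = TAGS.get(match.group(1))
        pos = match.end()
    tail = colored[pos:]
    if tail:
        out.append(f"<{tag}>{tail}</{tag}>" if tag else tail)
    return "".join(out)


if __name__ == "__main__":
    line = "The quick \x1b[31mbrown\x1b[m\x1b[32mred\x1b[m fox"
    print(ansi_words_to_html(line))
    # The quick <del>brown</del><ins>red</ins> fox
```

When piping git's output into a script, --color=always may be needed to keep the escape codes. git diff also offers --word-diff=porcelain, a line-based format intended for script consumption, which might be a less fragile starting point than parsing colors.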

slochower commented 6 years ago

git diff --color-words algorithm: https://github.com/git/git/blob/1d89318c48d233d52f1db230cf622935ac3c69fa/diff.c#L1771-L1801

rgieseke commented 5 years ago

> Pandoc doesn't have builtin support for diffs: jgm/pandoc#2374.

I just learned about pandiff, discussed in the thread above, and it seems amazing (Node-based).

https://github.com/davidar/pandiff

jmonlong commented 5 years ago

Adding to the thread that Google Docs now has a feature to compare two documents (in Tools -> Compare documents). So we can build the DOCX output for two versions of the manuscript, upload them to Google Drive, convert them to Google Docs and use this feature.

Just another option like the LibreOffice compare documents. Still manual but some people might prefer Google Docs. The end result is a bit different too so maybe worth trying out if LibreOffice doesn't work properly.

In our experience, going through Google Docs helped with the tables. It was even worth uploading the "diff" DOCX produced by LibreOffice, just to get the tables right. (Maybe it has to do with my version of LibreOffice on Ubuntu.)

Also, Google Docs doesn't seem to be able to print or export the tracked changes to PDF except when printing from Chrome.

castedo commented 2 years ago

For the record, here is a project doing diffs for JATS XML: https://github.com/milos-cuculovic/jats-diff. The focus of the project seems to be more on the backend algorithm than on any particular UI or presentation.