Open dhimmel opened 7 years ago
For the Project Rephetio manuscript, now published in eLife, I had to create diffs to show changes in response to reviewers. I ended up enabling DOCX export (https://github.com/dhimmel/rephetio-manuscript/commit/b7b8bd3a7e0b3c5b3ad4f4f59feb28813126c756), and then using Microsoft Word to compare the documents. While manual and thus sub-optimal, this worked. We may want to consider setting BUILD_DOCX=true by default, so these past DOCX versions are automatically created.
That's good to know you were able to satisfy the journal. Did you not encounter the image embedding problems I did in #40?
I'm okay defaulting to BUILD_DOCX=true if the DOCX versions are not too broken. I think diffing manuscript.md is also appealing in the long term. I haven't tested git diff variants to know how hard it would be to diff the Markdown and color the modified text with a post-diff script.
Did you not encounter the image embedding problems I did in #40?
Well we used PNG not SVG images, so they exported to DOCX fine. But in this case the export failure would have been a feature, since the journal required images be uploaded separately!
Should we resurrect #40 to merge?
Should we resurrect #40 to merge?
@vsmalladi I'm still leaning against any heavyweight SVG export solution as these are things that really make the most sense to fix upstream. We don't want to place ourselves in a position where we have to maintain this heavy machinery.
@dhimmel that makes sense.
Here is another approach used with a GitHub-based project, the COP21 project:
https://github.com/okfn/cop21/blob/gh-pages/scripts/diff.sh https://github.com/okfn/cop21/blob/gh-pages/scripts/diff2html.py
Example output http://cop21.okfnlabs.org/diff/4-dec-vs-9-dec/
Still a bit manual for specific document versions but could likely be automated more.
Thanks, the output looks great.
Here is another approach used with a GitHub-based project, the COP21 project:
Thanks @rgieseke. To summarize, this method pipes the output of diff --unified=99999 to a Python 2 script to create an HTML view.
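A minimal Python 3 sketch of the same idea, for illustration only. It is not the COP21 script: it assumes that difflib.unified_diff with a very large context stands in for diff --unified=99999, and the function name unified_diff_to_html and the choice of <ins>/<del> tags are hypothetical.

```python
import difflib
import html


def unified_diff_to_html(old_text, new_text, context=99999):
    """Render a full-context unified diff of two texts as simple HTML.

    Hypothetical helper: added lines are wrapped in <ins>, removed lines
    in <del>, and unchanged lines are passed through untouched.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # A huge context keeps the whole document in a single hunk,
    # mimicking `diff --unified=99999`.
    diff = list(difflib.unified_diff(old_lines, new_lines, n=context, lineterm=""))
    body = []
    if not diff:
        # No differences: show the document as-is.
        body = [html.escape(line) for line in new_lines]
    # The first two entries are the '---' and '+++' file headers.
    for line in diff[2:]:
        if line.startswith("@@"):
            continue  # hunk header, not document content
        marker, text = line[:1], html.escape(line[1:])
        if marker == "+":
            body.append(f"<ins>{text}</ins>")
        elif marker == "-":
            body.append(f"<del>{text}</del>")
        else:
            body.append(text)
    return "<pre>\n" + "\n".join(body) + "\n</pre>"


if __name__ == "__main__":
    print(unified_diff_to_html("a\nb\nc", "a\nB\nc"))
```

Like the original approach, this highlights whole lines, so a one-word change in a long Markdown paragraph marks the entire paragraph as changed.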
I came across the Draftable webapp to create diffs for PDF and DOCX files. Their example showed that it worked well for diffing two arXiv PDFs. They have an API and python package for using the API. To use the API, the free tier is limited to 200 requests per month. API calls return a URL for viewing the diff.
We could potentially use this tool for creating diffs. The URLs could even be embedded into the CI logs, so you could see the changes a PR would create to the PDF output. Obviously, the whole registration / API key / quota / third-party dependency thing kind of sucks.
There may be an open source PDF diff solution that works as well, such as https://vslavik.github.io/diff-pdf/. We could even create a probot to comment on GitHub PRs with the PDF diff uploaded as an attachment.
I wrote a little notebook that will highlight the differences between two manuscript versions in the HTML and PDF. It is not pretty, but in my limited testing, it seems to do an okay job and I personally like it better than using the external tools listed above. The notebook is here, with the limitations listed at the bottom.
For example, I compared manuscript versions b8eeea542ce238bbcaf2023add2aecb86ef726bd and 5bb8dd518c1f744bbb679d76456d285058bf6b8f of meta-review.
Here is the PDF as of b8eeea542ce238bbcaf2023add2aecb86ef726bd:
Here is the PDF as of 5bb8dd518c1f744bbb679d76456d285058bf6b8f:
And here is manuscript_diff.pdf:
Which should match git diff:
@slochower nice approach.
I agree that using HTML tags to color portions of the text in the source markdown document may be the right solution. I don't think it's inelegant to put HTML in the markdown (we already do that for manuscripts in places).
However, as you note, tables and figures and some other more complex constructs might be problematic. Also I find the whole-line highlighting problematic. It would be much better to get behavior along the lines of git diff --color-words.
I think your approach of using HTML to demarcate markdown source based on git diff output is a promising direction. Were we to refine it a bit more, I think it could be appropriate for Manubot.
Also I find the whole line highlighting problematic. It would be much better to get behavior along the lines of git diff --color-words.
I agree. The issue is getting either vanilla diff or git diff to give us what we need: line numbers and (even better) character-level changes in a machine-parsable format. I went with regular diff for the proof of concept, because I could get it to easily print which lines have changed. Working with git diff required grep'ing through regular expressions @@.*@@ and was more challenging. I suppose I could always use the patch command from diff to print the original and changed lines, then manually find the differences and print both the old and new versions. I can imagine exactly how this would work if someone changes a word in a sentence, but I can also imagine that large changes could get out of control. What do you think?
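To make the two pieces discussed here concrete, the following is a rough sketch using only the Python standard library. It assumes the standard unified-diff hunk header layout (@@ -start,len +start,len @@) and uses difflib.SequenceMatcher for the word-level comparison; the names parse_hunk_header and mark_word_changes are hypothetical, and this is not the notebook's code.

```python
import difflib
import re

# Matches hunk headers such as "@@ -12,3 +12,4 @@ optional section text".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_len, new_start, new_len), or None if no match."""
    match = HUNK_RE.match(line)
    if match is None:
        return None
    old_start, old_len, new_start, new_len = match.groups()
    # An omitted length means the hunk covers a single line.
    return (int(old_start), int(old_len or 1), int(new_start), int(new_len or 1))


def mark_word_changes(old_line, new_line):
    """Wrap removed words in <del> and added words in <ins>."""
    old_words = old_line.split()
    new_words = new_line.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pieces.append(" ".join(old_words[i1:i2]))
            continue
        if i2 > i1:  # words present only in the old line
            pieces.append("<del>" + " ".join(old_words[i1:i2]) + "</del>")
        if j2 > j1:  # words present only in the new line
            pieces.append("<ins>" + " ".join(new_words[j1:j2]) + "</ins>")
    return " ".join(pieces)


if __name__ == "__main__":
    print(parse_hunk_header("@@ -12,3 +12,4 @@ Introduction"))
    print(mark_word_changes("the quick brown fox", "the slow brown fox"))
```

Splitting on whitespace collapses runs of spaces, and a heavily rewritten line will come out as one large deletion followed by one large insertion, which is the "out of control" case raised above.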
It would probably be pretty fragile, but I suppose we could simply parse the ANSI codes that do the coloring in the output of git diff --color-words.
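A sketch of what parsing those ANSI codes might look like. It assumes git's default colors (SGR code 31, red, for removed words and 32, green, for added words, each followed by a reset sequence), which a user's color configuration can change; the function name ansi_word_diff_to_html is hypothetical. Other color codes, such as those on diff headers, are ignored rather than rendered.

```python
import html
import re

# Matches SGR escape sequences such as "\x1b[31m" or the reset "\x1b[m".
ANSI_RE = re.compile(r"\x1b\[([0-9;]*)m")


def ansi_word_diff_to_html(colored):
    """Convert red/green ANSI-colored word-diff text to <del>/<ins> HTML.

    Assumption: red (31) marks removed words and green (32) marks added
    words, as in git's default diff colors.
    """
    out = []
    open_tag = None
    pos = 0
    for match in ANSI_RE.finditer(colored):
        # Emit the plain text that precedes this escape sequence.
        out.append(html.escape(colored[pos:match.start()]))
        pos = match.end()
        codes = match.group(1).split(";")
        if "31" in codes:
            new_tag = "del"
        elif "32" in codes:
            new_tag = "ins"
        elif match.group(1) in ("", "0"):
            new_tag = None  # reset sequence closes any open tag
        else:
            continue  # some other style (bold, cyan, ...): leave state alone
        if open_tag:
            out.append(f"</{open_tag}>")
        if new_tag:
            out.append(f"<{new_tag}>")
        open_tag = new_tag
    out.append(html.escape(colored[pos:]))
    if open_tag:
        out.append(f"</{open_tag}>")
    return "".join(out)


if __name__ == "__main__":
    sample = "the \x1b[31mquick\x1b[m\x1b[32mslow\x1b[m fox"
    print(ansi_word_diff_to_html(sample))
```

In practice the input would come from running git with color forced on (for example --color=always), since git normally disables color when its output is not a terminal.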
git diff --color-words algorithm: https://github.com/git/git/blob/1d89318c48d233d52f1db230cf622935ac3c69fa/diff.c#L1771-L1801
Pandoc doesn't have builtin support for diffs: jgm/pandoc#2374.
I just learned about pandiff, discussed in the thread above, and it seems amazing (Node-based).
Adding to the thread that Google Docs now has a feature to compare two documents (in Tools -> Compare documents). So we can build the DOCX output for two versions of the manuscript, upload them to Google Drive, convert them to Google Docs and use this feature.
Just another option like the LibreOffice compare documents. Still manual but some people might prefer Google Docs. The end result is a bit different too so maybe worth trying out if LibreOffice doesn't work properly.
In our experience, going through Google Docs helped with the tables. It was worth it to even upload the "diff" DOCX produced by LibreOffice, just to get the tables right. (Maybe it has to do with my version of LibreOffice on Ubuntu.)
Also, Google Docs doesn't seem to be able to print/export the track-changes in PDF except when printing from Chrome.
To add to the record here, here is a project doing diffs for JATS XML: https://github.com/milos-cuculovic/jats-diff The focus of the project seems to be more on the backend algorithm rather than any particular UI or presentation.
Oftentimes, it's important (and required in scholarly publishing) to show the changes between two versions of a manuscript. It would be ideal if Manubot users could "track changes" between two manuscript versions.
Pandoc doesn't have builtin support for diffs: https://github.com/jgm/pandoc/issues/2374. Other options would be:
Diffing manuscript.md as a text file (perhaps using diff, prettydiff, or rich-text-diff)