kpdecker / jsdiff

A javascript text differencing implementation.
BSD 3-Clause "New" or "Revised" License

`parsePatch` should preserve "leading garbage" #454

Closed by ExplodingCabbage 1 month ago

ExplodingCabbage commented 7 months ago

Here's an example of a patch emitted by git diff:

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 20b807a..4a96aff 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -2,6 +2,8 @@

 ## Pull Requests

+bla bla bla
+
 We also accept [pull requests][pull-request]!

 Generally we like to see pull requests that
diff --git a/README.md b/README.md
index 06eebfa..40919a6 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # jsdiff

+foo
+
 [![Build Status](https://secure.travis-ci.org/kpdecker/jsdiff.svg)](http://travis-ci.org/kpdecker/jsdiff)
 [![Sauce Test Status](https://saucelabs.com/buildstatus/jsdiff)](https://saucelabs.com/u/jsdiff)

@@ -225,3 +227,5 @@ jsdiff deviates from the published algorithm in a couple of ways that don't affe

 * jsdiff keeps track of the diff for each diagonal using a linked list of change objects for each diagonal, rather than the historical array of furthest-reaching D-paths on each diagonal contemplated on page 8 of Myers's paper.
 * jsdiff skips considering diagonals where the furthest-reaching D-path would go off the edge of the edit graph. This dramatically reduces the time cost (from quadratic to linear) in cases where the new text just appends or truncates content at the end of the old text.
+
+bar

Parse it with parsePatch and you get this:

[
  {
    oldFileName: 'a/CONTRIBUTING.md',
    oldHeader: '',
    newFileName: 'b/CONTRIBUTING.md',
    newHeader: '',
    hunks: [ [Object] ]
  },
  {
    oldFileName: 'a/README.md',
    oldHeader: '',
    newFileName: 'b/README.md',
    newHeader: '',
    hunks: [ [Object], [Object] ]
  }
]

The stuff before each pair of filenames in the diff has vanished; that is, the following text appears nowhere in the object returned by parsePatch:

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 20b807a..4a96aff 100644
diff --git a/README.md b/README.md
index 06eebfa..40919a6 100644
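
To make "the stuff that vanished" concrete, here is a minimal sketch that scans a unified diff and collects every line that is neither a `---`/`+++` file header nor part of a hunk. `collectGarbage` is a hypothetical helper written for this illustration, not something jsdiff provides, and it assumes well-formed hunk headers. It counts hunk lines so that an added line which happens to start with `--- ` is not mistaken for a file header.

```javascript
// Hypothetical helper (not part of jsdiff): list the lines of a unified diff
// that sit outside file headers and hunks, i.e. the "garbage".
function collectGarbage(patchText) {
  const lines = patchText.split('\n');
  const garbage = [];
  let oldLeft = 0;           // old-file lines still expected in the current hunk
  let newLeft = 0;           // new-file lines still expected in the current hunk
  let afterHunkLine = false; // true right after a hunk body line

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];

    // "\ No newline at end of file" markers belong to the preceding hunk line.
    if (line.startsWith('\\') && afterHunkLine) continue;

    if (oldLeft > 0 || newLeft > 0) {
      if (line.startsWith('-')) oldLeft--;
      else if (line.startsWith('+')) newLeft--;
      else { oldLeft--; newLeft--; } // context line
      afterHunkLine = true;
      continue;
    }
    afterHunkLine = false;

    // Hunk header; an omitted count means 1.
    const hunk = /^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@/.exec(line);
    if (hunk) {
      oldLeft = hunk[1] === undefined ? 1 : Number(hunk[1]);
      newLeft = hunk[2] === undefined ? 1 : Number(hunk[2]);
      continue;
    }

    // A file header is a "--- " line immediately followed by a "+++ " line.
    if (line.startsWith('--- ') && (lines[i + 1] || '').startsWith('+++ ')) {
      i++;
      continue;
    }

    // Ignore the empty string produced by a final newline.
    if (i === lines.length - 1 && line === '') continue;

    garbage.push(line);
  }
  return garbage;
}

const samplePatch = [
  'diff --git a/x.txt b/x.txt',
  'index 1111111..2222222 100644',
  '--- a/x.txt',
  '+++ b/x.txt',
  '@@ -1,2 +1,3 @@',
  ' one',
  '+--- not a header',
  ' two',
  ''
].join('\n');

console.log(collectGarbage(samplePatch));
// [ 'diff --git a/x.txt b/x.txt', 'index 1111111..2222222 100644' ]
```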

If all we want to do with the parsed patch is apply it, this is probably fine. Content in this part of a unified patch file doesn't seem to follow any kind of consistent, specced format and doesn't affect how the patch is actually applied, and is consequently referred to by the patch man page as "leading garbage"(!). But if we want to tweak and reserialize a patch while leaving the garbage unchanged (perhaps for the sake of some other tool that in some way appreciates the garbage), then discarding the garbage upon parsing breaks our ability to do that.

It would therefore be desirable, if possible, to preserve the leading and trailing garbage (perhaps even in leadingGarbage and trailingGarbage properties, just to be totally clear that as far as we're concerned it's just arbitrary text that happened to be in the patch and has no semantics).
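
As a rough sketch of what the reserializing side could look like if the garbage were preserved: the function below takes objects in roughly the shape shown above, plus a `leadingGarbage` string on each file, and writes the garbage back out verbatim ahead of the file headers. Everything here is an assumption for illustration: `formatWithGarbage` is a hypothetical name, jsdiff does not set `leadingGarbage`, passing `trailingGarbage` as a second argument is just one possible placement, and the hunk fields (`oldStart`, `oldLines`, `newStart`, `newLines`, `lines`) are assumed to hold what their names suggest.

```javascript
// Hypothetical serializer (not jsdiff's formatPatch). The garbage is treated
// as opaque text: it is emitted unchanged and never interpreted.
function formatWithGarbage(files, trailingGarbage = '') {
  const out = [];
  for (const file of files) {
    if (file.leadingGarbage) out.push(file.leadingGarbage);
    out.push('--- ' + file.oldFileName + (file.oldHeader ? '\t' + file.oldHeader : ''));
    out.push('+++ ' + file.newFileName + (file.newHeader ? '\t' + file.newHeader : ''));
    for (const hunk of file.hunks) {
      out.push(`@@ -${hunk.oldStart},${hunk.oldLines} +${hunk.newStart},${hunk.newLines} @@`);
      out.push(...hunk.lines); // each line keeps its ' ', '+' or '-' prefix
    }
  }
  if (trailingGarbage) out.push(trailingGarbage);
  return out.join('\n') + '\n';
}

const readmeFile = {
  // Assumed property: the text that preceded the "---" line in the input.
  leadingGarbage: 'diff --git a/README.md b/README.md\nindex 06eebfa..40919a6 100644',
  oldFileName: 'a/README.md',
  oldHeader: '',
  newFileName: 'b/README.md',
  newHeader: '',
  hunks: [
    { oldStart: 1, oldLines: 2, newStart: 1, newLines: 3, lines: [' # jsdiff', '+foo', ' '] }
  ]
};

console.log(formatWithGarbage([readmeFile]));
```

Even this simple version is not a faithful round trip in general: it always writes both counts in the hunk header and drops any text that followed the closing `@@`.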

ExplodingCabbage commented 1 month ago

Not planned any more for reasons outlined in https://github.com/kpdecker/jsdiff/pull/522#issuecomment-2186725674. This is actually super-complicated to do in a non-shit way.