the-mikedavis / tree-sitter-diff

A tree-sitter grammar for `diff` output
https://the-mikedavis.github.io/tree-sitter-diff/
MIT License
37 stars 11 forks source link

Support `git format-patch` diffs #1

Closed ObserverOfTime closed 2 years ago

ObserverOfTime commented 2 years ago

git format-patch uses --- to separate the commit body from the summary. This parser expects that to be part of an old_file node and returns an error.

Here's a random kernel patch and the syntax tree it produces: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=da3b1c2

source [0, 0] - [36, 0]
  context [0, 0] - [0, 70]
  context [1, 0] - [1, 33]
  context [2, 0] - [2, 36]
  context [3, 0] - [3, 71]
  context [4, 0] - [4, 21]
  context [6, 0] - [6, 63]
  context [7, 0] - [7, 67]
  context [8, 0] - [8, 7]
  context [10, 0] - [10, 102]
  context [11, 0] - [11, 42]
  context [12, 0] - [12, 39]
  context [13, 0] - [13, 67]
  context [14, 0] - [14, 44]
  ERROR [15, 0] - [15, 3]
  context [16, 0] - [16, 77]
  context [17, 0] - [17, 46]
  command [19, 0] - [19, 154]
    filename [19, 11] - [19, 154]
  index [20, 0] - [20, 41]
    commit [20, 6] - [20, 19]
    commit [20, 21] - [20, 34]
    mode [20, 35] - [20, 41]
  old_file [21, 0] - [21, 75]
    filename [21, 4] - [21, 75]
  new_file [22, 0] - [22, 75]
    filename [22, 4] - [22, 75]
  location [23, 0] - [23, 29]
    linerange [23, 3] - [23, 8]
    linerange [23, 9] - [23, 14]
  context [24, 0] - [24, 64]
  context [26, 0] - [26, 18]
  deletion [27, 0] - [27, 22]
  addition [28, 0] - [28, 28]
  context [29, 0] - [29, 17]
  context [31, 0] - [31, 10]
  context [32, 0] - [32, 3]
  context [33, 0] - [33, 5]
the-mikedavis commented 2 years ago

This grammar focuses only on diff syntax. Patch syntax should be covered in a separate grammar which injects this diff grammar.

ObserverOfTime commented 2 years ago

Patch syntax is diff syntax with --- in a context line. (and some metadata which don't have to be handled by this grammar)

the-mikedavis commented 2 years ago

I can't find a formal definition for patch and/or diff (if you can find these, please do link them).

My understanding is that patch syntax is a superset of diff - it's additional metadata surrounding a diff section. There is no way to not handle part of a document with tree-sitter: everything must be parsed as something, so adding support for patch would mean parsing all of the patch syntax.

A separate tree-sitter-patch grammar would be a reasonable way to approach this, and it could parse the metadata.

ObserverOfTime commented 2 years ago

I can't find a formal definition for patch and/or diff

GNU diffutils supports several distinct formats:

Ideally, this parser would handle the first three formats. (maybe in separate grammars?)


My understanding is that patch syntax is a superset of diff

I suppose you're right as per the Kernel documentation. It's in fact a combination of mbox and unified diff.


There is no way to not handle part of a document with tree-sitter

I know, but you can parse the --- marker as a context node to avoid the error. At least until someone (maybe me?) writes a grammar for the patch format.

the-mikedavis commented 2 years ago

Yeah, the error node should be cleaned up. I have fixed the parsing to produce a deletion node though rather than context in 6d7c499bc4d26cdc527ddc2d0026d23aac14f97a, since --- may mark the deletion of a line containing only "--".

I don't use patch format myself so I am uninterested in maintaining a grammar for it. This grammar should serve as a decent example if you would like to tackle it yourself though.