r-three / git-theta

git extension for {collaborative, communal, continual} model development
Apache License 2.0
206 stars 9 forks source link

Consider writing or repurposing gitattributes parsing/manipulation library #76

Open craffel opened 2 years ago

craffel commented 2 years ago

If we are doing a lot of manipulating or parsing of gitattributes files, we might want to roll that out into a separate (well-tested) library, or try to rely on another library for that if possibe.

blester125 commented 2 years ago

I took a crack at making a grammar for parsing .gitattributes lines with the lark library. I doesn't have out of the box support for outputting text from a parse tree, but if we add that we could manipulate attts with a parse -> munge tree (add a filter=theta node) -> output new text line.

One thing I noticed it that the gitattributes pattern matching is a bit more complex than cli glob matching. For example it supports [[:space:]] for a space character and automatically translates between \ in filenames to a / in patterns. Right now we are using fnmatch.fnmatchcase for pattern match checking between a file and the current git attributes so we are going to have some false negatives right now (which is ok as the behavior for no match adds a new attribute whose pattern is an exact match for the file)

?start: line

line : comment
     | pattern " " [attr (" " attr)*]

comment: "#"/.+/?

pattern : C_ESCAPED_STRING
        | UNESCAPED_PATTERN

C_ESCAPED_STRING : "\"" CHAR_SEQUENCE "\""
CHAR_SEQUENCE : CHAR+
CHAR : BASIC_CHAR
     | ESCAPE
BASIC_CHAR : /[^"\\]/
ESCAPE : "\'"
       | "\""
       | "\?"
       | "\\"
       | "\a"
       | "\b"
       | "\f"
       | "\n"
       | "\r"
       | "\t"
       | "\v"
       | /\\\d{3,}/
       | /\\o\d*/
       | /\\x\d*/
       | /\\u\d{4,}/
       | /U\d{8,}/
       | "\N{" N_CHAR_SEQUENCE "}"

N_CHAR_SEQUENCE : N_CHAR+
N_CHAR : "A".."Z"
       | "0".."9"
       | "-"
       | " "

UNESCAPED_PATTERN : /[^ \t\f\r\n"]/+

attr : set
     | unset
     | set_value

set : EFFECT
unset : "-"EFFECT
set_value : EFFECT "=" value

EFFECT : "text"
       | "eol"
       | "working-tree-encoding"
       | "ident"
       | "filter"
       | "diff"
       | "merge"
       | "whitespace"
       | "export-ignore"
       | "export-subst"
       | "delta"
       | "encoding"
       | "binary"

value : UNESCAPED_PATTERN

There is enough complexity that I think it warrants some dedicated work and testing. But it seems complex enough (and our default actions when we don't processes it as correctly as we should are reasonable) that it will end up being pretty high effort.

blester125 commented 2 years ago

WRT the python stdlib version of fnmatch not handling named character classes like [:space:] I found this library https://facelessuser.github.io/wcmatch/fnmatch/ which does handle them. So adding correct pattern matching to an abstracted gitattributes library will be a bit easier.

craffel commented 2 years ago

Cool! So just to clarify, you think it's worth writing a dedicated gitattributes parsing/modifying library, but that it will be a good amount of work? I'm also a little confused as to how writing out will work if lark doesn't support going from a parse tree back to syntax.

blester125 commented 2 years ago

writing out will work if lark doesn't support going from a parse tree back to syntax.

We would need to write something like a tree visitor that outputs the right substrings based on tree nodes, would be annoying but not too hard (esp if ok with some small visual changes like removal of extra spaces). Or we can find a library that supports round tripping out of the box.

So just to clarify, you think it's worth writing a dedicated gitattributes parsing/modifying library,

I am undecided (boo, I know). There are definitely edge cases when working with gitattributes that we are mishandling, but our fallback behavior results in creating very specific gitattribute rules which override things we miss. So the result is things are still happening correctly, just in a round-about way.

So a library would be good from the aspect of being more correct, but it wouldn't necessarily fix anything that is broken now, or protect us that much in the future.

craffel commented 2 years ago

Ok, maybe this is an if-it-ain't-broke-don't-fix-it situation - we can punt on this until after the proof of concept is done, and then revisit it later on (especially if someone is really excited about writing such a library)?