Error when `user.name` or `user.email` are not set for staged commits

Hi, gitlint is great, I'm trying to convert a project over to use it now (replacing a handful of shell one-liners).

Some checks in the project that I'm trying to convert are to verify that the developer has a name and email set in git. I am not 100% convinced that these checks are valuable but to test them I have been commenting out these variables in my git config file. When I comment out either or both and run gitlint, I get:

An error occurred while executing '/usr/bin/git config --get user.name': b''

I guess gitlint assumes that name and email are set. For my purposes, that's fine because I'm replacing checks that also ensure that these variables are set. However, I think the error message could be a little clearer.

I'm running gitlint through pre-commit using a command line like ~/.local/bin/pre-commit run --hook-stage commit-msg --commit-msg-filename test.msg.

/home/ethan/.cache/pre-commit/repok43dfskl/py_env-python3/bin/gitlint --debug --msg-filename test.msg
DEBUG: gitlint.cli To report issues, please visit https://github.com/jorisroovers/gitlint/issues
DEBUG: gitlint.cli Platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.29
DEBUG: gitlint.cli Python version: 3.8.2 (default, Jul 16 2020, 14:00:26) 
[GCC 9.3.0]
DEBUG: gitlint.git ('--version',)
DEBUG: gitlint.cli Git version: git version 2.25.1
DEBUG: gitlint.cli Gitlint version: 0.14.0dev
DEBUG: gitlint.cli GITLINT_USE_SH_LIB: [NOT SET]
DEBUG: gitlint.cli DEFAULT_ENCODING: UTF-8
DEBUG: gitlint.cli Configuration
config-path: /home/ethan/[projectname]/.gitlint
[GENERAL]
extra-path: None
contrib: []
ignore: title-max-length,title-must-not-contain-words,body-max-line-length,body-min-length,body-is-missing
ignore-merge-commits: True
ignore-fixup-commits: True
ignore-squash-commits: True
ignore-revert-commits: True
ignore-stdin: False
staged: True
verbosity: 3
debug: True
target: /home/ethan/Jobs/Teachable/fedora
[RULES]
  I1: ignore-by-title
     ignore=all
     regex=None
  I2: ignore-by-body
     ignore=all
     regex=None
  I3: ignore-body-lines
     regex=None
  T1: title-max-length
     line-length=72
  T2: title-trailing-whitespace
  T6: title-leading-whitespace
  T3: title-trailing-punctuation
  T4: title-hard-tab
  T5: title-must-not-contain-word
     words=WIP
  T7: title-match-regex
     regex=.*
  B1: body-max-line-length
     line-length=80
  B5: body-min-length
     min-length=20
  B6: body-is-missing
     ignore-merge-commits=True
  B2: body-trailing-whitespace
  B3: body-hard-tab
  B4: body-first-line-empty
  B7: body-changed-file-mention
     files=
  B8: body-match-regex
     regex=None
  M1: author-valid-email
     regex=[^@ ]+@[^@ ]+\.[^@ ]+

DEBUG: gitlint.cli Fetching additional meta-data from staged commit
DEBUG: gitlint.cli Using --msg-filename.
DEBUG: gitlint.git ('config', '--get', 'core.commentchar')
DEBUG: gitlint.cli Linting 1 commit(s)
DEBUG: gitlint.lint Linting commit [SHA UNKNOWN]
DEBUG: gitlint.git ('config', '--get', 'user.name')
An error occurred while executing '/usr/bin/git config --get user.name': b''

Arguably, gitlint should be robust against these variables being missing because git will try to autodetect them if they are missing, as well as use environment variables like GIT_AUTHOR_NAME when present which are not reflected by git config. However, if you fix this behavior, I would love to see a gitlint rule that verifies that the user name is set :)

Thanks for opening the issue :-)

So this can only occur when linting a staged commit, since gitlint will otherwise take the username from the git log which is independent of your current config.

As you experienced, in case of staged commits, gitlint falls back on git config user.name (same for user.email). As an aside, this is relatively new behavior - prior to 0.13 gitlint would just not have this meta-information for staged commits.

I fully agree gitlint can do a better job at falling back to GIT_AUTHOR_NAME or something else, I'll work on that (at some point...). As always, it's a bit of trade-off - I don't think gitlint should completely re-implement git's logic around this, that's likely to be a rabbit hole.

I do wonder whether it's better to hard fail (like is currently the case) or silently fail (i.e. username will be None or empty and just not usuable by rules). I think a hard fail might actually be better, at least then it's very clear something's wrong.

For the record, the most concise way to reproduce this:

# inside a git repo:
git config --unset user.name
echo "foo" > bar.txt
gitlint --staged --msg-filename bar.txt

Wrt implementing a rule to implement that user.name is set, that's fairly straight-forward using a user-defined rule.

I've wanted to add it to gitlint's built-in rules before, just haven't gotten around to it. I'll happily accept a PR for this :-) (see contributing). Longer term, I think it would be nice to have a more generic rule that can validate any git config value.

I think a hard fail is best too. That kind of precludes writing a rule that checks that it's set.

Can I interest you in a simple PR that just produces a cleaner error message when these variables are missing?

I dug in a little more and it looks like the call to author_name and author_email is triggered by a debug call in lint. If it weren't for that call I think everything would have succeeded. This feels a little brittle to me. If we are going to do a hard-fail then I think we should always do it, regardless of whether we get converted to Unicode or not. The obvious way to do this would be to invoke these properties on construction of the StagedGitCommit but that necessitates other changes to the tests and I'm not sure that's worth it.

The git config --unset user.name within a repository only works if there isn't a global setting in e.g. ~/.config/git/config. I'm not sure how to trigger it as part of an integration test. Trying git config --add user.name '' just means that the name field is an empty string, which is different from failing to retrieve it if it's completely unset (as above).

Again, thanks for creating that PR - just merged it.

I dug in a little more and it looks like the call to author_name and author_email is triggered by a debug call in lint. If it weren't for that call I think everything would have succeeded. This feels a little brittle to me. If we are going to do a hard-fail then I think we should always do it, regardless of whether we get converted to Unicode or not. The obvious way to do this would be to invoke these properties on construction of the StagedGitCommit but that necessitates other changes to the tests and I'm not sure that's worth it.

Ok, so I think there's a few things at play here.

Since the last version of gitlint, we do lazy evaluation of git properties, i.e. for the most part we don't actually retrieve information from git until it's being used in some way (we also cache them). This prevents us from doing "expensive" extraneous calls to git to fetch information that might not end up being used. While this isn't so important when linting a single commit, it can add up when linting a long range of commits using --commits. For staged commits, it might be debateable whether that benefit is still there, but in I believe the principle of minimizing interaction with git to when it's needed is an overall good design goal.
In principle, this should be true for fetching the username and email as well: we don't actually fetch it until we need it or we want to check a rule against it. This allows users to selectively disable the specific rule and allow repos without user.name or user.email set if they wish so. Gitlint shouldn't raise an error in that case because it shouldn't be fetching that information in the first place. As called out before, we can use an M2:author-valid-name rule that enforces that a user.name is actually set and maybe matches a regex (similar to M1:author-valid-email which exists today).
You make an excellent point about the brittleness of it being exposed via a debug log: that's not something I had considered and IMO that needs to change. As I mentioned in the previous point: debug shouldn't be accessing that property at all unless gitlint is using it somewhere else (i.e. in the M2:author-valid-name rule). I think the bigger point here is that all debug logging should be lazy as well: we shouldn't be doing any fetching of git info or properties unless --debug is actually set. I think we'll need to do a proper "audit" of the code to come up with a solution and fix this everywhere.

jorisroovers / gitlint

Error when `user.name` or `user.email` are not set for staged commits #148