github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Binary test file is treated as a valid contribution, giving +400000 lines of code in pull request. #5131

Closed ni4 closed 3 years ago

ni4 commented 3 years ago

Problem Description

I'm pushing a large binary test file to the repository, which should not be detected as a code contribution. However, even with linguist-vendored and linguist-generated set in .gitattributes, the PR is still displaying +408171 lines of code in the diff.

The test file is located at src/tests/data/test_fuzz_verify_detached/somelongname, and the .gitattributes file is in src/tests/data. The relevant .gitattributes line is: test_fuzz_verify_detached/* linguist-generated linguist-vendored. (I also tried changing it to * linguist-vendored and * linguist-generated, with no luck.)

URL of the affected repository:

PR: https://github.com/rnpgp/rnp/pull/1378

lildude commented 3 years ago

> PR is still displaying +408171 lines of code in the diff.

No it's not. It's reporting "changes":

[Screenshot: CleanShot 2021-01-07 at 15 27 18]

... which is not necessarily the number of lines changed and also has nothing to do with Linguist, so anything Linguist-related in .gitattributes will have no effect at all.

It's important to remember that the "Files changed" tab of a PR is essentially a nice, friendly web view of git diff. That's where the number of "changes" comes from, as we can see if we look at the diff using git alone:

$ git clone -q git@github.com:rnpgp/rnp.git
$ cd rnp
$ git switch ni4-oss-fuzz-26318-out-of-memory
Switched to branch 'ni4-oss-fuzz-26318-out-of-memory'
Your branch is up to date with 'origin/ni4-oss-fuzz-26318-out-of-memory'.
$ git diff master -- src/tests/data/test_fuzz_verify_detached/clusterfuzz-testcase-minimized-fuzz_verify_detached-5749353995829248 | head -6
diff --git src/tests/data/test_fuzz_verify_detached/clusterfuzz-testcase-minimized-fuzz_verify_detached-5749353995829248 src/tests/data/test_fuzz_verify_detached/clusterfuzz-testcase-minimized-fuzz_verify_detached-5749353995829248
new file mode 100644
index 00000000..d9e8e5c4
--- /dev/null
+++ src/tests/data/test_fuzz_verify_detached/clusterfuzz-testcase-minimized-fuzz_verify_detached-5749353995829248
@@ -0,0 +1,408016 @@ <--- 🎉 HERE'S THE SOURCE OF YOUR BIG FIGURE
$

And now you know a little more about how diffs work on GitHub and where that figure comes from 😀
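
To make the hunk header above concrete, here is a minimal Python sketch that pulls the line counts out of a unified-diff header such as `@@ -0,0 +1,408016 @@`. The function name `parse_hunk_header` is made up for illustration; it is not part of git, GitHub, or Linguist, and this is not a claim about how GitHub computes its "changes" figure.

```python
import re

# Unified-diff hunk header: "@@ -old_start[,old_len] +new_start[,new_len] @@ ..."
# When a length is omitted, it defaults to 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return the start lines and lengths from a unified-diff hunk header.

    Hypothetical helper for illustration only.
    """
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_len, new_start, new_len = match.groups()
    return {
        "old_start": int(old_start),
        "old_len": int(old_len) if old_len is not None else 1,
        "new_start": int(new_start),
        "new_len": int(new_len) if new_len is not None else 1,
    }


if __name__ == "__main__":
    # Header taken from the git diff output shown above.
    info = parse_hunk_header("@@ -0,0 +1,408016 @@")
    # A brand-new file has no old lines, so every line in the hunk is an addition.
    print(info["old_len"], info["new_len"])
```

For a newly added file the old side is `0,0`, so the whole new-side length (408016 here) counts as added lines.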

ni4 commented 3 years ago

Thanks for taking the time to answer! From the documentation I read, I got the impression that Linguist is also responsible for deciding which files are included in the diff and which are ignored. Do you have any hints on how to exclude some files from the user contributions calculation? Over 3 years I've accumulated only +128k changes on the repository's 'Contributors' page, and I don't want to add +400k on top of that :-)

lildude commented 3 years ago

Linguist can only be used to suppress files shown in the diff (i.e. show all the lines or show the "Load diff" message) or to ignore them when determining the language breakdown of a repo. It can't be used to completely ignore changed files, and I don't believe there is any way of suppressing the count on the contributions page, as you really did make those changes at the git repo level. This is a side effect of committing large binary files to a repo.

ni4 commented 3 years ago

@lildude Thanks again. Strangely, even force-setting the binary flag in .gitattributes doesn't exclude the file from the calculation. I'll look for a workaround...