github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
12.32k stars 4.27k forks

Detect .ply as binary #6973

Closed: Walther closed this 3 months ago

Walther commented 3 months ago

PLY (file format)

PLY is a computer file format known as the Polygon File Format or the Stanford Triangle Format. It was principally designed to store three-dimensional data from 3D scanners. The data storage format supports a relatively simple description of a single object as a list of nominally flat polygons. A variety of properties can be stored, including color and transparency, surface normals, texture coordinates and data confidence values. The format permits one to have different properties for the front and back of a polygon.

There are two versions of the file format, one in ASCII, the other in binary.

Currently, GitHub code search finds over 50 thousand .ply files. These are treated as code and show up in statistics for lines added, pull request change sizes, and so on.

Even though the files can technically be ASCII, they are almost universally generated as output by 3D modeling programs and consumed as input by 3D rendering programs, and they can easily run to millions of lines.
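To illustrate the two encodings and the line counts involved, here is a minimal Python sketch that reads a PLY header and estimates how many lines an ASCII file would contain. It relies on the usual PLY layout: a text header that starts with "ply", carries a "format" line and "element <name> <count>" lines, and ends with "end_header". The function names read_ply_header and estimated_ascii_lines, and the sample data, are hypothetical and exist only for this illustration. The estimate assumes the common convention that an ASCII writer emits one element per line.

```python
import io
from typing import Dict, Tuple


def read_ply_header(data: bytes) -> Tuple[str, Dict[str, int], int]:
    """Return (encoding, element counts, header line count) for PLY data.

    Hypothetical helper: a sketch for illustration, not a full PLY parser.
    """
    header = []
    # The header is plain text even when the body is binary, so it can be
    # read line by line until "end_header" is reached.
    for raw in io.BytesIO(data):
        line = raw.decode("ascii", errors="replace").strip()
        header.append(line)
        if line == "end_header":
            break
    else:
        raise ValueError("missing 'end_header' line")
    if header[0] != "ply":
        raise ValueError("missing 'ply' magic line")

    encoding = ""
    counts: Dict[str, int] = {}
    for line in header:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "format":
            # e.g. "ascii", "binary_little_endian", "binary_big_endian"
            encoding = fields[1]
        elif len(fields) == 3 and fields[0] == "element":
            counts[fields[1]] = int(fields[2])
    return encoding, counts, len(header)


def estimated_ascii_lines(counts: Dict[str, int], header_lines: int) -> int:
    """Estimate total lines, assuming one element per line in the body."""
    return header_lines + sum(counts.values())


# A tiny hand-written sample: one triangle.
sample = b"""ply
format ascii 1.0
element vertex 3
property float x
property float y
property float z
element face 1
property list uchar int vertex_indices
end_header
0 0 0
1 0 0
0 1 0
3 0 1 2
"""

encoding, counts, header_lines = read_ply_header(sample)
total_lines = estimated_ascii_lines(counts, header_lines)
```

Under that assumption, a scanned model with a few million vertices and faces produces a file with a few million lines, which is where the inflated "lines added" figures come from.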

Unfortunately, even the various workarounds proposed in the documentation and other relevant issues do not help. None of the following .gitattributes settings has any effect:

# attempt 1
*.ply binary
# attempt 2
*.ply binary
ply/** linguist-vendored
# attempt 3
ply/**/*.ply binary linguist-generated
# attempt 4
ply/**/*.ply -diff -merge -text linguist-generated

A pull request with some test models added shows up as 3 million lines of code added, potentially ruining any statistics and insights over time.

lildude commented 3 months ago

Unfortunately, there is no way of marking a file as binary, and this has nothing to do with Linguist, which is only used to detect the language; GitHub does not read the binary git attribute.

The closest you can get with Linguist is to mark the file as generated:

*.ply linguist-generated

This will only suppress the content in diffs and exclude the files from the language statistics.

That said, GitHub won't render the content of very large files in the diff by default anyway, so the only benefit you'll get is that the files won't be counted in the language stats. This won't, however, change the diff stats, as these are based on the actual content of the file.

Interestingly, Linguist doesn't even know about this language, so it won't appear in the language stats of a repo or get syntax highlighting. If you've got the time, we're happy to accept a PR that adds support for the language and marks the files as generated by default. This

Walther commented 3 months ago

Unfortunately, this does not seem to have any effect.

[Screenshot: gitattributes ply, 2024-07-30 at 12:17:06]

Compare to a glTF .bin file in another PR:

[Screenshot: gltf-bin, 2024-07-30 at 12:11:41]

lildude commented 3 months ago

"Unfortunately, this does not seem to have any effect."

Indeed. I stated this...

"This won't, however, change the diff stats, as these are based on the actual content of the file."