github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Protocol Buffers are detected as Pure Basic #3816

Closed vmarkovtsev closed 6 years ago

vmarkovtsev commented 7 years ago

Google Protocol Buffers *.pb files seem to be detected as Pure Basic. For example, https://github.com/bblfsh/libuast

$ linguist .
90.07%  PureBasic
5.68%   Java
3.24%   Python
0.49%   C
0.36%   C++
0.12%   CMake
0.03%   Go
0.01%   Shell

pchaigno commented 7 years ago

Thanks for reporting this and for manually checking :-)

Linguist currently doesn't recognize .pb as a Protocol Buffer extension (only .proto). We might have to add it. I don't know much about Protocol Buffer, but the format of these files is surprising: it looks very different from what we currently have registered as Protocol Buffer files. What's the difference between these files? Structure description vs. data files?

vmarkovtsev commented 7 years ago

Exactly, those are the binary files with data. The schema is in .proto.

A quick fix would be to scan for non-ASCII chars in .pb and discard binary files.

pchaigno commented 7 years ago

These files should normally be detected as binary. For that, Linguist relies on Charlock Holmes. Charlock Holmes relies on a few heuristic rules for very common formats and falls back to the same detection strategy as git itself for other formats. That last strategy looks for NULL bytes in files. Since your .pb files don't contain any NULL bytes (I checked manually), they're not detected as binary files.
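To illustrate why such files slip through, here is a minimal Python sketch of the NUL-byte fallback described above. It is not Linguist's or Charlock Holmes's actual code: the function name, the sample byte strings, and the 8000-byte sniff window (the size git is understood to inspect) are assumptions made for illustration.

```python
# Assumption: only the leading chunk of the file is inspected, as git does.
SNIFF_BYTES = 8000


def looks_binary_by_nul(data: bytes, sniff: int = SNIFF_BYTES) -> bool:
    """Return True if a NUL byte appears in the first `sniff` bytes."""
    return b"\x00" in data[:sniff]


# Hypothetical samples, not taken from the repository in question.
text_like = b'name: "example"\nid: 42\n'
binary_with_nul = b"\x08\x96\x01\x00\x12\x03abc"
binary_without_nul = b"\x08\x96\x01\x12\x03abc"

for label, sample in [
    ("text", text_like),
    ("binary with NUL", binary_with_nul),
    ("binary without NUL", binary_without_nul),
]:
    print(label, looks_binary_by_nul(sample))
```

The third sample is the interesting one: it is clearly not text, but because it contains no NUL byte, a check of this kind would still treat it as text.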

To mitigate this issue locally, you could use Linguist overrides and mark these files as vendored (for example, with a .gitattributes line such as "*.pb linguist-vendored"). That way, they will no longer appear in the language statistics.

Alhadis commented 7 years ago

@pchaigno This isn't an isolated case, either. See the discussion at google/fonts#1094; it seems there's a "Text-based version" of the format that's not being accommodated.

@vmarkovtsev Do you know anything about a formal "text-based" version of Protocol Buffer? I'm not at all familiar with Protobuf, I'm afraid.

vmarkovtsev commented 7 years ago

@Alhadis I have never heard of a text-based PB; there is an encoding specification, and it is binary.

vmarkovtsev commented 7 years ago

I have found https://stackoverflow.com/questions/18873924/what-does-the-protobuf-text-format-look-like

Update: according to the docs, it does exist.

Alhadis commented 7 years ago

Ah right, I remember reading about that when digging up info on the text-based version. Identifying binary-type protobuf shouldn't be a problem. We can add a heuristic to check for control characters and non-ASCII characters. When you say the latter, you are referring to the codepoint range 0x80–0xFF, yes...?

vmarkovtsev commented 7 years ago

Yep, 128-255. At the same time, PureBasic supports UTF-8 encoded source files and requires the BOM in that case.
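A rough Python sketch of the heuristic discussed above might look like the following. It is only an illustration under stated assumptions, not what Linguist implements: the function names are hypothetical, tab, line feed, and carriage return are assumed to be the only control characters allowed in text, and a file that starts with a UTF-8 BOM and decodes cleanly is assumed to be text so that BOM-prefixed PureBasic sources are not misclassified.

```python
UTF8_BOM = b"\xef\xbb\xbf"
# Assumption: these control characters are normal in text files.
TEXT_CONTROLS = {0x09, 0x0A, 0x0D}


def is_control(code: int) -> bool:
    """Return True for ASCII control characters other than tab, LF, and CR."""
    return (code < 0x20 and code not in TEXT_CONTROLS) or code == 0x7F


def looks_like_binary_pb(data: bytes) -> bool:
    """Guess whether `data` is binary, using control and high-byte checks."""
    if data.startswith(UTF8_BOM):
        # BOM present: accept non-ASCII content only if it is valid UTF-8.
        try:
            text = data[len(UTF8_BOM):].decode("utf-8")
        except UnicodeDecodeError:
            return True
        return any(is_control(ord(ch)) for ch in text if ord(ch) < 0x80)
    # No BOM: any control character or byte in 0x80-0xFF counts as binary.
    return any(is_control(byte) or byte >= 0x80 for byte in data)


# Hypothetical samples for illustration.
print(looks_like_binary_pb(b"\x08\x96\x01\x12\x03abc"))
print(looks_like_binary_pb(b'name: "example"\nid: 42\n'))
print(looks_like_binary_pb(UTF8_BOM + "; café\n".encode("utf-8")))
```

One consequence of this design is that UTF-8 text without a BOM is flagged as binary whenever it contains non-ASCII characters, which is acceptable only if every legitimate text format sharing the extension either stays within ASCII or carries a BOM.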

vmarkovtsev commented 6 years ago

This no longer reproduces; presumably it was fixed, intentionally or not.