vmarkovtsev closed 6 years ago
Thanks for reporting this and for manually checking :-)
Linguist currently doesn't recognize .pb as a Protocol Buffer extension (only .proto). We might have to add it. However, I don't know much about Protocol Buffer, and the format of these files seems surprising. It looks very different from what we currently have registered as Protocol Buffer files. What's the difference between these files? Structure description vs. data files?
Exactly, those are the binary files with data. The schema is in .proto.
A quick fix would be to scan for non-ASCII chars in .pb and discard binary files.
These files should normally be detected as binary. For that, Linguist relies on Charlock Holmes. Charlock Holmes relies on a few heuristic rules for very common formats and falls back to the same detection strategy as git itself for other formats. That last strategy looks for NULL bytes in files. Since your .pb files don't contain any NULL bytes (I checked manually), they're not detected as binary files.
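To illustrate why a NULL-byte check can miss these files, here is a minimal sketch in Python. It is not the actual Charlock Holmes or git code; the function name `looks_binary_by_nul` and the 8000-byte window are assumptions made for illustration, and the sample bytes are a hand-built Protocol Buffer message, not taken from the reported repository.

```python
def looks_binary_by_nul(data: bytes, window: int = 8000) -> bool:
    """Return True if a NUL byte appears in the first `window` bytes.

    Hypothetical helper: the window size is an assumption for this sketch.
    """
    return b"\x00" in data[:window]


# A tiny serialized message: field 1 (length-delimited), length 2, payload "hi".
# It is binary wire-format data, yet it contains no NUL byte.
sample_pb = b"\x0a\x02hi"

# A blob that does contain a NUL byte.
sample_with_nul = b"abc\x00def"
```

With this check, `sample_pb` is treated as text even though it is serialized data, which matches the behaviour described above.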
To mitigate this issue locally, you could use Linguist overrides and mark these files as vendored. As such, they won't appear in language statistics anymore.
@pchaigno This isn't an isolated case, either. See the discussion at google/fonts#1094; it seems there's a "Text-based version" of the format that's not being accommodated.
@vmarkovtsev Do you know anything about a formal "text-based" version of Protocol Buffer? I'm not at all familiar with Protobuf, I'm afraid.
@Alhadis I have never heard of a text-based PB; the encoding specification I know of is binary.
I have found https://stackoverflow.com/questions/18873924/what-does-the-protobuf-text-format-look-like
Update: according to the docs, it does exist.
Ah right, I remember reading about that when digging up info on the text-based version. Identifying binary-type protobuf shouldn't be a problem. We can add a heuristic to check for control characters and non-ASCII characters. When you say the latter, you are referring to the codepoint range 0x80–0xFF, yes...?
Yep, 128-255. At the same time, PureBasic supports UTF-8 encoded source files and requires the BOM in that case.
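The heuristic discussed here could be sketched roughly as follows. This is only an illustration of the idea, not Linguist's implementation: the names `has_control_bytes` and `looks_like_binary_pb`, and the set of control characters tolerated in text (tab, LF, form feed, CR), are assumptions. Files starting with a UTF-8 BOM are allowed to contain bytes in 0x80-0xFF as long as they decode as UTF-8, to leave room for PureBasic sources.

```python
UTF8_BOM = b"\xef\xbb\xbf"
# Control characters assumed acceptable in source text: tab, LF, form feed, CR.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0C, 0x0D}


def has_control_bytes(data: bytes) -> bool:
    """True if data holds a control byte outside the allowed set (or DEL)."""
    return any((b < 0x20 and b not in ALLOWED_CONTROL) or b == 0x7F for b in data)


def looks_like_binary_pb(data: bytes) -> bool:
    """Guess whether a .pb file is binary data rather than source text."""
    if data.startswith(UTF8_BOM):
        body = data[len(UTF8_BOM):]
        try:
            body.decode("utf-8")  # high bytes are fine if they form valid UTF-8
        except UnicodeDecodeError:
            return True
        return has_control_bytes(body)
    # No BOM: any control byte or any byte in 0x80-0xFF counts as binary.
    return has_control_bytes(data) or any(b >= 0x80 for b in data)


binary_message = b"\x0a\x02hi"                       # serialized data, has 0x02
ascii_source = b'Debug "hello"\r\n'                  # plain ASCII source text
bom_source = UTF8_BOM + 'Debug "h\u00e9llo"'.encode("utf-8")
no_bom_utf8 = 'Debug "h\u00e9llo"'.encode("utf-8")
```

Note that a text-format Protocol Buffer file would pass this check as non-binary too, so it would still need separate handling.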
This no longer reproduces; presumably it was fixed, intentionally or not.
Google Protocol Buffers *.pb files seem to be detected as PureBasic. For example, https://github.com/bblfsh/libuast