specfy / stack-analyser

Extract 500+ technologies from any repository. Detect Languages, SaaS, Cloud, Infrastructure, Dependencies and Services
https://specfy.io
MIT License
190 stars 10 forks

[BUG] gettext's MO translation files are reported as Modelica #90

Open zed opened 20 hours ago

zed commented 20 hours ago

Describe the bug: Localization files such as mkdocs/themes/readthedocs/locales/tr/LC_MESSAGES/messages.mo are reported as the Modelica language

https://www.gnu.org/software/gettext/manual/html_node/Files.html

To Reproduce

Take any project that uses gettext for i18n. For example, a project that uses mkdocs to generate its docs:

# curl -LsSf https://astral.sh/uv/install.sh | sh  # install uvx
uvx cookiecutter --no-input https://github.com/fpgmaas/cookiecutter-uv.git  # create project with mkdocs docs
cd example-project && make docs-test  # generate docs including MO translation files

The stack analyzer reports that the Modelica language is used:

npx @specfy/stack-analyser .
jq '..| .languages?.Modelica // empty' output.json

but these are actually gettext's MO translation files:

fd -HI '\.mo$'

Desktop:

npx @specfy/stack-analyser --version
1.8.5
bodinsamuel commented 18 hours ago

Hey @zed, thanks for reporting this issue. The extension list I use for this library comes directly from GitHub's Linguist: https://github.com/github-linguist/linguist/blob/5a0c74277548122267d84283910abd5e0b89380e/lib/linguist/languages.yml

They don't seem to have added support for gettext's .mo (although .po is supported), but I could certainly add it. One big problem: apart from inspecting the file content, I'm not sure there is a way to differentiate between the two uses of the extension. Do you have a suggestion that would help?

zed commented 16 hours ago

gettext's MO files are binary (they can be detected by the presence of a zero byte). Motoko and Modelica .mo files are text (program source) -- they cannot contain zero bytes unless a UTF-16 or UTF-32 encoding is used (and it seems unlikely for Motoko or Modelica source code to use such an encoding).
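A minimal sketch of that zero-byte heuristic (the function name and sniff length are illustrative assumptions, not part of stack-analyser):

```python
def looks_binary(data: bytes, sniff_len: int = 8192) -> bool:
    """Heuristic: treat data as binary if the first sniff_len bytes
    contain a NUL byte. Text .mo files (Modelica/Motoko source) should
    never contain NUL unless encoded as UTF-16/UTF-32."""
    return b"\x00" in data[:sniff_len]

# A gettext MO file starts with the 4-byte magic number followed by a
# 4-byte revision number that is typically zero, so NULs appear early:
mo_header = bytes.fromhex("950412de") + bytes(8)
print(looks_binary(mo_header))          # MO header contains NULs
print(looks_binary(b"model M end M;"))  # Modelica source is NUL-free
```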

Another way is to test the first 4 bytes:

"The first two words serve the identification of the file. The magic number will always signal GNU MO files. The number is stored in the byte order used when the MO file was generated, so the magic number really is two numbers: 0x950412de and 0xde120495."
https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html

>>> open('messages.mo', 'rb').read(4) in map(bytes.fromhex, ["95 04 12 de", "de 12 04 95"])
True
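Putting the magic-number check into a small classifier (a sketch; classify_mo is a hypothetical helper, not an existing stack-analyser or Linguist API):

```python
# Both byte orders of the GNU MO magic number, per the gettext manual.
MO_MAGICS = (bytes.fromhex("950412de"), bytes.fromhex("de120495"))

def classify_mo(path: str) -> str:
    """Return 'gettext' if the .mo file starts with the GNU MO magic
    number (in either byte order); otherwise assume source code."""
    with open(path, "rb") as f:
        head = f.read(4)
    return "gettext" if head in MO_MAGICS else "source"
```

Reading only four bytes keeps the test cheap, and a false positive would require a Modelica or Motoko source file to begin with one of the two magic sequences, which is implausible for valid source text.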

As I understand it, the Linguist project ignores binary files, so there is no ambiguity on its side: https://github.com/github-linguist/linguist/issues/2053