sharkdp / bat

A cat(1) clone with wings.
Apache License 2.0
49.1k stars 1.24k forks source link

`Rd (R Documentation)` file is not detected correctly #1357

Open lebensterben opened 3 years ago

lebensterben commented 3 years ago

What version of bat are you using? bat 0.16.0

Describe the bug you encountered: When use bat on a .Rd file, it cannot correctly render it as an R documentation file. I have to manually add --language=rd.

Describe what you expected to happen? This should be automatically detected.

The output of bat -L shows

Rd (R Documentation) rd

And I've verified that when the extension becomes rd instead of Rd, it'd be correctly recognised by `bat.

Even though both Rd and rd file would be recognised as text/x-r-doc by xdg-mime, the extension of R documentation files is Rd in the official repository of R. For example: https://svn.r-project.org/R/trunk/doc/NEWS.Rd

How did you install bat? cargo install bat


[paste the output of info.sh here]

system

$ uname -srm Linux 5.9.1-992.native x86_64

bat

$ bat --version bat 0.16.0

$ env PAGER=less

$ bat --list-languages Found custom syntax set.

$ bat --list-themes Found custom theme set.

bat_config

$ cat /home/lucius/.config/bat/config

--theme="doom_one"
--italic-text=always

bat_wrapper

No wrapper script for 'bat'.

bat_wrapper_function

No wrapper function for 'bat'.

No wrapper function for 'cat'.

tool

$ less --version less 551 (POSIX regular expressions)

eth-p commented 3 years ago

The official Sublime syntax for R Documentation only specifies rd as a valid file extension:

https://github.com/sublimehq/Packages/blob/759d6eed9b4beed87e602a23303a121c3a6c2fb3/R/Rd%20(R%20Documentation).sublime-syntax#L6

And yet, Sublime itself will detect a file with a .Rd extension perfectly fine:

image

In the same vein, Sublime will detect MAKEFILE as a Make file, while bat defaults to plain text. This is definitely a "bug" when we compare it to Sublime's syntax detection.


@sharkdp @keith-hall, I think this should be fixed. Official and unofficial .sublime-syntax definitions are going to rely on Sublime Text's implementation of extension matching, and we're going to run into more issues like this unless we make bat consistent with Sublime. Do you guys agree?

A took a bit of time to experiment with how Sublime deals with conflicting extensions, and this is what I determined:

  1. If the extension is an exact match for only one syntax, Sublime will pick that syntax.

  2. If the extension is an exact match for multiple syntaxes, Sublime will pick whichever syntax is sorted later by its name. I'm not sure if this is explicit, or if it's a consequence of how my filesystem sorts directory entries.

    For example, on my system I have Custom Syntax 2.sublime-syntax taking priority over Custom Syntax 1.sublime-syntax.

  3. Otherwise, if the extension is a case-insensitive match for one syntax, Sublime will pick that.

  4. If the extension is a case-insensitive match for multiple syntaxes, it will select the syntax based on the process in (2).

I'm not sure if first-line detection takes place anywhere before, during, or after that process. More experimentation will be necessary to figure that part out.


Related: https://github.com/sharkdp/bat/issues/807#issuecomment-576436035

keith-hall commented 3 years ago

I think this should be fixed. Official and unofficial .sublime-syntax definitions are going to rely on Sublime Text's implementation of extension matching, and we're going to run into more issues like this unless we make bat consistent with Sublime. Do you guys agree?

Yes, I completely agree. Sounds like the best place to fix this is in syntect, then. It's a shame that Sublime doesn't allow explicitly setting whether an extension should be case sensitive or not - perhaps it would make sense if an all lowercase definition in file_extensions meant case insensitive and anything else would mean sensitive.

2. If the extension is an exact match for multiple syntaxes, Sublime will pick whichever syntax is sorted later by its name. I'm not sure if this is explicit, or if it's a consequence of how my filesystem sorts directory entries.

It's explicit. Syntect behaves the same. https://github.com/trishume/syntect/pull/217 https://docs.sublimetext.io/guide/extensibility/packages.html#merging-and-order-of-precedence

lebensterben commented 3 years ago

There is a bit difference for MAKEFILE and Rd.

First, the definition of the an R documentation file exactly matches both Rd and rd extensions.

<?xml version="1.0" encoding="utf-8"?>
<mime-type xmlns="http://www.freedesktop.org/standards/shared-mime-info" type="text/x-r-doc">
  <!--Created automatically by update-mime-database. DO NOT EDIT!-->
  <sub-class-of type="text/plain"/>
  <comment>R Documentation File</comment>
  <glob pattern="*.Rd"/>
  <glob pattern="*.rd"/>
</mime-type>

But for Makefile, it's glob pattern doesn't include "MAKEFILE"

<?xml version="1.0" encoding="utf-8"?>
<mime-type xmlns="http://www.freedesktop.org/standards/shared-mime-info" type="text/x-makefile">
  <!--Created automatically by update-mime-database. DO NOT EDIT!-->
  <comment>Makefile</comment>
......
  <sub-class-of type="text/plain"/>
  <glob pattern="makefile"/>
  <glob pattern="GNUmakefile"/>
  <glob pattern="*.mk"/>
  <glob pattern="*.mak"/>
  <glob weight="10" pattern="Makefile.*"/>
</mime-type>

BUT, as of Shared MIME-info Database specification 0.21, it also explicitly says

Applications MUST match globs case-insensitively, except when the case-sensitive attribute is set to true. This is so that e.g. main.C will be seen as a C++ file, but IMAGE.GIF will still use the *.gif pattern.

In my opinion the only correct and safe way to infer the type of a file is via it's mime info. This could be done via xdg-mime or xdgmime for example.

keith-hall commented 3 years ago

So we would need to maintain a mapping of mime types to sublime-syntax files. I'm not sure if that is something we want to do, even though it sounds like the most accurate approach. If only the sublime-syntax format was open and we could suggest changes to it so it could be handled there ;)

Edit: plus I don't know offhand how portable it would be - does Windows have decent mime type reporting support? Somehow I think not, but could be wrong - maybe through an additional dependency only...?

sharkdp commented 3 years ago

As mentioned in #807 (reference above), there are cases like this:

Objective-C:m
Objective-C++:mm,M

Apparently, *.m files are Objective C while *.M files are Objective C++. On my machine, I did not find a mime description file for Objective C++. The one for Objective C has

<?xml version="1.0" encoding="utf-8"?>
<mime-type xmlns="http://www.freedesktop.org/standards/shared-mime-info" type="text/x-objcsrc">
…
  <sub-class-of type="text/x-csrc"/>
  <glob pattern="*.m"/>
</mime-type>

without any case-sensitive attribute.

lebensterben commented 3 years ago

@sharkdp According to Apple's documentation

Xcode requires that file names have a “.mm” extension for the Objective-C++ extensions to be enabled by the compiler.

Also a good thing about querying system mime database is, the system administrator can add/modify it. So even if the current specification doesn't fit a user's needs well, it's possible to update the mime database to reflect the requirements such as case-sensitivity.

Therefore, I think bat should have the ability to query system mime database, either automatically or being invoked by the user, to update the "mapping" of the mime type and sublime-syntax type.

lebensterben commented 3 years ago

Keeping a mapping inside bat's source could be an option. But alternatively we can generate such mapping when bat is installed on a user's system by querying mime database.

In normal case when bat is able to detect a file's type by its extension, bat probably should keep doing that.

If it cannot determine the type and if mime database is available, it can then rely on the mapping and mime query to determine the type.

Since the database is mutable, bat may need to update its internal mapping periodically. I imagine it won't be very complicated:

eth-p commented 3 years ago

I'm skeptical about the whole idea of adding syntax detection via MIME querying directly to bat.

While it would be useful for accuracy, I think it's important that we keep bat's syntax detection behavior similar to Sublime's. We're piggybacking off the Sublime ecosystem for syntax highlighting, and intentionally diverging from it isn't going to do us any favors in compatibility.

Additionally, it's going to be an issue from maintenance standpoint. Using operating system APIs to query the MIME type of a file is going to make bat's behavior much more dependent on system configuration. For example, what if a user files a bug report about incorrect syntax detection because their system has a bad extension to MIME mapping?

Finally, adding on to @keith-hall's compatibility concerns: How would we handle a MIME querying API that only accepts files? That wouldn't be possible with pipes unless we buffer the data and save it somewhere, which goes against the core principle that bat should be unbuffered.

eth-p commented 3 years ago

Since the database is mutable, bat may need to update its internal mapping periodically. I imagine it won't be very complicated:

  • In sublime syntax specification, for every component in the file_extensions field, we can make a dummy file with random contents and query its mime type.

Unfortunately, this wouldn't be enough. I don't know about Linux, but part of the MIME type detection process on MacOS involves reading the file contents and attempting to match them with rules relating to that specific type. We would need valid files to query the MIME type properly.

sharkdp commented 3 years ago

Ok, so what should be done here? Report (and fix) the issue in syntect?

keith-hall commented 3 years ago

Ok, so what should be done here? Report (and fix) the issue in syntect?

Yes, I believe so.