phiresky / ripgrep-all

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

"Binary" overrides custom adapters? #206

Closed awilkins closed 9 months ago

awilkins commented 9 months ago

Describe the bug

I declared the following custom adapter to stream Outlook's .msg format into `.eml`:

    {
      "name": "outlook",
      "version": 1,
      "description": "Uses `msgconvert` to stream .msg files into .eml",
      "extensions": [
        "msg"
      ],
      // The first MIME type is what `file` reports; the second is also in use. The adapter doesn't work with or without these.
      "mimetypes": [ "application/vnd.ms-office", "application/vnd.ms-outlook" ],
      "binary": "msgconvert",
      "args": [
        "${input_virtual_path}",
        "--outfile",
        "-"
      ],
      "disabled_by_default": false,
      "match_only_by_mime": false,
      "output_path_hint": "${input_virtual_path}.eml"
    }

If I pre-convert the .msg into an .eml file, rga happily converts it and finds my search string.

It also works if I convert the file to stdout and pipe it into rga:

❯ msgconvert "changes.msg" --outfile - 2> /dev/null | rga Emergency
Any Emergency changes will be reviewed
<snip>

If I run rga on the file path directly, I get:

Binary file matches (found "\u{0}" byte around offset 8)

If I run rga on my folder, I get no hits.

My hypothesis is that the decision "this is a binary file and I won't search it" is made before custom adapters are considered (but presumably not before integrated adapters?).
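For context, the "Binary file matches" message comes from ripgrep's binary detection, which treats a file as binary once it sees a NUL byte. Below is a minimal sketch of that kind of check, not ripgrep's actual code. The function name `first_nul_offset`, the 64 KiB window, and the sample bytes are assumptions for illustration; the sample imitates a file that starts with an 8-byte signature followed by zero padding.

```python
from typing import Optional


def first_nul_offset(data: bytes, window: int = 65536) -> Optional[int]:
    """Return the offset of the first NUL byte in the leading window, or None.

    Illustrative only: the window size is an assumption, not ripgrep's value.
    """
    idx = data[:window].find(b"\x00")
    return idx if idx != -1 else None


# Hypothetical sample: 8 non-zero signature bytes, then zero padding.
binary_like = bytes.fromhex("d0cf11e0a1b11ae1") + b"\x00" * 16
text_like = b"Any Emergency changes will be reviewed\n"

print(first_nul_offset(binary_like))  # 8
print(first_nul_offset(text_like))    # None
```

If this message shows up for a .msg path, the raw bytes were searched, which would suggest the adapter never ran on that file.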

Operating System and Version

Ubuntu 20.04

Output of rga --version

ripgrep-all 0.10.6

Workaround

# Happily there are no "original" .eml files in my folder
fd \.msg -x msgconvert {} --outfile {.}.eml

awilkins commented 9 months ago

Hmm, OK, I noticed that pandoc conversions weren't being run, and it turns out that

"adapters": [ "mail"]

disables all the other adapters.

This feels like a misunderstanding - the schema just reports the help text for the command-line arg:

Change which adapters to use and in which priority order (descending)

"foo,bar" means use only adapters foo and bar. "-bar,baz" means use all default adapters except for bar and baz. "+bar,baz" means use all default adapters and also bar and baz.

It's not immediately apparent how to convert the CLI args pattern to jsonc

"adapters": [ "+mail" ]

... has the desired effect. My extra step (for Outlook mails) is working fine and even chaining to the inbuilt eml filter now.
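To make the prefix rules in the quoted help text concrete, here is a rough sketch of how such a spec could be resolved. This is not rga's implementation: `resolve_adapters` is a hypothetical helper, the adapter names in the usage lines are placeholders, and the position of the extra adapters in the `+` case is an assumption (the help text only says they are used in addition to the defaults).

```python
from typing import List


def resolve_adapters(spec: str, defaults: List[str]) -> List[str]:
    """Resolve a CLI-style adapter spec against a list of default adapters.

    Hypothetical helper that follows the quoted help text:
      "foo,bar"  -> only foo and bar
      "-bar,baz" -> all defaults except bar and baz
      "+bar,baz" -> all defaults and also bar and baz
    """
    if not spec:
        return list(defaults)
    prefix = spec[0] if spec[0] in "+-" else ""
    names = [n for n in spec[len(prefix):].split(",") if n]
    if prefix == "-":
        return [a for a in defaults if a not in names]
    if prefix == "+":
        # Assumption: extras are appended after the defaults.
        return list(defaults) + [n for n in names if n not in defaults]
    return names


defaults = ["pandoc", "zip"]  # placeholder default set
print(resolve_adapters("mail", defaults))   # ['mail']
print(resolve_adapters("+mail", defaults))  # ['pandoc', 'zip', 'mail']
print(resolve_adapters("-zip", defaults))   # ['pandoc']
```

Read this way, a bare "mail" replaces the default set, which matches the pandoc conversions going missing, while "+mail" keeps the defaults and adds the extra adapter.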