phiresky / ripgrep-all

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

pandoc adapter fails on *.htm files #205

Open tionis opened 5 months ago

tionis commented 5 months ago

Describe the bug: When searching across e-books, the pandoc adapter fails on *.htm files because pandoc rejects the input format with an "Unknown input format htm" error. I originally tried to work around this by defining a custom adapter, but it then conflicted with the built-in adapter.

Possible Solution: For htm files, instead of running something like

Command { std: "pandoc" "--from=htm" "--to=plain" "--wrap=none" "--markdown-headings=atx", kill_on_drop: false }

rga could run something like

Command { std: "pandoc" "--from=html" "--to=plain" "--wrap=none" "--markdown-headings=atx", kill_on_drop: false }

since pandoc's name for this input format is html, not htm.
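To make the proposal concrete, here is a minimal sketch of the idea in Python. It is not ripgrep-all's actual code (rga is written in Rust), and the names FORMAT_ALIASES, pandoc_input_format, and pandoc_args are hypothetical. It assumes the adapter derives the --from value from the file extension, which is what the failing command above suggests.

```python
from pathlib import Path

# Hypothetical alias table: extensions whose name differs from the pandoc
# input format they should map to. Only htm -> html comes from this issue.
FORMAT_ALIASES = {"htm": "html"}


def pandoc_input_format(path):
    """Return the pandoc --from value for a file, based on its extension."""
    ext = Path(path).suffix.lstrip(".").lower()
    # Fall back to the extension itself when no alias is registered.
    return FORMAT_ALIASES.get(ext, ext)


def pandoc_args(path):
    """Build the argument list shown in the issue, with the format normalized."""
    return [
        "pandoc",
        f"--from={pandoc_input_format(path)}",
        "--to=plain",
        "--wrap=none",
        "--markdown-headings=atx",
    ]


if __name__ == "__main__":
    print(pandoc_args("chapter1.htm"))
```

With a table like this, chapter1.htm and chapter1.html would both be passed to pandoc with --from=html, while other extensions keep their current behavior.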

Operating System and Version: Manjaro Linux

Output of rga --version: ripgrep-all 0.10.6

KlyithSA commented 2 months ago

This custom adapter works without conflicting with normal html:

        {
            "name": "htm custom",
            "version": 1,
            "description": "fix for https://github.com/phiresky/ripgrep-all/issues/205",

            "extensions": ["htm"],
            "mimetypes": ["application/x-extension-htm"],

            "binary": "pandoc",
            "args": ["--from=html", "--to=plain", "--wrap=none", "--markdown-headings=atx"],
            "disabled_by_default": false,
            "match_only_by_mime": false
        }

tionis commented 2 months ago

Oh right, I forgot to update the issue! I figured out a similar config, but the standard config should probably still handle this correctly. Thanks though!

g-berthiaume commented 2 months ago

Hi! I think I just encountered the same issue.

G:\...\myfile.htm.txt adapter: postprocprefix
Unknown input format htm
Error: copying adapter output to stdout

Caused by:
    0: subprocess: Command { std: "pandoc" "--from=htm" "--to=plain" "--wrap=none" "--markdown-headings=atx", kill_on_drop: false }
    1: ExitStatus(ExitStatus(21))

The custom adapter doesn't work for me. That said, it could be my fault: I have never used custom adapters before.

tionis commented 2 months ago

You have to add the custom adapter to the rga config file. The location may vary depending on your system configuration, but normally it is ~/.config/ripgrep-all/config.jsonc. Mine, for example, looks like this:

{
  "$schema": "./config.schema.json",
  "custom_adapters": [
    {
      "name": "htm-pandoc",
      "version": 1,
      "description": "Uses pandoc to transform htm files",
      "extensions": ["htm"],
      "mimetypes": ["application/x-extension-htm","application/htm"],
      "binary": "pandoc",
      "args": ["--from=html", "--to=plain", "--wrap=none", "--markdown-headings=atx"],
      "disabled_by_default": false,
      "match_only_by_mime": false
    },
    {
      "name": "gron",
      "version": 1,
      "description": "Transform JSON into discrete JS assignments",
      "extensions": ["json"],
      "mimetypes": ["application/json"],
      "binary": "gron",
      "args": [],
      "disabled_by_default": false,
      "match_only_by_mime": false
    }
  ]
}