htdebeer / pandocomatic

Automate the use of pandoc
https://heerdebeer.org/Software/markdown/pandocomatic/
GNU General Public License v3.0

Weird error when parsing .docx files #116

Closed: pro-arch-user closed this issue 1 month ago

pro-arch-user commented 1 month ago

I'm trying to use the command "pandocomatic -c .\config.yaml -o output_dir -i test". The directory "test" has a few .docx files. Here is my "config.yaml":

settings:
  recursive: true
  follow-symlinks: false
  skip: ['.', 'pandocomatic.yaml']
  match-files: 'first'
templates:
  templatew:
    glob: ['.docx']

When I run this, it throws an error:

[UNEXPECTED ERROR] An unexpected error has occurred. You can report this bug via https://github.com/htdebeer/pandocomatic/issues/new.
C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandoc_metadata.rb:238:in `scan': invalid byte sequence in UTF-8 (ArgumentError)

    starts = input.scan(BLOCK_START)
                        ^^^^^^^^^^^
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandoc_metadata.rb:238:in `extract_blocks'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandoc_metadata.rb:206:in `initialize'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandoc_metadata.rb:198:in `new'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandoc_metadata.rb:198:in `extract_metadata'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandoc_metadata.rb:69:in `load'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandoc_metadata.rb:55:in `load_file'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/command/convert_file_multiple_command.rb:52:in `initialize'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/command/convert_dir_command.rb:91:in `new'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/command/convert_dir_command.rb:91:in `block in initialize'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/command/convert_dir_command.rb:72:in `foreach'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/command/convert_dir_command.rb:72:in `initialize'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandocomatic.rb:84:in `new'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/lib/pandocomatic/pandocomatic.rb:84:in `run'
    from C:/Ruby33-x64/lib/ruby/gems/3.3.0/gems/pandocomatic-1.1.3/bin/pandocomatic:3:in `<top (required)>'
    from C:/Ruby33-x64/bin/pandocomatic:32:in `load'
    from C:/Ruby33-x64/bin/pandocomatic:32:in `<main>'

Please help

htdebeer commented 1 month ago

Thanks for reporting this issue! I can reproduce the issue on my end. Looks like pandocomatic tries to recognize YAML metadata blocks in a non-markdown / non-plain text format, and fails.

I haven't used DOCX as an input format before, so I didn't run into this problem. It seems likely that the issue also occurs for other non-plain text input formats. Anyway, pandocomatic should support any input format, so I'll investigate the issue to come up with a fix.
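
To illustrate the failure mode in the stack trace: a DOCX file is a ZIP archive, so its bytes are generally not valid UTF-8, and Ruby's String#scan raises an ArgumentError when a regular expression is matched against such a string. The following is only a minimal sketch under assumptions: BLOCK_START_SKETCH is a hypothetical stand-in for pandocomatic's BLOCK_START (the real pattern is not shown in this thread), and the helper names are made up for illustration.

```ruby
# Hypothetical stand-in for pandocomatic's BLOCK_START pattern, which is not
# shown in this thread: a line that consists of three dashes.
BLOCK_START_SKETCH = /^---[ \t]*$/

# A few bytes that are not valid UTF-8, standing in for the binary content of
# a DOCX file that was read as if it were UTF-8 text.
BINARY_LIKE = [0x50, 0x4B, 0x03, 0x04, 0xFF, 0xFE].pack("C*").force_encoding(Encoding::UTF_8)

# Returns the error message raised by scan, or nil when scanning succeeds.
def scan_error_message(input)
  input.scan(BLOCK_START_SKETCH)
  nil
rescue ArgumentError => e
  e.message
end

# Guarded variant (an assumption, not pandocomatic's fix): only scan input
# that is valid in its own encoding, otherwise report no metadata blocks.
def safe_block_starts(input)
  input.valid_encoding? ? input.scan(BLOCK_START_SKETCH) : []
end

puts scan_error_message(BINARY_LIKE)
puts safe_block_starts("---\ntitle: Example\n---\n").length
```

Checking valid_encoding? before scanning is one possible guard; skipping non-text formats altogether is another.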

pro-arch-user commented 1 month ago

Yeah thanks man. Maybe I should add some kinda "ignore metadata" param to my template?

Btw do you have a batch/bash script that recursively iterates over all files in a directory and performs a command on them? Because that way I could just do "if (file extension == .docx) pandoc file_name -f docx -t markdown -o file_name" for each file.

htdebeer commented 1 month ago

Something like

find . -name '*.docx' -print0 | xargs -0 -I{} pandoc "{}" -f docx -t markdown -o "{}.md"

might work?

htdebeer commented 1 month ago

But be careful, if you forget the ".md" in "{}.md", your original files might get overwritten. Maybe apply this command on a copy of your files instead of the originals. Just to be safe.
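
The find/xargs pipeline above assumes a Unix-like shell. Where that is not available (for example on Windows), the same idea could be sketched in Ruby, which is already installed wherever pandocomatic runs. This is a rough sketch, not a tested tool: the helper names are hypothetical, it assumes pandoc is on the PATH, and it defaults to a dry run that only returns the commands it would execute.

```ruby
require "find"

# Hypothetical helper: output path for a DOCX file. Appending ".md" (rather
# than reusing the name) means the source file is never the target.
def markdown_target(docx_path)
  "#{docx_path}.md"
end

# Hypothetical helper: all .docx files below root, found recursively.
def docx_files(root)
  Find.find(root)
      .select { |path| File.file?(path) && File.extname(path).casecmp?(".docx") }
      .sort
end

# Builds one pandoc command per DOCX file and skips targets that already
# exist. With dry_run: true the commands are returned but not executed.
def convert_tree(root, dry_run: true)
  docx_files(root).filter_map do |source|
    target = markdown_target(source)
    next if File.exist?(target)

    command = ["pandoc", source, "-f", "docx", "-t", "markdown", "-o", target]
    system(*command) unless dry_run
    command
  end
end
```

Calling convert_tree("test") lists the planned commands; convert_tree("test", dry_run: false) would actually run pandoc. Because existing targets are skipped, running it twice does not overwrite earlier output.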

htdebeer commented 1 month ago

> Maybe I should add some kinda "ignore metadata" param to my template?

That doesn't exist at the moment. Might be a nice feature to have: a template setting to skip looking inside files for detailed pandoc and pandocomatic configuration.

pro-arch-user commented 1 month ago

> > Maybe I should add some kinda "ignore metadata" param to my template?
>
> That doesn't exist at the moment. Might be a nice feature to have: a template setting to skip looking inside files for detailed pandoc and pandocomatic configuration.

It would be cool if you added this.

pro-arch-user commented 1 month ago

> Something like
>
> find . -name '*.docx' -print0 | xargs -0 -I{} pandoc "{}" -f docx -t markdown -o "{}.md"
>
> might work?

I am too dumb for that shit haha

pro-arch-user commented 1 month ago

> But be careful, if you forget the ".md" in "{}.md", your original files might get overwritten. Maybe apply this command on a copy of your files instead of the originals. Just to be safe.

Yeah, I have backups including a copy on a USB stick, so it should be good.

pro-arch-user commented 1 month ago

I ended up making my own script https://github.com/pro-arch-user/Pandoc-Directory-Convert

htdebeer commented 1 month ago

I've been looking into the issue and discovered that more things go awry when using DOCX, or any non-plain text input format, with pandocomatic. I seem to have built pandocomatic around the implicit assumption that we convert only plain text source files.

I will look into this further, but expect a solution to take a while.

pro-arch-user commented 1 month ago

> I've been looking into the issue and discovered that more things go awry when using DOCX, or any non-plain text input format, with pandocomatic. I seem to have built pandocomatic around the implicit assumption that we convert only plain text source files.
>
> I will look into this further, but expect a solution to take a while.

Docx is ass. I used my script and finally switched my notes to Obsidian. So much better now.

htdebeer commented 1 month ago

Fixed the issue. The fix will be in the next version of pandocomatic (1.2.0), but the release will wait until I've made more changes.

If you want to test before the release is published, check out the master branch and use "test/pandocomatic.rb" as the pandocomatic program. I.e., to run the scenario reported in this ticket, run:

/path/you/cloned/pandocomatic/repo/test/pandocomatic.rb -c .\config.yaml -o output_dir -i test

htdebeer commented 1 month ago

I've fixed the issue by only extracting pandoc metadata YAML blocks from markdown files. If pandocomatic doesn't yet know a file's source format, it uses pandoc's default mapping from file extension to source format. In either case, this means that DOCX files will not be mined for pandoc YAML metadata blocks.

In case you use an uncommon file extension for your markdown files, you can use the extract-metadata-from setting in your pandocomatic configuration files to tell pandocomatic to also extract pandoc metadata YAML blocks from these files. For example, if you call your markdown files "my_document.pandoc", you can configure:

settings:
  # ...
  extract-metadata-from: ['*.pandoc']
  # ...

Note that this configuration does not override or disable extracting pandoc metadata YAML blocks from markdown files recognized as such by pandocomatic or pandoc. I.e., all files named "*.md" will still be mined for metadata.
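
The additive behaviour described above can be illustrated with a small Ruby sketch. It is not pandocomatic's actual implementation: the helper name, the default glob list, and the use of File.fnmatch are all assumptions made for illustration (pandocomatic itself relies on pandoc's extension mapping, which is not reproduced here).

```ruby
# Globs treated as markdown by default in this sketch. The list is an
# assumption for illustration only, not pandoc's real extension mapping.
DEFAULT_MARKDOWN_GLOBS = ["*.md", "*.markdown"].freeze

# Hypothetical helper: should YAML metadata blocks be extracted from path?
# extra_globs plays the role of the extract-metadata-from setting: it extends
# the defaults and never replaces them.
def extract_metadata?(path, extra_globs = [])
  name = File.basename(path)
  (DEFAULT_MARKDOWN_GLOBS + extra_globs).any? { |glob| File.fnmatch(glob, name) }
end

puts extract_metadata?("test/report.docx")
puts extract_metadata?("my_document.pandoc", ["*.pandoc"])
puts extract_metadata?("notes.md", ["*.pandoc"])
```

With this shape, a DOCX file is never scanned for metadata, while "*.pandoc" files are scanned only once the extra glob is configured.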