jgm / pandoc

Universal markup converter
https://pandoc.org

Converting Doc/Docx/ODT to X fails: invalid charset, iconv won't convert. #2983

Closed by zeryx 8 years ago

zeryx commented 8 years ago

I'm trying to convert a locally made doc file (created with OpenOffice) to either HTML or Markdown. Both conversions fail with: pandoc: Cannot decode byte '\xff': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream

file -bi charset output:

$ file -bi test.doc 
application/msword; charset=binary

iconv -f binary -t utf-8 output:

$ iconv -f binary -t utf-8 test.doc -o testutf8.doc
iconv: conversion from `binary' is not supported

So it seems there is no way to use a doc file with pandoc, unless I'm missing something critical. Any help would be greatly appreciated!

PS: this also fails when parsing odt and docx files produced by OpenOffice's "Save As"; all other formats work fine.

mb21 commented 8 years ago

Pandoc cannot read doc files, only docx and odt. Can you open the files in question in Word or OpenOffice/LibreOffice? Can you post a link to them?

jkr commented 8 years ago

And pandoc only started being able to read docx and odt files in the last couple of years. What version (pandoc -v) are you running?

zeryx commented 8 years ago

Never mind: docx and odt do work as both input and output formats. I needed to pass the -f and -t flags for the conversion to work properly.
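For anyone landing here with the same error, the fix above (naming the reader and writer explicitly) could be sketched roughly as follows. This is only an illustration, not something from the pandoc docs: `reader_for` and `convert` are hypothetical helper names, and the file names in the usage line are placeholders. It assumes a pandoc build that includes the docx and odt readers.

```shell
# Hypothetical helper: map a file name to the pandoc reader to pass via -f.
# Only docx and odt are handled; legacy .doc has no pandoc reader, so it fails.
reader_for() {
  case "$1" in
    *.docx) echo docx ;;
    *.odt)  echo odt ;;
    *)      return 1 ;;
  esac
}

# Hypothetical wrapper: convert INPUT to the writer format TO, writing OUTPUT.
# Passing -f and -t explicitly avoids relying on pandoc guessing the formats.
convert() {
  input=$1
  to=$2
  output=$3
  from=$(reader_for "$input") || {
    echo "no pandoc reader for $input" >&2
    return 1
  }
  pandoc -f "$from" -t "$to" "$input" -o "$output"
}
```

With these defined, something like `convert test.docx markdown test.md` or `convert test.odt html test.html` would run pandoc with both flags set, while `convert test.doc html test.html` stops early with an error instead of feeding the binary file to a text reader.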

Just FYI, my output from pandoc -v is:

$ pandoc -v
pandoc 1.16.0.2
Compiled with texmath 0.8.4.1, highlighting-kate 0.6.1.
Syntax highlighting is supported for the following languages:
    abc, actionscript, ada, agda, apache, asn1, asp, awk, bash, bibtex, boo, c,
    changelog, clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css,
    curry, d, diff, djangotemplate, dockerfile, dot, doxygen, doxygenlua, dtd,
    eiffel, email, erlang, fasm, fortran, fsharp, gcc, glsl, gnuassembler, go,
    haskell, haxe, html, idris, ini, isocpp, java, javadoc, javascript, json,
    jsp, julia, kotlin, latex, lex, lilypond, literatecurry, literatehaskell,
    llvm, lua, m4, makefile, mandoc, markdown, mathematica, matlab, maxima,
    mediawiki, metafont, mips, modelines, modula2, modula3, monobasic, nasm,
    noweb, objectivec, objectivecpp, ocaml, octave, opencl, pascal, perl, php,
    pike, postscript, prolog, pure, python, r, relaxng, relaxngcompact, rest,
    rhtml, roff, ruby, rust, scala, scheme, sci, sed, sgml, sql, sqlmysql,
    sqlpostgresql, tcl, tcsh, texinfo, verilog, vhdl, xml, xorg, xslt, xul,
    yacc, yaml, zsh
Default user data directory: /home/james/.pandoc
Copyright (C) 2006-2015 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

My version was taken directly from the Ubuntu 16.04 dep cache.