jgm / pandoc

Universal markup converter
https://pandoc.org

Does not cope well with fancy punctuation. #628

Closed pete-mckinney closed 12 years ago

pete-mckinney commented 12 years ago

I have some source material that I'm trying to convert into an epub. It uses punctuation that is giving pandoc fits.

Here is a sample:

# Chapter 1
This is some text.  Doesn’t like an em dash — does not.
“Also does not like fancy quotes.”

This gives the following error:

pandoc: test2.txt: hGetContents: invalid argument (invalid byte sequence)

It would be nice if pandoc would consume this. It would also be nice if pandoc would give the line number and column of the data that it considers invalid.

Thanks!

jgm commented 12 years ago

This error means that your input is not UTF-8 encoded. It has nothing to do with the specific content.

Use your text editor, or iconv or another tool to convert the text to UTF-8.

The next version of pandoc will have a more informative error message for this!
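For anyone who wants the line and column that the original report asked for, here is a minimal Python sketch (not part of pandoc) that locates the first byte that is not valid UTF-8 and then re-encodes the data. The function names are made up for illustration, and the source encoding `cp1252` is only an assumption: it is a common origin of curly quotes and em dashes in non-UTF-8 files, but the thread never says what encoding the input actually used.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, column, byte_offset) of the first invalid UTF-8 byte, or None.

    Line and column are 1-based; the column is counted in bytes.
    """
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        offset = err.start
        prefix = data[:offset]
        line = prefix.count(b"\n") + 1
        # rfind returns -1 when there is no earlier newline, so this also
        # works for an error on the first line.
        column = offset - (prefix.rfind(b"\n") + 1) + 1
        return line, column, offset


def convert_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode data as UTF-8. The default source encoding is an assumption."""
    return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    # A sample similar to the one in the report, saved in Windows-1252.
    # \u2019 is a curly apostrophe and \u2014 is an em dash.
    text = "# Chapter 1\nThis is some text. Doesn\u2019t like an em dash \u2014 does not.\n"
    sample = text.encode("cp1252")

    print(find_invalid_utf8(sample))                   # position of the first bad byte
    print(find_invalid_utf8(convert_to_utf8(sample)))  # None once converted
```

On the command line, `iconv -f WINDOWS-1252 -t UTF-8 test2.txt` does the same conversion, again assuming that source encoding is the right guess.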

pete-mckinney commented 12 years ago

That did the trick- thanks!


Blackhawke commented 10 years ago

I know this bug is a couple of years old, but I'm having the same problem and iconv doesn't fix it. Even more interesting is that the document I'm getting this error on is one that pandoc itself created! I took a LaTeX document, converted it to ODT, MD, DOCX, and a few other formats for testing, and then tried to convert them back into LaTeX. No go! Every format, all produced by pandoc itself, exits with the same error:

pandoc: <stdin>: hGetContents: invalid argument (invalid byte sequence)

(Note: the "<stdin>" reference is there because in this instance I filtered the input file through iconv first. It didn't change anything! Same results with or without iconv.)

Ideas, folks?

(Note: Running Ubuntu 12.04 LTS)

mpickering commented 10 years ago

Please can you post your input if possible? If not then a minimum example which highlights the problem?

Making a new issue wouldn't be a bad idea either.

jgm commented 10 years ago

Also what do

pandoc --version

and

locale

report?


Blackhawke commented 10 years ago

Thanks for getting back to me so promptly John.

First, let me correct the record. I mistakenly included markdown in my list, and should not have. Pandoc reverse processed the md formatted file just fine. Moving on...

Okay... 'pandoc --version' returns:

pandoc 1.9.1.1
Compiled with citeproc-hs 0.3.4, texmath 0.6.0.3, highlighting-kate 0.5.0.5.
Syntax highlighting is supported for the following languages:
Actionscript, Ada, Alert, Alert_indent, Apache, Asn1, Asp, Awk, Bash,
Bibtex, Boo, C, Changelog, Clojure, Cmake, Coffeescript, Coldfusion,
Commonlisp, Cpp, Cs, Css, D, Diff, Djangotemplate, Doxygen, Dtd, Eiffel,
Email, Erlang, Fortran, Fsharp, Gnuassembler, Go, Haskell, Haxe, Html, Ini,
Java, Javadoc, Javascript, Json, Jsp, Latex, Lex, LiterateHaskell, Lua,
Makefile, Mandoc, Matlab, Maxima, Metafont, Mips, Modula2, Modula3,
Monobasic, Nasm, Noweb, Objectivec, Objectivecpp, Ocaml, Octave, Pascal,
Perl, Php, Pike, Postscript, Prolog, Python, R, Relaxngcompact, Rhtml, Ruby,
Scala, Scheme, Sci, Sed, Sgml, Sql, SqlMysql, SqlPostgresql, Tcl, Texinfo,
Verilog, Vhdl, Xml, Xorg, Xslt, Xul, Yacc, Yaml
Copyright (C) 2006-2012 John MacFarlane
Web: http://johnmacfarlane.net/pandoc
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

This is, I hope, the latest version, as it's the one Canonical has up. :)

locale reports:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

As it should (unless someone has been mucking with my system settings!).

As for posting my input, the LaTeX file is a book length manuscript, so let me create a smaller file and put that up here in a bit.
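One thing that may be worth ruling out, offered as a guess that nothing in this thread confirms: DOCX and ODT files are ZIP containers, not plain text, so any tool that tries to read one as UTF-8 text would be expected to fail with a decoding error no matter how the locale is set, and iconv would not help. That would at least be consistent with the Markdown file converting back fine while the other formats do not. The Python sketch below (not pandoc code; the function name and labels are made up for illustration) tells the three cases apart.

```python
import io
import zipfile


def classify_input(data: bytes) -> str:
    """Roughly classify input bytes as a ZIP container, UTF-8 text, or neither."""
    # DOCX, ODT and EPUB files are all ZIP archives under the hood.
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip-container"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    return "utf8-text"


if __name__ == "__main__":
    # Build a tiny in-memory ZIP archive to stand in for a DOCX or ODT file.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", "<p>some text</p>")

    print(classify_input(buf.getvalue()))
    print(classify_input("plain UTF-8 text \u2014 with a dash".encode("utf-8")))
    print(classify_input(b"\x93cp1252 quotes\x94"))
```

If a file comes back as a ZIP container, the thing to check is whether the installed pandoc version lists that format as an input format at all, not the file's text encoding.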

mpickering commented 10 years ago

The latest version is 1.12.4! and we're going to release 1.13 in the very very near future. Maybe try with an updated version first?

Blackhawke commented 10 years ago

Yeah I'm serious! WTF is up with Ubuntu and Canonical?!?!?

Would you happen to have a URL for a private repo I can add for apt to get your updates automagically?

-=Michael=- Metaphor Publications



Blackhawke commented 10 years ago

Been doing some Google cruising. I haven't found an apt repo yet, but I have found evidence that Ubuntu isn't doing a very good job at all of keeping pandoc up to date: http://stackoverflow.com/questions/24863160/trouble-with-pandoc-installation-on-ubuntu-14-04lts-for-using-with-r-markdown

(Note: I'm running 12.04 LTS, not 14.04 LTS (we want to finish this book series before upgrading), but IMO that only makes the above problem worse, as 14.04 was just released! It should have the latest version of your program!)
