jgm / pandoc

Universal markup converter
https://pandoc.org

Add AsciiDoc Reader / AsciiDoc input support #1456

Open ERnsTL opened 10 years ago

ERnsTL commented 10 years ago

Greetings,

I would like to suggest adding AsciiDoc input, i.e. an AsciiDoc reader.

Besides Markdown, this format is growing in popularity, including as part of publishing toolchains (asciidoc -> docbook -> pdf/epub/html). Currently the only other viable implementation is asciidoctor, which uses Ruby or JRuby, but it accepts only AsciiDoc as input and is not a universal markup converter like pandoc.

I am aware of only one relevant discussion thread on this, and it showed a positive response to the feature. Someone actually had some basic code there, though I am not sure whether kuznero (Roman Kuznetsov) still has that code to start from. In any case, it would certainly make sense to have AsciiDoc input in the feature set.

mpickering commented 10 years ago

I have started working on this, but it might not be ready for a few months.

mpickering commented 10 years ago

Have you tried going AsciiDoc -> DocBook -> Pandoc? Can you describe the shortcomings of doing this if you have?

ERnsTL commented 10 years ago

Thanks for your positive comment!

I personally have not tried the conversion chain you mentioned. The current choices are bulky in terms of dependencies, and they neither are nor aspire to be universal markup translators like pandoc. Going through intermediate formats instead of a single pandoc invocation seems hacky to me.

Recently I read about two publishing houses switching away from LaTeX to AsciiDoc as their source format, so I gather that it meets the needs of technical writing well, including cross-referencing. I find it useful to have another capable plain-text format for documents, provided software support is good, converts easily between formats, and offers multiple output choices. Which is where this feature would come in ;-)

I personally would also rather write articles, and possibly a book, in AsciiDoc than in LaTeX, which is (at least my personal) motivation for this feature.

alexborisov commented 10 years ago

:+1: I would love to see support for reading asciidoc files in pandoc. It was actually my primary reason to use pandoc. At the moment I have a bit of a hacky solution: converting my asciidoc into HTML and then feeding that into pandoc. I would love to eliminate the extra project dependency and simplify my build chain.

ciampix commented 9 years ago

I can testify that this would be a very useful addition to pandoc, thanks in advance mr. mpickering!

jgm commented 9 years ago

PR #2100 contributes a basic AsciiDoc reader (with many features not yet implemented). @mpickering, how far did you get on your AsciiDoc reader? Is it farther along than #2100, or not as far along? It would be good to put something in the repository (in a branch) for people to work on to advance the project.

mpickering commented 9 years ago

I commented on #2100.

romario89 commented 9 years ago

I agree that asciidoc is getting popular. I've just tried to convert the book Pro Git 2 into epub using pandoc and soon noticed that the book was written in asciidoc, and pandoc was unable to read it.

benhourigan commented 9 years ago

I’ve tried an asciidoc > html (via https://github.com/asciidoctor/asciidoctor) > epub (via pandoc) conversion chain and it works extremely well except for the following issue.

Asciidoctor wraps all HTML elements in divs with additional classes. This stops pandoc from splitting the epub automatically at headings, because it will never see a 'naked' h1 etc.

Also mentioned this issue here: https://github.com/asciidoctor/asciidoctor/issues/184

The ePub file I created using this method did not pass epubcheck 3.0.1 because it contained one duplicate ID (something I could have avoided). More seriously, the way footnotes are handled is not compliant, and raised several errors like this:

ERROR: …/EPUB.epub/ch003.xhtml(2918,1223): '_footnote_4': fragment identifier is not defined in 'ch003.xhtml'
ERROR: …/EPUB.epub/ch008.xhtml(2065,27): '_footnoteref_1': fragment identifier is not defined in 'ch008.xhtml'

(paths truncated with ellipsis at start)

At the risk of stating the obvious, it's important that pandoc-generated epubs from any source format avoid epubcheck validation errors, as authors and publishers may need to submit these epubs to storefronts that require epubcheck compliance (e.g. Smashwords, iBooks). Many of the GitHub-hosted CLI epub generators I've tried (e.g. https://github.com/avdgaag/rpub) omit consideration of epubcheck compliance, so this may not be an obvious point after all. The most common point of failure seems to be the manifest, which pandoc does correctly, which is great. But it could be yet more robust, as the above errors indicate.

jgm commented 9 years ago

+++ Ben Hourigan [Jul 05 15 08:56 ]:

> I’ve tried an asciidoc > html (via https://github.com/asciidoctor/asciidoctor) > epub (via pandoc) conversion chain and it works extremely well except for the following issue.
>
> Asciidoctor wraps all HTML elements in divs with additional classes. This stops pandoc from splitting the epub automatically at headings, because it will never see a 'naked' h1 etc.
>
> Also discussed this issue here: asciidoctor/asciidoctor#184

You could handle this easily with a filter that strips out the outer Divs before the EPUB writer sees it.
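
A minimal sketch of such a filter, assuming the pandoc-types package is installed and that the wrapper Divs carry Asciidoctor-style class names such as `sect1`, `sectionbody`, or `paragraph`. The class list and the file name `unwrap.hs` are assumptions for illustration; adjust the list to whatever classes the HTML actually contains.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- unwrap.hs (hypothetical file name): replace wrapper Divs with their
-- contents so that headings are no longer hidden inside a Div.
import Text.Pandoc.JSON

-- Unwrap any Div that carries at least one of the wrapper classes.
unwrap :: Block -> [Block]
unwrap (Div (_, classes, _) blocks)
  | any (`elem` wrapperClasses) classes = blocks
  where
    -- Assumed class names; not an exhaustive list of what Asciidoctor emits.
    wrapperClasses = ["sect1", "sect2", "sectionbody", "paragraph"]
unwrap block = [block]  -- every other block is left unchanged

main :: IO ()
main = toJSONFilter unwrap
```

It could be run as `pandoc input.html -t epub3 --filter unwrap.hs -o out.epub`, where the input and output names are placeholders. Matching on class membership rather than on an exact class list means a Div with extra classes is still unwrapped, which may or may not be what you want.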

> The ePub file I created using this method did not pass epubcheck 3.0.1 because it contained one duplicate ID (something I could have avoided) and, more seriously, the way footnotes are handled is not compliant, and raised several errors like this:
>
> ERROR: …/EPUB.epub/ch003.xhtml(2918,1223): '_footnote_4': fragment identifier is not defined in 'ch003.xhtml'
> ERROR: …/EPUB.epub/ch008.xhtml(2065,27): '_footnoteref_1': fragment identifier is not defined in 'ch008.xhtml'

When I convert pandoc's README to epub3, I see no errors with epubcheck 3.0.1 (and README has several footnotes).

My guess is that the HTML footnotes produced by asciidoctor are not read by pandoc as native pandoc footnotes, and that is the underlying issue.

If you attach a short sample file (of HTML produced by asciidoctor), we could confirm that.

Unfortunately, there's no standard way of doing footnotes in HTML, so the HTML reader never produces a Note element.

benhourigan commented 9 years ago

Hope this is a sufficient sample:

<div class="paragraph">
<p>… the kind of politics that the liberal economist F. A. Hayek called &#8220;socialist.&#8221; <span class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</span></p>
</div>

jgm commented 9 years ago

Could you attach or link to the generated (noncompliant) epub itself?

jgm commented 9 years ago

By the way, here's a simple filter (undiv.hs) that will remove your content divs. Run with --filter undiv.hs:

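{-# LANGUAGE OverloadedStrings #-}  -- needed where pandoc-types uses Text
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter undiv
  where
    undiv :: Block -> [Block]
    -- Replace a Div whose only class is "content" with its contents.
    undiv (Div (_, ["content"], _) bs) = bs
    undiv b = [b]

jgm commented 9 years ago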

Depending on how asciidoc formats the notes, you may be able to get the HTML reader to parse them as notes. If you use -f html+epub_html_exts, then pandoc will interpret an element with the type attribute set to footnote or rearnote as a note, and an element with the type attribute set to noteref as a note reference, where the href attribute is an internal link to the corresponding footnote or rearnote. It looks as if asciidoc doesn't quite do it that way, but you could use a filter to add the needed type attributes, and then you'd be there.

jgm commented 9 years ago

> you could use a filter to add the needed type attributes, and then you'd be there.

Sorry, this is a bit misleading. Since a filter is applied only after the HTML reader, this wouldn't work unless you first filtered, then piped the resulting HTML into another invocation of pandoc. Anyway, there are numerous tools you could use to insert the type attribute where it's needed in the HTML, before passing to pandoc.

jgm commented 9 years ago

Or maybe asciidoctor could be persuaded to insert the needed type attributes in the HTML.

jgm commented 9 years ago

Actually, rather than the epub itself, it would be most useful for me to have the HTML from which it was generated.

benhourigan commented 9 years ago

Thanks for the filter. Will try this out. You can get the HTML file from https://www.dropbox.com/s/c2ror63pz16hc3w/2015-07-06-BH-STG-adoc-test.html?dl=0

jgm commented 9 years ago

I tried:

% pandoc adoc-test.html -t epub3 -o adoc.epub
% epubcheck adoc.epub
Epubcheck Version 3.0.1

Validating against EPUB version 3.0
ERROR: adoc.epub: could not parse ch006.xhtml: duplicate id: cracks

Check finished with warnings or errors

So I edited adoc-test.html and changed one of the duplicate cracks ids to cracks2. I then regenerated the epub using pandoc and epubcheck gave no validation errors. Are you using the latest version of pandoc?

jgm commented 9 years ago

PS. You might have more success using asciidoc to produce DocBook, then converting that with pandoc. Have you tried that route?

benhourigan commented 9 years ago

Ah, damn---I didn't think to do a version check. Sorry for being such a novice. I'd been using 1.13.2, which is the latest version on homebrew. Will install 1.15 and try again. Your results sound promising.

benhourigan commented 9 years ago

BTW, as of pandoc 1.13.2, when I tried the asciidoctor docbook > pandoc epub route the output from docbook was inferior to the output from HTML. One particular thing that I noticed was that admonition blocks came in to the epub as blockquotes without an additional class, and so couldn't be styled specifically with CSS.

jgm commented 9 years ago

Currently DocBook elements like <important>, <caution>, <note>, <tip> are rendered as a block quote starting with a single paragraph with the word "Important", "Caution", "Note", or "Tip" in strong. I'm not sure this is ideal; we could switch to using divs. However, even with the present setup, it would be simple to intercept these block quotes in a filter and change them to divs. Just look for

BlockQuote (Para [Strong [Str "Important"]] : xs)

and convert that to

Div ("", ["admonition"], []) xs
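
Put together as a complete filter, this could look like the following sketch. It assumes pandoc-types is installed; the file name `admonition.hs` is hypothetical, and the four labels are the ones named above. As in the pattern above, the label paragraph itself is dropped and only the remaining blocks go into the Div.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- admonition.hs (hypothetical file name): turn DocBook-derived admonition
-- block quotes into Divs that can be targeted from CSS.
import Text.Pandoc.JSON

admonition :: Block -> Block
admonition (BlockQuote (Para [Strong [Str label]] : rest))
  | label `elem` labels = Div ("", ["admonition"], []) rest
  where
    -- Labels the DocBook reader is described as emitting in this thread.
    labels = ["Important", "Caution", "Note", "Tip"]
admonition block = block  -- ordinary block quotes are left alone

main :: IO ()
main = toJSONFilter admonition
```

Something like `pandoc -f docbook input.xml -t epub3 --filter admonition.hs -o out.epub` would apply it (file names are placeholders). A stylesheet rule on `div.admonition` can then style these blocks separately from ordinary block quotes.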

benhourigan commented 9 years ago

With the duplicate ID sorted in the .adoc file, pandoc 1.15 produces from the asciidoctor HTML an epub that passes epubcheck! :)

Inability to split the file at chapter headings remains an issue. Now trying the undiv filter. At first I got this error:

undiv.hs: createProcess: runInteractiveProcess: exec: does not exist (No such file or directory)

Then after installing ghc (7.8.4) and cabal-install (1.22.0.0) (not sure if latter was necessary) from homebrew, I got this error.

pandoc: Error running filter undiv.hs
fd:4: hPutBuf: resource vanished (Broken pipe)

Oddly, if I run chmod +x undiv.hs, I go back to getting the first error again.

All files used during generation here: https://www.dropbox.com/s/40uv3f4ad2fwuco/bh-undiv.hs-test.zip?dl=0 I'm running a script called generate.sh, which is just the following command:

pandoc -t epub --filter undiv.hs --epub-cover-image=cover.png --epub-stylesheet=epub.css --epub-metadata=metadata.xml --epub-chapter-level=1 -o EPUB.epub TEST-v32-Justin-Comments-In.html

jgm commented 9 years ago

Is undiv.hs in your working directory?

What OS are you on?

benhourigan commented 9 years ago

It is in the working directory, yes. On Mac OS X 10.10.4. Using haskell from ghc installed via homebrew.

Am just about to install the haskell platform from https://www.haskell.org/platform/download/2014.2.0.0/Haskell%20Platform%202014.2.0.0%2064bit.signed.pkg to see if that helps.

benhourigan commented 9 years ago

Hmm, installing that version of haskell and running activate-hs to go back to ghc 7.8.3 didn't change anything.

jgm commented 9 years ago

This shouldn't be necessary, but try adding a shebang line to the top

#!/usr/bin/env runhaskell

and chmod +x. Then invoke with --filter ./undiv.hs.

benhourigan commented 9 years ago

Thanks! Making progress, perhaps. Made those changes and undiv.hs now reads:

#!/usr/bin/env runhaskell
import Text.Pandoc.JSON

main = toJSONFilter undiv
  where undiv (Div (ident, ["content"], kvs) bs) = bs
        undiv b = [b]

I believe the script is now being located correctly. The error I now get is:

pandoc: Error running filter ./undiv.hs
fd:4: hPutBuf: resource vanished (Broken pipe)

runhaskell --version returns runghc 7.8.4. Pandoc is 1.15.

Could this be something to do with the content of the HTML input, or is it something else?

Sorry to bother you with all this—I'm just keen to see if it works.

jgm commented 9 years ago

Try running the filter as a regular pipe:

pandoc -t json -s TEST-v32-Justin-Comments-In.html | runhaskell undiv.hs | pandoc -f json -s -t epub --epub-cover-image=cover.png --epub-stylesheet=epub.css --epub-metadata=metadata.xml --epub-chapter-level=1 -o EPUB.epub

This should give you better error reporting.

adeluccar commented 9 years ago

I haven't had many issues going the asciidoctor > docbook > pandoc > epub way—for now. @mpickering Thank you for that suggestion above.

zaxebo1 commented 9 years ago

@jgm, in your comment at https://github.com/jgm/pandoc/issues/1456#issuecomment-118644600 you stated:

> Or maybe asciidoctor could be persuaded to insert the needed type attributes in the HTML.

Can you point me to a URL listing ALL the type attributes expected by pandoc's HTML reader? Then we could review and enhance asciidoctor's HTML output in one pass so that it meets the expectations of pandoc's HTML input.

jgm commented 9 years ago

@zaxebo1 - see the comment two above the comment you link to.

I'm just going on a glance at the source code here. You should try it first by manually modifying the HTML and running it through

pandoc -f html+epub_html_exts

to make sure this works. I think it should, looking at the source, but we may need some adjustments. (And, I'm open to making this feature not require +epub_html_exts -- it seems pretty harmless to turn this kind of footnote detection on for all HTML parsing.)

zaxebo1 commented 9 years ago

Thanks for the reply. I probably did not convey my question properly in my earlier comment.

What I really meant was not just these two "types". I was asking with reference to:

As I have no Haskell skills, what I effectively meant was: within the source code of https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Readers/HTML.hs and https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Readers/EPUB.hs, how do I know which other "type" values are expected by the HTML reader (apart from the already discussed footnote and noteref)? That is, is there an array containing all the "type" values expected by the HTML reader? Do they derive from the same class, or do they ..? I am just looking for some pointers.

By what pattern in the source code of HTML.hs should I recognise the "type" expectations of pandoc's HTML reader? That is, how can I identify ALL the input "type" expectations of the HTML reader, so that we can do a full review of the "types" in asciidoctor's HTML output (apart from the already discussed footnote and noteref "types")?

jgm commented 9 years ago

The only other uses of the type attribute (with epub extension enabled) are:

type="chapter" on "article", "aside", "nav", "section"

and

type="titlepage" on "p", "hr", "pre", "blockquote", "ol", "ul", "li", "dl", "dt", "dd", "figure", "figcaption", "div", or "main" (this basically just gets ignored by the reader).

I think the footnote ones are the only ones you really need to worry about.

kastork commented 8 years ago

I can't tell from this thread of discussion -- Is it the plan for asciidoc to eventually become a direct input format possibility for pandoc?

I get the feeling from this thread that perhaps that idea has been abandoned and instead the decision is to ensure that a2x-generated html or docbook is handled correctly by pandoc.

zaxebo1 commented 8 years ago

@mpickering, on 1st Aug 2014 you mentioned that you had started working on support for asciidoc/asciidoctor input. Now, in Feb 2016, may I humbly and eagerly ask whether you would like to share your progress in this direction? Do you perhaps have something that we can already use to some extent?

stasberkov commented 8 years ago

I need this too!

kwlanham commented 8 years ago

AsciiDoc is great for technical writing, and it would be awesome if pandoc had a reader for asciidoc too.

mpickering commented 8 years ago

@zaxebo1

I am no longer working on this and don't plan to return to it.

zaxebo1 commented 8 years ago

ohh :sob: :sob: :sob: :sob: :sob: :sob: :cry: :disappointed:

tarleb commented 8 years ago

Just FYI: I started to work on huskydoc, a Haskell implementation of Asciidoc. It won't lead to an Asciidoc reader anytime soon, although I hope that it can be integrated into pandoc at some point in the future. The program currently can only output pandoc json format, so it can be used like this:

huskydoc test.adoc | pandoc -f json -t html

For now huskydoc is nothing but a personal learning project. It works okay, but it is missing many features, is full of bugs, contains questionable design choices, and uses experimental libraries. I'd love it if people would try it and give me feedback, but I strongly advise against using it for anything important.

The best choice for handling Asciidoc with Pandoc is still to convert to docbook with asciidoctor and to feed the result into Pandoc:

asciidoctor -o - -b docbook your-file.adoc | pandoc -f docbook

EDIT: s/native/json/g

zaxebo1 commented 8 years ago

@tarleb:
That's really wonderful to know. Kudos!

jgm commented 8 years ago

@tarleb you might consider having it emit pandoc JSON. JSON serialization/deserialization is much faster than read/show, which is why I used it for filters.

tarleb commented 8 years ago

Thanks for the feedback @jgm. It's emitting JSON now.

zaxebo1 commented 8 years ago

Now that you are emitting pandoc JSON, why not integrate huskydoc into pandoc itself?

tarleb commented 8 years ago

There are three reasons:

I am planning to address these issues over the next months. I'd like to polish the library some more before I start to address compatibility issues. Being able to experiment is part of the reason this project exists.

The result of using the huskydoc executable is identical to what would be produced if pandoc were calling the library directly, so the method described above should hopefully be acceptable for now.

hobson commented 7 years ago

+1

tajmone commented 7 years ago

This thread is very interesting. I'd also love to see pandoc support an AsciiDoc reader.

As for the issue of AsciiDoctor wrapping paragraphs in <div> tags, I remember stumbling on it in the past and doing some research. It's possible to work around this by using custom templates:

84. Provide Custom Templates

Asciidoctor allows you to override the converter methods used to render almost any individual AsciiDoc element.

This can be easily achieved using HAML and asciidoctor-backends, and it lets you change how elements are rendered into HTML by targeting single elements.

For example, to change how paragraphs are formatted in Asciidoctor's final output, you only need to add a modified version of this single file:

https://github.com/asciidoctor/asciidoctor-backends/blob/master/haml/html5/block_paragraph.html.haml

Look inside it:

%div{:id=>@id, :class=>['paragraph', role]}>
  - if title?
    .title=title
  %p<=content

You only need to remove the %div... part and the divs wrapping paragraphs won't be rendered. Somehow those divs were meant for cases where the paragraph had special attributes, but there is no conditional check, so they are emitted even when not required.

Another bad thing about these divs is that they make CSS styling really annoying.

The cool thing about using custom templates and backends is that only the needed files that you actually put in your custom template folder will be used, for the missing files it will fallback on the default. So there is no need to re-implement the whole template system.

I wish I could share more info on how to do it, but I researched this quite a long time ago and my memory is not fresh on the issue.

It is sad, though, that the AsciiDoctor project has been stuck for so long on developing the chunked (multi-page) HTML output feature --- it looks stalled right now.

Asciidoc FX

Those interested in a quick way to convert Asciidoc to HTML without having to install dependencies (not even AsciiDoc/AsciiDoctor) should look into Asciidoc FX: it's a cross-platform AsciiDoc editor (also available as a standalone app) that can convert Asciidoc documents to standalone (and templated) HTML5 docs (including syntax highlighting with Highlight.js):

http://www.asciidocfx.com/

It's a Java app that bundles AsciiDoctor and DocBook (no HTML output support though!), plus other tools --- thus sparing you from having to install anything. And it doesn't conflict with any locally installed version of AsciiDoctor, Asciidoc, etc.

I use pandoc to convert to AsciiDoc and then with Asciidoc FX I just open and save as HTML5 --- and I get a fully standalone document, with a nice template.

Hope this might help....

jgm commented 7 years ago

+++ Tristano Ajmone [Mar 06 17 14:12 ]:

> The cool thing about using custom templates and backends is that only the needed files that you actually put in your custom template folder will be used, for the missing files it will fallback on the default. So there is no need to re-implement the whole template system.

I implemented a very similar system in jgm/gitit using HStringTemplate. The ability to have reusable template parts is a common request in pandoc.