jgm / pandoc

Universal markup converter
https://pandoc.org

Conversion from wikimarkup to LaTeX more or less unusable #804

Closed molly closed 11 years ago

molly commented 11 years ago

I've been testing pandoc (1.11.1) on a number of Wikipedia articles to determine if we can use it to perform some conversions. I'm interested in converting from wikimarkup <-> LaTeX and from HTML (output by Parsoid) <-> LaTeX.

Granted, the articles I've been testing are generally "challenging" (lots of templates, tables, odd syntax, etc.), but I've been unable to get much in the way of successful output. Am I missing a flag or something, or is it just that the wikimarkup and/or TeX support is in its early stages?

I've been keeping notes on the tests here.

jgm commented 11 years ago

Mediawiki parsing is of course a complex task. Pandoc's support is incomplete. Some things, e.g. template parsing or tables within tables, will probably never be supported. Pandoc converts everything to a simplified intermediate document model which can't represent all of the features mediawiki supports.

It works well for simple mediawiki pages, e.g. most of the pages in the Haskell wiki. I don't expect it to work well for complex wikipedia pages. The mediawiki reader is fairly new and has not been extensively tested. You can help improve it by reporting specific issues on this tracker -- one bug report per issue, please.

Some of the things that you find to be missing are things pandoc can't parse, so it stores them as raw mediawiki markup in the AST. This is just omitted in HTML output, but it could in principle be processed by a script and converted into something more useful. (That is how I envision templates being handled.) The script would be easiest to write in Haskell, but since pandoc can export JSON, any programming language could be used.
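As an illustration of the script idea above, here is a minimal Python sketch that walks pandoc's JSON output and hands every raw mediawiki node to a handler. It assumes the `{"t": ..., "c": ...}` element encoding used by recent pandoc releases (the JSON produced by 1.11.1 is laid out differently), and the names `rewrite_raw` and `show_as_code` are hypothetical, not part of pandoc.

```python
import json
import sys

MEDIAWIKI = "mediawiki"


def rewrite_raw(node, handler):
    """Recursively copy a pandoc JSON AST, replacing raw mediawiki nodes.

    Assumption: elements look like {"t": "RawInline", "c": [format, text]}.
    """
    if isinstance(node, list):
        return [rewrite_raw(child, handler) for child in node]
    if isinstance(node, dict):
        if node.get("t") in ("RawInline", "RawBlock"):
            fmt, text = node["c"]
            if fmt == MEDIAWIKI:
                return handler(node["t"], text)
            return node  # raw content in another format is left alone
        return {key: rewrite_raw(value, handler) for key, value in node.items()}
    return node  # strings, numbers, booleans, None


def show_as_code(kind, text):
    """Hypothetical handler: keep the unparsed markup visible as code."""
    empty_attr = ["", [], []]  # (identifier, classes, key-value pairs)
    tag = "Code" if kind == "RawInline" else "CodeBlock"
    return {"t": tag, "c": [empty_attr, text]}


def main():
    # Reads a JSON document on stdin and writes the rewritten one to stdout.
    doc = json.load(sys.stdin)
    json.dump(rewrite_raw(doc, show_as_code), sys.stdout)
```

A script that calls `main()` could sit between two pandoc invocations, one writing with `-t json` and one reading with `-f json`. A real template handler would replace `show_as_code` with something that understands the template's parameters.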

Images aren't included in articles. The documentation suggests that images will be downloaded if the standalone flag is set, but they are not. LaTeX attempts to find them in the directory from which it's building, and when it's unable to do so, the build fails.

As the documentation indicates, --self-contained currently only works for HTML output. It creates 'data:' URIs. There's nothing comparable for LaTeX. It would not be too hard to add automatic downloading for PDF output in the future. For now, you'll have to download the images yourself, which is something you could script.
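One way such a download script might look, as a rough Python sketch: it collects the targets of all Image nodes from pandoc's JSON output and fetches the ones that are absolute http(s) URLs into the current directory. It assumes the `{"t": "Image", "c": [...]}` encoding of recent pandoc releases, with the `[url, title]` pair as the last element. Wikipedia `File:` names would first have to be resolved to URLs, which this sketch does not attempt. All function names are hypothetical.

```python
import posixpath
import urllib.request
from urllib.parse import unquote, urlparse


def image_urls(node):
    """Yield the target of every Image node in a pandoc JSON AST."""
    if isinstance(node, list):
        for child in node:
            yield from image_urls(child)
    elif isinstance(node, dict):
        if node.get("t") == "Image":
            url, _title = node["c"][-1]  # assumed: target is the last element
            yield url
        for value in node.values():
            yield from image_urls(value)


def local_name(url):
    """File name to save a URL under: the last path segment, percent-decoded."""
    return unquote(posixpath.basename(urlparse(url).path))


def download_images(doc, fetch=urllib.request.urlretrieve):
    """Fetch each remote image once; return a {url: local file name} mapping."""
    mapping = {}
    for url in image_urls(doc):
        if url in mapping or urlparse(url).scheme not in ("http", "https"):
            continue  # already fetched, or a local/relative path
        mapping[url] = local_name(url)
        fetch(url, mapping[url])
    return mapping
```

The returned mapping can then be used to rewrite the image targets to the local file names before the document is passed on to the LaTeX writer. Name collisions between different URLs are not handled here.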

Many accented/special characters aren't recognized.

Examples? Are the pages UTF-8 encoded? (That is a documented requirement for pandoc.)
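A quick way to check the encoding before running pandoc is to try decoding the file as UTF-8 and re-encode it only if that fails. The sketch below is a guess at a convenient helper, not anything pandoc provides. The `cp1252` fallback is an assumption about how the saved pages were encoded and should be changed to whatever the source actually used.

```python
def to_utf8(data, fallback="cp1252"):
    """Return `data` (bytes) as UTF-8 encoded bytes.

    Input that already decodes as UTF-8 is returned unchanged. Anything else
    is assumed to be in `fallback`, which may itself raise UnicodeDecodeError
    if that assumption is wrong.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode(fallback).encode("utf-8")
```

The bytes read from a saved page (opened in binary mode) can be passed through `to_utf8` and written back out before conversion.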

footnotes themselves incomplete or empty

Probably because they rely heavily on templates for citations etc.

Spaces before IBX params causes pandoc to fail completely.

Please give an example allowing me to reproduce this.

Misplaced \noalign, \cr, & all over the place.

Example would be nice. Pandoc won't be able to parse all mw tables, but it shouldn't produce bad LaTeX. What version of pandoc are you using, btw?

Links are sometimes split between two lines, causing the \href{} command to break.

Example?

The tags look lovely—they even have syntax highlighting! Leading spaces are ignored completely, which is very confusing.

Example?

raw url is very broken, leaving the URL and some fragments of wikimarkup in the output.

What should the output be?

|-style=... causes the pandoc build to fail.

Probably not hard to have pandoc ignore this.

Chinese characters

You need to use xelatex and a font that has the Chinese glyphs for these to work well in LaTeX.

molly commented 11 years ago

You're right regarding UTF-8. The first few pages were not UTF-8 encoded, which is why the Acetic acid page was causing issues there. Everything from List of fictional doctors and on is UTF-8 encoded, which I think solved the character issues (except for the Chinese, obviously).

You're right about the citations as well -- it appears that the ones that are missing are the ones that use {{cite}} templates and such. Unfortunately, that's quite a few pages.

Regarding the spaces before infobox parameters, see the wikitext here. When you try to convert that to LaTeX, pandoc will complain about an unexpected " ".

Regarding misplaced \noalign, \cr, etc., that is caused by converting the wikitext at List of fictional doctors. This text is also what causes the \href tags to break. I'm using pandoc 1.11.1.

The leading spaces issue is caused by text from Lazy evaluation. You can see how Wikipedia handles this same text here.

See my comments on raw URLs here.

Thanks a lot for your comprehensive response! :)

jgm commented 11 years ago

Thanks for this. Infoboxes are wikipedia-specific, and hence not supported. But I might support them. Why, out of interest, do you surround the two infoboxes with {| and |} (which I thought was for tables)?

jgm commented 11 years ago

The <ref>URL</ref> problem was a regression. Thanks for calling it to my attention. Fixed in 099b4b776985e23bffb06b3dca3a697d3fde2a41.

molly commented 11 years ago

That infobox is somewhat unique, in that the two templates combine to create a table. They have to be wrapped in {| and |}, or else it doesn't know it's a table and you end up with this.

jgm commented 11 years ago

OK, I've fixed parsing of preformatted indented blocks, so they no longer require preceding and following blank lines, as they did before. The lazy evaluation text converts well now.
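To illustrate the behaviour described here (this is not pandoc's actual parser, which is written in Haskell), the following Python sketch groups lines into preformatted and ordinary runs purely by whether they start with a space, so no surrounding blank lines are needed. The function name is hypothetical, and blank lines are simply treated as ordinary text.

```python
def split_preformatted(lines):
    """Group lines into ("pre", [...]) and ("text", [...]) runs.

    A line that starts with a space is preformatted. Only that single
    leading space is stripped, so deeper indentation is preserved.
    """
    blocks = []
    for line in lines:
        kind = "pre" if line.startswith(" ") else "text"
        content = line[1:] if kind == "pre" else line
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(content)  # extend the current run
        else:
            blocks.append((kind, [content]))  # start a new run
    return blocks
```

With this grouping, an indented block that directly follows a paragraph line still becomes its own preformatted run.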

jgm commented 11 years ago

I've improved the table parser. Now the list of doctors produces working LaTeX, except for two things: hard line breaks seem not to work inside \href{..} (Dr. Dolittle), and a special accented o doesn't work with pdflatex.

There's also a problem with too-wide column widths, which needs to be looked into.

jgm commented 11 years ago

TODO:

molly commented 11 years ago

Very cool! I'm hoping to take a deeper look into the pandoc source code this weekend and maybe see if I can help out a bit with the wikimarkup. No promises on that front, though—I'm a Python/C++ girl, and have never used Haskell or similar.

jgm commented 11 years ago

I've isolated the problem with Dr Dolittle in the table. It can be reproduced with this latex:

\begin{longtable}{ll}
\emph{\href{The Story of Doctor Dolittle}{The Story of Doctor
Dolittle:\\Being the History of His Peculiar Life at Home\\and
Astonishing Adventures in Foreign Parts Never Before Printed}} & eek\\
\end{longtable}

The problem is that \\, while okay in general inside an \href, causes problems inside a table cell, since \\ has a special role there: it ends the table row.

jgm commented 11 years ago

Details on the problem here: http://tex.stackexchange.com/questions/2441/how-to-add-a-forced-line-break-inside-a-table-cell

Unfortunately, it's a bit hard to solve in a nice, general way. Since we're already looking at alternative table packages, we might want to think about this problem in that context.

molly commented 11 years ago

I've been using the tabularx package for my personal parsing project for its variable-width columns, and I believe it also allows \newline in table cells without starting a new row.
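A rough Python sketch of that approach, under the assumption that every column is a tabularx X column (a paragraph-type column, where \newline breaks the line without ending the row). The helper names are hypothetical, the table width defaults to \textwidth, and the output needs \usepackage{tabularx} in the preamble.

```python
def render_row(cells):
    """Join cells into one table row, turning hard breaks into \\newline."""
    # Inside a cell, "\\" would end the row, so replace it before joining.
    safe = [cell.replace("\\\\", "\\newline ") for cell in cells]
    return " & ".join(safe) + " \\\\"


def render_tabularx(rows, width="\\textwidth"):
    """Render rows (lists of LaTeX cell strings) as a tabularx table."""
    ncols = len(rows[0])
    lines = ["\\begin{tabularx}{%s}{%s}" % (width, "X" * ncols)]
    lines += [render_row(row) for row in rows]
    lines.append("\\end{tabularx}")
    return "\n".join(lines)
```

Only the row terminator emitted by `render_row` remains a literal `\\`. Every break inside a cell becomes `\newline`.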

jgm commented 11 years ago

If I had column widths, pandoc would have no trouble rendering the tables even with the current LaTeX writer. The trouble is, what to do for a table where no width information is given -- either for the whole table or for the columns. tabularx needs to know the whole table's width. If you set it to textwidth, it will look very bad for many simple tables.

DirkHunniger commented 11 years ago

I wrote my own converter, since I think the internal tree representation in pandoc is not strong enough for that problem. I used the same language and parsing library as pandoc. It is available for both Linux and Windows:

http://de.wikibooks.org/wiki/Benutzer:Dirk_Huenniger/wb2pdf

jgm commented 11 years ago

I'm going to close this, as I've addressed the issues I think I can address. Feel free to open further issues for more specific problems that arise.