jgm / pandoc

Universal markup converter
https://pandoc.org

docx writer does not output chart tag from docx file like an image tag. #3221

Closed manoj-compro closed 8 years ago

manoj-compro commented 8 years ago

I am using the following commands to convert a docx file to JSON/HTML:

pandoc -S "./Standard.docx"  -t json -o "./output.json"
pandoc -S "./Standard.docx"  -t html -o "./output.html"

I did not find any tag in the JSON/HTML output corresponding to the chart element in the docx file. Please find attached the docx file that I am using: input.docx

jkr commented 8 years ago

What sort of output would you imagine that this would produce? It doesn't really correspond to anything in Pandoc.

manoj-compro commented 8 years ago

For example, if a docx file contains a picture, then there is an image tag in the output JSON. But there is no tag corresponding to a chart in the output JSON.

manoj-compro commented 8 years ago

It could be an issue with the docx reader, as it is not reading the chart element in the docx file.

jgm commented 8 years ago

I believe charts in Word documents use an entirely distinct XML format that describes the chart. I think dealing with this is out of the scope of pandoc. (Pandoc can't be expected to contain a complete image renderer for Microsoft Chart XML.) I think the best we can do would probably be to insert some sort of placeholder like "[CHART]", but let's see what @jkr says; he's the expert on the docx reader.

jkr commented 8 years ago

It's not rendering it as an image because it's not an image. It's a chart, with its own (point-based) xml language for describing it. If we wanted to turn it into an image, we'd have to render it, which is outside the scope of pandoc. I could take a look at https://hackage.haskell.org/package/Chart. But @jgm, would I be right in thinking that lens is too much to pull in for such an edge use?

jgm commented 8 years ago

See http://officeopenxml.com/drwOverview.php

jkr commented 8 years ago

The only other place-holder option would be to see if there was some consistent way to pull the point-data out, and present it in some sort of spreadsheet form, as a pandoc table.
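As a rough illustration of pulling the point data out, here is a minimal Python sketch (not pandoc code). It assumes the chart part caches its values in the usual DrawingML layout, with `c:cat`/`c:val` (or `c:xVal`/`c:yVal`) under each `c:ser`; the helper names `cached_points` and `chart_rows` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# DrawingML chart namespace used by chart parts in OOXML documents.
C = "http://schemas.openxmlformats.org/drawingml/2006/chart"
NS = {"c": C}


def cached_points(series, tag):
    """Return the cached <c:v> texts under series/<c:tag>, in document order."""
    node = series.find(f"c:{tag}", NS)
    if node is None:
        return []
    return [v.text for v in node.findall(".//c:pt/c:v", NS)]


def chart_rows(chart_xml):
    """Collect (category, value) pairs from every series in a chart part."""
    rows = []
    root = ET.fromstring(chart_xml)
    for series in root.iter(f"{{{C}}}ser"):
        # Line/bar/pie charts use cat/val; scatter charts use xVal/yVal.
        cats = cached_points(series, "cat") or cached_points(series, "xVal")
        vals = cached_points(series, "val") or cached_points(series, "yVal")
        rows.extend(zip(cats, vals))
    return rows


# Hypothetical, heavily trimmed chart part for demonstration only.
SAMPLE = """<c:chartSpace xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart">
  <c:chart><c:plotArea><c:lineChart>
    <c:ser>
      <c:cat><c:strRef><c:strCache>
        <c:pt idx="0"><c:v>Q1</c:v></c:pt>
        <c:pt idx="1"><c:v>Q2</c:v></c:pt>
      </c:strCache></c:strRef></c:cat>
      <c:val><c:numRef><c:numCache>
        <c:pt idx="0"><c:v>4.3</c:v></c:pt>
        <c:pt idx="1"><c:v>2.5</c:v></c:pt>
      </c:numCache></c:numRef></c:val>
    </c:ser>
  </c:lineChart></c:plotArea></c:chart>
</c:chartSpace>"""

print(chart_rows(SAMPLE))
```

The rows come back as strings, so whether they end up in a pandoc table or anything else is left to the caller.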

jgm commented 8 years ago

+++ Jesse Rosenthal [Nov 10 16 02:56 ]:

It's not rendering it as an image because it's not an image. It's a chart, with its own (point-based) xml language for describing it. If we wanted to turn it into an image, we'd have to render it, which is outside the scope of pandoc. I could take a look at https://hackage.haskell.org/package/Chart. But @jgm, would I be right in thinking that lens is too much to pull in for such an edge use?

Well, it might be inevitable to pull in lens eventually, since more and more packages depend on it.

However, it seems to me that creating a correct renderer for MS charts would be a huge job. Do you really think it's feasible?

jkr commented 8 years ago

However, it seems to me that creating a correct renderer for MS charts would be a huge job. Do you really think it's feasible?

Depends on the extent of the spec, which I haven't really been able to hunt down. The example posted above is fairly straightforward: (x,y) points and description of the line through them. I'd probably have to just create a bunch to get a feel for what it does. It seems like there are haskell libs that would be up to the challenge.

But using any of those libraries would also mean that we'd need gtk2hs and cairo (or something similar), which is heavy and, in my limited experience, a source of hard-to-track bugs.

What I think might be more useful than a renderer would be a converter, either in texmath or in its own library. So rather than render, we could produce tikz or the like, and let LaTeX do the rendering. And vice versa. To html5 canvas? Maybe gnuplot as the intermediate form?
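To make the converter idea concrete, here is a small Python sketch that turns a list of (x, y) points into a TikZ picture and leaves the rendering to LaTeX. It is only an illustration of the approach, not anything in pandoc or texmath; the function name `points_to_tikz` is hypothetical, and it assumes a single straight-segment line through the points with no axes, labels, or styling.

```python
def points_to_tikz(points):
    """Emit a tikzpicture that draws straight segments through the points."""
    # %g-style formatting keeps integers short (1 instead of 1.0).
    coords = " ".join(f"({x:g},{y:g})" for x, y in points)
    return (
        "\\begin{tikzpicture}\n"
        "  \\draw plot coordinates {" + coords + "};\n"
        "\\end{tikzpicture}"
    )


print(points_to_tikz([(0, 0), (1, 2.5), (2, 1)]))
```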

Certainly seems more complex the more I think about it, but it would be pretty cool too.

jkr commented 8 years ago

But the more I think about it, the more it seems like the most efficient and flexible thing we could do is to output some sort of data table, and let filters do with it as they will.

In any case, I'll investigate. In the meantime, do you think it makes more sense to have a visible [CHART] or an invisible empty div: <div class="chart"></div>?

jgm commented 8 years ago

Maybe both?

<div class="chart">[CHART]</div>

That gives you the opportunity to do something with it in a filter, while also giving a visual indication that something was not rendered.

jkr commented 8 years ago

Okay, I put in a change doing the above. It's a span instead of a div, since drawing elems are part of a paragraph. If we think it makes more sense to be a div, we might have to add a post-processing step to convert it.
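A filter could then pick up that placeholder. The following Python sketch walks pandoc's JSON AST with the standard library only and swaps the contents of any Span carrying the class "chart". It assumes the usual JSON encoding of a Span, `{"t": "Span", "c": [[id, classes, attrs], inlines]}`; the function names and the replacement text "(chart omitted)" are invented for the example.

```python
def is_chart_span(node):
    """True for a pandoc Span element whose class list includes "chart"."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    (_ident, classes, _attrs), _inlines = node["c"]
    return "chart" in classes


def rewrite(node):
    """Return a copy of the JSON tree with chart spans given new contents."""
    if isinstance(node, list):
        return [rewrite(child) for child in node]
    if isinstance(node, dict):
        if is_chart_span(node):
            attr = node["c"][0]
            # Keep the span and its attributes; replace only the inlines.
            return {"t": "Span", "c": [attr, [{"t": "Str", "c": "(chart omitted)"}]]}
        return {key: rewrite(value) for key, value in node.items()}
    return node


# Hand-written fragment shaped like pandoc JSON output, for demonstration.
doc = {"blocks": [{"t": "Para", "c": [
    {"t": "Span", "c": [["", ["chart"], []], [{"t": "Str", "c": "[CHART]"}]]}
]}]}
print(rewrite(doc))
```

To use it as a real filter, one would read the JSON document from stdin, write `rewrite(...)` of it to stdout, and pass the script to pandoc with `--filter`.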

I'd like to keep the discussion open, though, on a more useful way to deal with chart data. It looks like every chart in a Word document refers back to an Excel spreadsheet, though sometimes that spreadsheet isn't available anymore. So we could look at the possible sorts of data tables that build the (relatively few) sorts of charts, and output those, maybe wrapped in some sort of <div chart="pie-chart"> tag.
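For anyone who wants to check whether a given document still carries its spreadsheet, here is a small Python sketch that lists the chart parts and embedded workbooks inside a docx, which is a zip archive. The paths `word/charts/chartN.xml` and `word/embeddings/*.xlsx` are an assumption based on how Word typically names these parts; the package format locates parts through relationship files, so other producers may use different names. The function name `chart_parts` is hypothetical.

```python
import posixpath
import zipfile


def chart_parts(docx_file):
    """Return (chart part names, embedded workbook names) found in a docx."""
    with zipfile.ZipFile(docx_file) as archive:
        names = archive.namelist()
    charts = sorted(
        name for name in names
        if name.startswith("word/charts/")
        and posixpath.basename(name).startswith("chart")
        and name.endswith(".xml")  # skips the _rels/*.xml.rels files
    )
    workbooks = sorted(
        name for name in names
        if name.startswith("word/embeddings/") and name.endswith(".xlsx")
    )
    return charts, workbooks
```

Calling `chart_parts("input.docx")` on a file with charts but an empty workbook list would be one sign that only the cached values inside the chart XML are available.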