jgm / pandoc

Universal markup converter
https://pandoc.org

Crazy conversions #6501

Open SarahSiani-IT98 opened 4 years ago

SarahSiani-IT98 commented 4 years ago

Hello, I have a Word document that is converted to HTML in a strange way.

My text is this: cit_lit.docx

“We are all in the gutter, but some of us are looking at the stars” – Oscar Wilde, Lady Windermere’s Fan (1892)

“We cast a shadow on something wherever we stand, and it is no good moving from place to place to save things; because the shadow always follows. Choose a place where you won’t do harm – yes, choose a place where you won’t do very much harm, and stand in it for all you are worth, facing the sunshine.” – E. M. Forster, A Room with a View (1908)

“Life appears to me too short to be spent in nursing animosity or registering wrongs.” – Charlotte Brontë, Jane Eyre (1847)

“It was a bright cold day in April, and the clocks were striking thirteen.” – George Orwell, 1984 (1949)

“No one is useless in this world who lightens the burdens of another.” – Charles Dickens, Doctor Marigold (1874)

And it is converted to this:

<div data-custom-style="Testo citato"><p><span data-custom-style="Enfasi"><em>“We are all in the gutter, but some of us are looking at the stars”</em> – Oscar Wilde, Lady Windermere’s Fan (1892)</span></p></div>
The strong tag is lost.

<div data-custom-style="Testo citato"><p><span data-custom-style="Enfasi">“We cast a shadow on something wherever we stand, and it is no good moving from place to place to save things; because the shadow always follows. Choose a place where you won’t do harm – yes, choose a place where you won’t do very much harm, and stand in it for all you are worth, facing the sunshine.” – E. M. Forster, A Room with a View (1908)</span></p></div>
Both the strong and em tags are lost.

<div data-custom-style="Testo citato"><p><span data-custom-style="Enfasi">“Life appears to me <em>too short</em> to be spent in nursing animosity or registering wrongs.” – Charlotte Brontë, Jane Eyre (1847)</span></p></div>
Both the strong and em tags are lost, but "too short" is tagged em (???)

<div data-custom-style="Testo citato"><p><span data-custom-style="Enfasi"><em><strong>“It was a bright cold day in April, and the clocks were striking thirteen.”</strong></em> – George Orwell, 1984 (1949)</span></p></div>
All correct!

<div data-custom-style="Testo citato"><p><span data-custom-style="Enfasi"><em><strong>“No one is</strong></em> useless <em><strong>in this world who lightens the burdens of another.”</strong></em> – Charles Dickens, Doctor Marigold (1874)</span></p></div>
The word "useless" has neither the strong nor the em tag (WHY ?????)

This is the text converted without styles (only George Orwell is right :D):

<p><em>“We are all in the gutter, but some of us are looking at the stars”</em> – Oscar Wilde, Lady Windermere’s Fan (1892)</p>
<p><em>“We cast a shadow on something wherever we stand, and it is no good moving from place to place to save things; because the shadow always follows. Choose a place where you won’t do harm – yes, choose a place where you won’t do very much harm, and stand in it for all you are worth, facing the sunshine.”</em> – E. M. Forster, A Room with a View (1908)</p>
<p><em>“Life appears to me too short to be spent in nursing animosity or registering wrongs.”</em> – Charlotte Brontë, Jane Eyre (1847)</p>
<p><em><strong>“It was a bright cold day in April, and the clocks were striking thirteen.”</strong></em> – George Orwell, 1984 (1949)</p>
<p><em><strong>“No one is</strong> useless <strong>in this world who lightens the burdens of another.”</strong></em> – Charles Dickens, Doctor Marigold (1874)</p>

And I have many other cases like these. I'm going crazy! Can you help me out?

Thank you, SS

P.S. I use pandoc version 2.9.2.1 on macOS, but at university I have also tried it on Ubuntu.

tarleb commented 4 years ago

This looks related to #6452.

E.g., here's the markup for E. M. Forster:

      <w:r>
        <w:rPr>
          <w:rStyle w:val="Enfasi" />
        </w:rPr>
        <w:t>We cast a shadow on something wherever we stand, and
        it is no good moving from place to place to save things;
        because the shadow always follows. Choose a place where you
        won’t do harm – yes, choose a place where you won’t do very
        much harm, and stand in it for all you are worth, facing
        the sunshine.”</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="Enfasi" />
          <w:b w:val="false" />
          <w:bCs w:val="false" />
          <w:i w:val="false" />
          <w:iCs w:val="false" />
        </w:rPr>
        <w:t xml:space="preserve"> – E. M. Forster, A Room with a View (1908)</w:t>
      </w:r>

mb21 commented 4 years ago

Possibly also related to #6499 ?

jgm commented 4 years ago

What command line are you using, exactly? Are you using the styles extension, e.g. -f docx+styles?

jgm commented 4 years ago

@jkr could probably shed more light on this, but it may be that use of custom styles is incompatible with regular styling.

jkr commented 4 years ago

Well, what seems to be going on in the document is this: it's getting its bold from the paragraph style (Testo citato) and then turning it off explicitly in the character styles when it gets to the author. Along the way bold is sometimes explicitly set and other times not, so you get the different formats for Orwell and Dickens.

I'll try to look at why this is going on, but I didn't implement most of the recent work on styles -- did @lierdakil? Anyway, I'll try to poke around, but my schedule is a bit haywire these days, due to the lack of summer camps and whatnot.

niszet commented 4 years ago

I checked the original Word file in the Word application (Office 365 on Windows 10). The original docx file has sentences with and without italics, as shown in the attached figure.

The 2nd and 3rd sentences don't have italics, except that "too short" in the 3rd sentence is italic. And "useless" in the 5th sentence isn't italic.

So I think the em tags missing at the word level are caused by the original docx file.

@SarahSiani-IT98 san, could you check the file in your environment?

The custom style Enfasi is a character style, and it enables italic. Testo citato is a paragraph style, and it enables bold and italic. So, after a docx(+styles) -> md -> docx conversion, all characters are bold because of the paragraph style Testo citato.

Conversion without the +styles extension, pandoc -f docx cit_lit.docx -t html, outputs the following.

<p><em>“We are all in the gutter, but some of us are looking at the stars”</em> – Oscar Wilde, Lady Windermere’s Fan (1892)</p>
<p><em>“We cast a shadow on something wherever we stand, and it is no good moving from place to place to save things; because the shadow always follows. Choose a place where you won’t do harm – yes, choose a place where you won’t do very much harm, and stand in it for all you are worth, facing the sunshine.”</em> – E. M. Forster, A Room with a View (1908)</p>
<p><em>“Life appears to me too short to be spent in nursing animosity or registering wrongs.”</em> – Charlotte Brontë, Jane Eyre (1847)</p>
<p><em><strong>“It was a bright cold day in April, and the clocks were striking thirteen.”</strong></em> – George Orwell, 1984 (1949)</p>
<p><em><strong>“No one is</strong> useless <strong>in this world who lightens the burdens of another.”</strong></em> – Charles Dickens, Doctor Marigold (1874)</p>

As jgm-san said, with custom styles the output is incompatible with regular styling...

@SarahSiani-IT98 san, if you enable italic and bold in the character style Enfasi and disable italic and bold in the paragraph style Testo citato, you will get the expected docx and html.

The updated docx file is attached as an example.

cit_lit_2.docx

I use Pandoc 2.10.

niszet commented 4 years ago

I would like to add several notes. If you want to get <em> and <strong> tags explicitly, you should not use docx+styles; use pandoc -f docx cit_lit_2.docx -t html instead, because bold and italic are included in the style. The result is:

<p><em><strong>“We are all in the gutter, but some of us are looking at the stars”</strong></em> – Oscar Wilde, Lady Windermere’s Fan (1892)</p>
<p><em><strong>“We cast a shadow on something wherever we stand, and it is no good moving from place to place to save things; because the shadow always follows. Choose a place where you won’t do harm – yes, choose a place where you won’t do very much harm, and stand in it for all you are worth, facing the sunshine.”</strong></em> – E. M. Forster, A Room with a View (1908)</p>
<p><em><strong>“Life appears to me too short to be spent in nursing animosity or registering wrongs.”</strong></em> – Charlotte Brontë, Jane Eyre (1847)</p>
<p><em><strong>“It was a bright cold day in April, and the clocks were striking thirteen.”</strong></em> – George Orwell, 1984 (1949)</p>
<p><em><strong>“No one is useless in this world who lightens the burdens of another.”</strong></em> – Charles Dickens, Doctor Marigold (1874)</p>

But I found a mistake in the file cit_lit_2.docx when I used pandoc -f docx+styles cit_lit_2.docx -t html, so I have added an updated file, cit_lit_3.docx. ("Clear all styles" is needed before setting the new style...)

cit_lit_3.docx

The output of pandoc -f docx+styles cit_lit_3.docx -t html is the following.

<div data-custom-style="Testo citato">
<p><span data-custom-style="Enfasi">“We are all in the gutter, but some of us are looking at the stars”</span> – Oscar Wilde, Lady Windermere’s Fan (1892)</p>
</div>
<div data-custom-style="Testo citato">
<p><span data-custom-style="Enfasi">“We cast a shadow on something wherever we stand, and it is no good moving from place to place to save things; because the shadow always follows. Choose a place where you won’t do harm – yes, choose a place where you won’t do very much harm, and stand in it for all you are worth, facing the sunshine.”</span> – E. M. Forster, A Room with a View (1908)</p>
</div>
<div data-custom-style="Testo citato">
<p><span data-custom-style="Enfasi">“Life appears to me too short to be spent in nursing animosity or registering wrongs.”</span> – Charlotte Brontë, Jane Eyre (1847)</p>
</div>
<div data-custom-style="Testo citato">
<p><span data-custom-style="Enfasi">“It was a bright cold day in April, and the clocks were striking thirteen.”</span> – George Orwell, 1984 (1949)</p>
</div>
<div data-custom-style="Testo citato">
<p><span data-custom-style="Enfasi">“No one is useless in this world who lightens the burdens of another.”</span> – Charles Dickens, Doctor Marigold (1874)</p>
</div>

But if you want to get div, span, em and strong tags at the same time, I don't have any ideas...

lierdakil commented 4 years ago

@jkr, I did some work on docx reader about a year ago (#5732). I don't think I did much (if anything) with character styles though. I'll try to take a look, but like you, my schedule is a bit messy lately, so can't really promise I'll be able to do something in a timely manner.

lierdakil commented 4 years ago

After a quick look, @niszet's analysis is correct. Pandoc is ignoring run-style (aka character style) modifiers on paragraph-styles. I mean, we don't even have anywhere to put those in the data structure: https://github.com/jgm/pandoc/blob/804e8eeed2fbcd0b4a52ad908b8ccccf89563097/src/Text/Pandoc/Readers/Docx/Parse/Styles.hs#L114-L119

Curiously, run modifiers on paragraph styles shouldn't be particularly hard to thread through, at least as far as I can tell after a cursory look. I'll try to experiment with this probably next week.

lierdakil commented 4 years ago

Hmm. There's one curious caveat though.

Word is once again unlike everything else. Long story short, Word renders bold/italic character styles that are inside a paragraph with bold/italic run style as not bold/italic. For instance, the word useless in this screenshot is marked with Enfasi (i.e. "emphasis") style, which is defined as italic. However, it appears as non-italic because it is part of a paragraph with style Testo citato (i.e. "quoted text"), which is defined as bold and italic. But the "Charles Dickens ..." part is neither bold nor italic, because it is specified as such inline (as opposed to a named style).

Frankly this is all a bit of a mess, and no other software that is able to read docx and which I bothered to test behaves this way: Google Docs, LibreOffice, WordPad -- none of these "flip" the italic/bold flags. Hence, the document will render differently in Word and these others.

The question is, which behaviour do we want to implement? The "correct" one (assuming we accept Word as the reference implementation), or the more common one? The latter is a bit more straightforward, but not hugely so.
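
The "flipping" described above can be sketched roughly as follows. This is only a minimal Python illustration, assuming that Word combines the paragraph style's run properties and the character style by exclusive-or, and that inline formatting overrides both. The function name `effective_toggle` is hypothetical; this is neither pandoc's implementation nor a verified model of Word.

```python
from typing import Optional


def effective_toggle(para_style: bool, char_style: bool,
                     direct: Optional[bool] = None) -> bool:
    """Resolve one toggle property (bold or italic) for a run.

    Assumption: style-level values combine by exclusive-or, and an
    explicit inline value (True or False) wins over the styles.
    """
    if direct is not None:
        return direct
    return para_style != char_style


# "useless": paragraph style Testo citato is italic, character style Enfasi is italic
useless_italic = effective_toggle(para_style=True, char_style=True)

# "Charles Dickens ...": italic is switched off inline
author_italic = effective_toggle(para_style=True, char_style=True, direct=False)

# A run that only inherits the paragraph style
plain_italic = effective_toggle(para_style=True, char_style=False)
```

Under these assumptions "useless" comes out non-italic even though both styles ask for italic. A reader that simply ORs the flags together would render it italic instead.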

lierdakil commented 4 years ago

@SarahSiani-IT98, out of curiosity, how did you produce cit_lit.docx? It doesn't look much like anything I've seen Word do, so I'm guessing it's not made in Word?

niszet commented 4 years ago

Following @lierdakil san's comment, I checked the document, and I found the following line in docProps/app.xml of the unzipped docx.

<Application>LibreOffice/6.4.4.2$MacOSX_X86_64 LibreOffice_project/3d775be2011f3886db32dfd395a6a6d1ca2630ff</Application>
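
The same check can be done without unzipping by hand. The following is a minimal Python sketch, assuming the `Application` element sits directly under the root of docProps/app.xml in the extended-properties namespace; the function name `producing_application` is hypothetical, and the demo builds a tiny in-memory archive rather than reading cit_lit.docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml)
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def producing_application(docx):
    """Return the text of <Application> in docProps/app.xml, or None."""
    with zipfile.ZipFile(docx) as archive:
        try:
            data = archive.read("docProps/app.xml")
        except KeyError:  # the part is missing from the archive
            return None
    node = ET.fromstring(data).find(f"{{{EP_NS}}}Application")
    return node.text if node is not None else None


# Demo: an in-memory stand-in for a docx, using the string quoted above
APP_STRING = ("LibreOffice/6.4.4.2$MacOSX_X86_64 "
              "LibreOffice_project/3d775be2011f3886db32dfd395a6a6d1ca2630ff")
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "docProps/app.xml",
        f'<Properties xmlns="{EP_NS}"><Application>{APP_STRING}</Application></Properties>',
    )
buffer.seek(0)
demo_application = producing_application(buffer)
```

With a real file, `producing_application("cit_lit.docx")` should return the same kind of string, since `zipfile.ZipFile` accepts a path as well as a file-like object.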

I opened the original docx in LibreOffice 6.4.3.2 on Windows 10 (I don't have a Mac), and it is rendered "correctly", as @SarahSiani-IT98 san said.

So, as @lierdakil san said, this is a rendering difference between applications. (In my opinion, the "correct" behaviour for docx is Word's, but...)

lierdakil commented 4 years ago

For anyone interested, I've opened PR #6504. I'm not sure about some points and would be grateful for comments/critique.

mb21 commented 4 years ago

The question is, which behaviour do we want to implement? The "correct" one (assuming we accept Word as the reference implementation), or the more common one? The latter is a bit more straightforward, but not hugely so.

My feeling is that if the common one had been clearly easier to implement, it would be fine, as we could point out to people who report this bug in the future that LibreOffice etc. also don't handle it... but if we can get proper Word-compatible behavior with a reasonable amount of work, that's of course even nicer.

lierdakil commented 4 years ago

@mb21, the reason I'm unsure about this is that a lot of people I know use LibreOffice or Google Docs for docx. While my sample is by no means representative, this observation suggests there will be a not insignificant part of the user base which will be surprised by pandoc's behaviour either way. So it's not a question of "what could we tell people to divert blame" but of "how do we cause the least surprise" (in accordance with the principle of least astonishment).

Additionally, I believe Word's behaviour in this case is a bit counterintuitive. It seems to make sense once you think about it for a bit, but the first reaction of the uninitiated is surprise and confusion (I have some anecdotal evidence to back this up -- i.e. I've asked a few people, and they were surprised and confused unless they knew about this already)

FWIW #6504 implements some approximation of Word's behaviour. And since it's an undocumented behaviour, I can't of course guarantee if it's an exact match, and also I didn't do any extensive testing to determine boundary cases.

SarahSiani-IT98 commented 4 years ago

Hi Lierdakil. I didn't imagine causing so many problems! 😅

The problematic text comes from multiple sources: certainly from both Word (a lot) and LibreOffice (a little). Maybe even Google Docs. My file is a new LibreOffice document with text cut and pasted from many sources. I have examples where I see the correct rendering (both in Word and elsewhere) but I don't get the correct formatting in the HTML produced by pandoc.

By the way, I use docx+styles because I need to retrieve the names of the styles, but if this removes <strong>/<em> I don't know what tool to use anymore 🤷‍♀️

@lierdakil How can I test your #6504? Like this?

> git clone https://github.com/lierdakil/pandoc.git
> cd pandoc
> stack setup
> stack install

mb21 commented 4 years ago

@SarahSiani-IT98 after the git clone, I think you need to switch to the correct branch:

git checkout docx-para-run-styles

lierdakil commented 4 years ago

@SarahSiani-IT98, you need to clone a different branch, git clone -b docx-para-run-styles https://github.com/lierdakil/pandoc.git. You also might want to add --depth 1 to avoid downloading the whole history.

docx+styles doesn't apply bold/italic/etc that come from named styles. The rationale behind this is you can add corresponding CSS manually if you so desire (however, replicating the curious Word behaviour discussed here might be a little tricky). Forcing bold/italic/etc in-line would severely limit the ability to post-process the output, and post-processing was the primary motivation behind +styles IIRC.
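
As an illustration of the kind of post-processing +styles is meant to allow, here is a minimal Python sketch that collects the style names from pandoc's HTML output using only the standard library. The class name `CustomStyleCollector` is hypothetical, and the sample input is one of the paragraphs quoted earlier in this thread.

```python
from html.parser import HTMLParser


class CustomStyleCollector(HTMLParser):
    """Collect data-custom-style values in document order, without duplicates."""

    def __init__(self):
        super().__init__()
        self.styles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        for name, value in attrs:
            if name == "data-custom-style" and value not in self.styles:
                self.styles.append(value)


html = (
    '<div data-custom-style="Testo citato"><p>'
    '<span data-custom-style="Enfasi">“No one is useless in this world '
    'who lightens the burdens of another.”</span> – Charles Dickens, '
    'Doctor Marigold (1874)</p></div>'
)

collector = CustomStyleCollector()
collector.feed(html)
style_names = collector.styles
```

Each collected name could then be mapped to a CSS rule written by hand, for example an attribute selector on data-custom-style.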

SarahSiani-IT98 commented 4 years ago

git clone -b docx-para-run-styles --depth 1 https://github.com/lierdakil/pandoc.git

Thanks, I'll try it now.

The rationale behind this is you can add corresponding CSS manually if you so desire

I understand ... but this is only valid when it comes to adding a style. It does not work when it is removed!

I could verify that "negative" formatting is lost when it is already present in the style.

I have seen that this has already been mentioned in issue #6452.

Do you have any suggestions? (besides the extreme one of not using pandoc!🤦‍♀️😁)

lierdakil commented 4 years ago

this is only valid when it comes to adding a style. It does not work when it is removed!

This is actually a very valid point which I failed to consider. Indeed, inline formatting explicitly disabling bold/italic/etc doesn't get translated into output.

I've opened #6511 with some proof-of-concept code that might remedy that.
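
To make the "explicitly disabled" case concrete, here is a minimal Python sketch that reads a run like the one quoted earlier in this thread and reports which toggles are set directly on it. It is not the code from #6511. The function name `explicit_toggles` is hypothetical, and the set of values treated as "off" is an assumption.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# Assumed spellings of a boolean "off" value in w:val
OFF_VALUES = {"false", "0", "off"}


def explicit_toggles(run):
    """Map 'b'/'i' to True/False for toggles set directly on the run.

    A toggle that is absent from <w:rPr> is left out of the result,
    meaning "inherit from the styles".
    """
    result = {}
    rpr = run.find(f"{W}rPr")
    if rpr is None:
        return result
    for name in ("b", "i"):
        node = rpr.find(f"{W}{name}")
        if node is not None:
            # A bare <w:b/> without w:val is treated as "on"
            result[name] = node.get(f"{W}val", "true") not in OFF_VALUES
    return result


run_xml = f"""
<w:r xmlns:w="{W_NS}">
  <w:rPr>
    <w:rStyle w:val="Enfasi" />
    <w:b w:val="false" />
    <w:i w:val="false" />
  </w:rPr>
  <w:t xml:space="preserve"> – E. M. Forster, A Room with a View (1908)</w:t>
</w:r>
""".strip()

toggles = explicit_toggles(ET.fromstring(run_xml))
```

A run that only carries `w:rStyle` yields an empty dictionary, which is what distinguishes "not set" from "switched off". That distinction is the information lost when the explicit off values are dropped.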

SarahSiani-IT98 commented 4 years ago

Thanks @lierdakil

The PR #6504 solves the problem with "-f docx" option. With #6511 I can manage the disabled style and thus solve problems with "-f docx+styles" option.

However, there are still problems with some formatting being invented or not closed properly (see #6514).

But I believe that bug was also present in previous versions and does not depend on your changes.