PDF Converter loses formatting

GoogleCodeExporter commented 8 years ago

The problem is described below, but firstly can I request that you:

  1. Provide a list of the PDF converter improvements that you mention 
     in Issue 159 ("I suggest you to use 1.0.0-SNAPSHOT (use maven for 
     that or download at hand on the for docx converter because the 
     converter is very very lot improved (I'm improving again). The 
     docx converter 0.9.8 is very bad.").   

  2. Indicate when 1.0.0-SNAPSHOT will be released - presumably as 
     version 1.0.0.

This would be very useful info for me in deciding whether to use xdocreport 
going forwards.

What steps will reproduce the problem?

  1. Run FormattingTests.docx through the PDF converter code (e.g. see 
     the attached modified Java JUnit test and associated docx file).

  2. Observe the output in the PDF conversion (see attached pdf file).

What is the expected output?

. It is expected that the pdf formatting matches the docx exactly. The 
following is an analysis of the differences. Note that in addition to these, 
header and footer formatting did not work at all well.

  . Tables: 
    . Row height of less than 1cm is converted to 1cm. 
    . A table which is not of full page width will be centred in 
      the page. 
    . Coloured table borders are converted to black. 

  . Free text: 
    . The number of characters per line appears to have increased
      between docx and pdf. The font size produced in the PDF appears 
      to be slightly larger than the source. This is difficult to 
      determine and requires further analysis to confirm what the 
      exact nature of the difference is. 

 . Font/style: 
    . The PDF and DOCX rendering of the different fonts and sizes
      differs slightly (as mentioned in the free text section). 
    . Header 3 styling appears to be too small. 
    . Strikethrough appears as normal text. 
    . Subscript appears as normal text. 
    . Superscript appears as normal text. 
    . Highlighting is lost. 

 . Bullets: 
    . Microsoft Word bullets are lost 
    . Microsoft Word numbering is lost 
    . Microsoft Word multilevel lists lose numbering and indentation 
      beyond the first item. 

    Note however that all of these bullet representations can be
    reproduced as non-Microsoft bullets using normal text and will 
    survive the pdf translation (see example in attached files).

 . Tabs: 
    . Tabs within text are lost. 

 . Images: 
    . Text alongside an image results in both the text and the image
      being misplaced. 
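
To check the row-height item in the Tables list above against the raw document, it helps to convert units. This is a minimal sketch, assuming the docx stores row height in twips (twentieths of a point) in `w:trHeight`; the class and method names are hypothetical and are not part of XDocReport or Apache POI.

```java
// Hypothetical helper for unit checks; not part of XDocReport or Apache POI.
final class RowHeightUnits {
    // 1 inch = 1440 twips = 72 points = 2.54 cm.
    static final double TWIPS_PER_POINT = 20.0;
    static final double TWIPS_PER_CM = 1440.0 / 2.54;

    /** Converts a height in twips to PDF points. */
    static double twipsToPoints(long twips) {
        return twips / TWIPS_PER_POINT;
    }

    /** Converts a height in twips to centimetres. */
    static double twipsToCm(long twips) {
        return twips / TWIPS_PER_CM;
    }

    /** Converts centimetres to the nearest whole number of twips. */
    static long cmToTwips(double cm) {
        return Math.round(cm * TWIPS_PER_CM);
    }

    public static void main(String[] args) {
        // A 0.5 cm row, as an example of a height below the 1 cm floor reported above.
        long halfCm = cmToTwips(0.5);
        System.out.println("0.5 cm = " + halfCm + " twips = " + twipsToPoints(halfCm) + " pt");
    }
}
```

Under that assumption, a 1 cm row is 567 twips, so any smaller `w:trHeight` value in the source docx is a row that the 0.9.8 output enlarges.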

What do you see instead?

. See above.

What version of the product are you using? 

. XDocReport 0.9.8

On what operating system?

. Windows XP.

Please provide any additional information below.

Original issue reported on code.google.com by Mr.M.McM...@googlemail.com on 17 Oct 2012 at 1:27

Attachments:

GoogleCodeExporter commented 8 years ago

Hi, 

First, many thanks for your docx. I have added it to our JUnit docx->pdf 
converter tests: 
http://code.google.com/p/xdocreport/source/browse/#git%2Fthirdparties-extension%2Forg.apache.poi.xwpf.converter.pdf%2Fsrc%2Ftest%2Fresources%2Forg%2Fapache%2Fpoi%2Fxwpf%2Fconverter%2Fcore

To answer your two questions: 

> 1. Provide a list of the PDF converter improvements that you mention 

See the issues at http://code.google.com/p/xdocreport/issues/list whose titles 
start with docx->converter.

See also the link above, which lists the docx files we use to test our 
converter. There are still some problems, but the results are starting to look 
good.

>  2. Indicate when 1.0.0-SNAPSHOT will be released - presumably as version 
>     1.0.0.

I don't know. I would like to finish improving the docx->pdf converter so that 
it handles the common cases (shapes, for instance, will not be supported in 
1.0.0), but developing a docx->pdf converter is a very hard task for me (I'm 
not an expert in iText or docx). We develop XDocReport in our spare time, so I 
cannot tell you when it will be released (I hope we will be able to do this 
release within one month, but I cannot promise that).

If you want to test the 1.0.0 docx->pdf converter, you can: 

1) Test it with our live demo at 
http://xdocreport-converter.opensagres.cloudbees.net/
2) Get the sources from Git and build it yourself.
3) Use Maven to get the docx->pdf converter from Maven Central with version 
1.0.0-SNAPSHOT.

>  . Tables: 
>    . Row height of less than 1cm is converted to 1cm. 
Fixed. Try it with the live demo.
>    . A table which is not of full page width will be centred in 
>      the page. 
Fixed. Try it with the live demo.
>    . Coloured table borders are converted to black. 
Fixed. Try it with the live demo.

Table rendering is much improved, but there is now a problem with inside 
borders, which are doubled. I must handle that by implementing the algorithm 
that resolves conflicts between adjacent borders.
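
The doubled inside borders appear when two adjacent cells each draw their own border on the edge they share. One way to avoid this is to pick a single winning border per shared edge. The following is only a sketch of that idea under a simplified rule that is an assumption here (wider border wins, then the darker colour, then the first cell); it is not XDocReport's implementation and not the full OOXML conflict rule, and every name in it is hypothetical.

```java
// Hypothetical sketch of adjacent-border conflict resolution; not XDocReport code.
final class BorderConflict {

    /** Hypothetical border description: width in points, colour as 0xRRGGBB. */
    static final class Border {
        final double widthPt;
        final int rgb;

        Border(double widthPt, int rgb) {
            this.widthPt = widthPt;
            this.rgb = rgb;
        }
    }

    /** Sum of the three colour channels; a lower value means a darker colour. */
    static int brightness(int rgb) {
        return ((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF);
    }

    /**
     * Picks the single border to draw on an edge shared by two cells.
     * Either argument may be null when that cell defines no border on the edge.
     */
    static Border resolve(Border first, Border second) {
        if (first == null) {
            return second;
        }
        if (second == null) {
            return first;
        }
        if (first.widthPt != second.widthPt) {
            return first.widthPt > second.widthPt ? first : second;
        }
        // Same width: prefer the darker colour; on a full tie keep the first cell's border.
        return brightness(second.rgb) < brightness(first.rgb) ? second : first;
    }

    public static void main(String[] args) {
        Border thinRed = new Border(0.5, 0xFF0000);
        Border thickBlue = new Border(1.5, 0x0000FF);
        Border winner = resolve(thinRed, thickBlue);
        System.out.println("Draw a " + winner.widthPt + " pt border on the shared edge");
    }
}
```

With a rule like this, the renderer would draw the resolved border once per shared edge and suppress the matching border on the neighbouring cell.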

>  . Free text: 
>    . The number of characters per line appears to have increased
>      between docx and pdf. The font size produced in the PDF appears 
>      to be slightly larger than the source. This is difficult to 
>      determine and requires further analysis to confirm what the 
>      exact nature of the difference is. 
It's fixed. The problem was that the default font (Calibri) was not retrieved. 
It works in my local JUnit tests but not in the live demo, and I don't know 
why.

> . Font/style: 
>    . The PDF and DOCX rendering of the different fonts and sizes
>      differs slightly (as mentioned in the free text section). 
>    . Header 3 styling appears to be too small. 
It's the same problem as above: the default font, Calibri, was not applied.
>    . Strikethrough appears as normal text. 
OK, I have created issue http://code.google.com/p/xdocreport/issues/detail?id=169
>    . Subscript appears as normal text. 
OK, I have created issue http://code.google.com/p/xdocreport/issues/detail?id=170
>    . Superscript appears as normal text. 
OK, I have created issue http://code.google.com/p/xdocreport/issues/detail?id=171
>    . Highlighting is lost. 
OK, I have created issue http://code.google.com/p/xdocreport/issues/detail?id=172

> . Bullets: 
>    . Microsoft Word bullets are lost 
>    . Microsoft Word numbering is lost 
>    . Microsoft Word multilevel lists lose numbering and indentation 
>      beyond the first item. 

Bulleted and numbered lists are not handled yet. See 
http://code.google.com/p/xdocreport/issues/detail?id=151
>    Note however that all of these bullet representations can be
>    reproduced as non-Microsoft bullets using normal text and will 
>    survive the pdf translation (see example in attached files).

> . Tabs: 
>    . Tabs within text are lost. 
Tabs are very complex to handle. I have started working on them. See 
http://code.google.com/p/xdocreport/issues/detail?id=164

> . Images: 
>    . Text alongside an image results in both the text and the image
>      being misplaced. 
Image support is basic for the moment, but it seems that 1.0.0 partly resolves 
your problem (only a problem with the space before the image remains).

Don't hesitate to create more issues and attach docx samples as you have done 
here. I will add them to our JUnit docx tests. The more docx samples we have to 
convert, the more the converter will improve.

Many thanks.

Regards Angelo

Original comment by angelo.z...@gmail.com on 18 Oct 2012 at 9:39

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

Hi Angelo

Thanks for the response. That's excellent news.

The document I provided was generated from MS Word. Have you performed any 
testing with docx produced via other technologies such as LibreOffice?

Do you know if docx from other technologies also currently exhibit the same 
formatting issues?

Thanks again.

Mike

Original comment by Mr.M.McM...@googlemail.com on 22 Oct 2012 at 3:17

GoogleCodeExporter commented 8 years ago

Hi Mike,

> The document I provided was generated from MS Word. Have you performed any 
> testing with docx produced via other technologies such as LibreOffice?

A little, but I have not done extensive testing. I think there should be no 
problem with LibreOffice, because it should follow the OOXML specification.

> Do you know if docx from other technologies also currently exhibit the same 
> formatting issues?

I'm not an expert on that, but I think there should be no problem if 
LibreOffice follows the OOXML specification. 

You are welcome to test that :)

Regards Angelo

Original comment by angelo.z...@gmail.com on 22 Oct 2012 at 3:24

GoogleCodeExporter commented 8 years ago

Hi Mike,

For your information, I have improved image positioning (although it still 
needs more work). With your docx sample, the images are now well positioned.

The next step is to improve table borders (to avoid doubled borders).

Regards Angelo

Original comment by angelo.z...@gmail.com on 31 Oct 2012 at 5:02

GoogleCodeExporter commented 8 years ago

Hi Angelo

Thanks for the update. Do you have a release date in mind?

Original comment by Mr.M.McM...@googlemail.com on 5 Nov 2012 at 7:09

GoogleCodeExporter commented 8 years ago

Hi Mike,

Before doing the release, I would like to handle: 

 * hyperlinks
 * table borders
 * bulleted/numbered lists

I hope to finish this work and do the release this month.

Regards Angelo

Original comment by angelo.z...@gmail.com on 6 Nov 2012 at 8:37

GoogleCodeExporter commented 8 years ago

Hello Angelo,

Have you managed to find a solution for the bullets without changing them to 
normal text?

Thanks,
Bilel

Original comment by oueslatibilel216@gmail.com on 8 Aug 2013 at 6:21

GoogleCodeExporter commented 8 years ago

Hi Bilel,

XDocReport 1.0.3 (not yet released) greatly improves font handling (line 
height, symbol fonts, Asian fonts, etc.). Tell me if it works for you.

Regards Angelo

Original comment by angelo.z...@gmail.com on 8 Aug 2013 at 7:13

GoogleCodeExporter commented 8 years ago

Hi Angelo,

Has the issue reported above been resolved? I am using version 1.0.2 of 
xdocreport.

Original comment by surabhi....@gmail.com on 10 Mar 2015 at 6:26

GoogleCodeExporter commented 8 years ago

Many of the issues have been fixed, but not all of them. I currently have no 
time to support this converter.

Note that the latest version is 1.0.5.

Original comment by angelo.z...@gmail.com on 10 Mar 2015 at 8:00

GoogleCodeExporter commented 8 years ago

Hi Angelo, I have used the latest version, 1.0.5. I find that the bullet issue 
has still not been resolved, and the formatting of the generated PDF is not the 
same as the source document when it is generated using xdocreport.

Original comment by surabhi....@gmail.com on 13 Mar 2015 at 6:57

GoogleCodeExporter commented 8 years ago

> i find that bullet issue has not been resolved 

If I remember correctly, it's a font problem: the font must be installed on the 
computer that converts the docx to PDF.
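
A quick way to check this on the converting machine is to list the fonts the JVM can see. This is a minimal sketch using only the JDK. The font names are assumptions (Calibri is the default font mentioned earlier in this thread; Symbol and Wingdings are the fonts Word commonly uses for bullets), the class name is hypothetical, and the fonts visible to AWT may differ from those the converter's PDF library actually finds.

```java
import java.awt.GraphicsEnvironment;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical diagnostic; not part of XDocReport.
final class FontCheck {

    /** Returns true if the family name appears in the installed list, ignoring case. */
    static boolean isAvailable(String family, Collection<String> installed) {
        for (String name : installed) {
            if (name.equalsIgnoreCase(family)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Font families visible to this JVM on this machine.
        String[] families = GraphicsEnvironment
                .getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();
        Collection<String> installed = Arrays.asList(families);

        // Assumed font names; replace with the fonts your docx actually uses.
        for (String family : new String[] {"Calibri", "Symbol", "Wingdings"}) {
            System.out.println(family + ": "
                    + (isAvailable(family, installed) ? "installed" : "MISSING"));
        }
    }
}
```

If a font is reported as missing on the server but present on the machine where the docx was written, that would be consistent with bullets or default text rendering differently in the PDF.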

> also the formatting of generated pdf is not same as source document when it 
> is being generated using xdocreport. 

Yes, I know it's not perfect, but I currently have no time to work on this 
topic. Any contributions are welcome!

Original comment by angelo.z...@gmail.com on 13 Mar 2015 at 8:28

tianchiing / xdocreport

PDF Converter loses formatting #168