kh-fataoui / xdocreport

Automatically exported from code.google.com/p/xdocreport

Issues and Questions with HTML to DOCX Conversion #439

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I am having three issues/questions with converting HTML to DOCX. I have 
attached files showing the issues.

(1) As you can see from the attached in.docx and out.docx files, I am merging 
a bunch of HTML fields into a DOCX file. The first one has an ordered list, 
then an unordered list, then an ordered list, then an unordered list, etc. 
(This also shows the issue raised in the other ticket about ordered lists not 
restarting from 1, but that is not the issue I am raising here.)

The first ordered list does NOT show any numbering at all. Any idea why? The 
first unordered list DOES show numbering instead of bullets. Any idea why? 
Then the second unordered list shows very strange indicators: a pipe, a 
box and a bullet. Again, why?

(2) In my Java code (HTMLtoDOCXTest), why don't I need to include a line like:

metadata.addFieldAsList("projects.Text");

It seems to work fine either way.

(3) In my Java code (HTMLtoDOCXTest), I have included this line so that the 
HTML field called "RichTextField" is treated as HTML:

metadata.addFieldAsTextStyling("RichTextField",SyntaxKind.Html);

What I can't understand is that I am bringing in three different objects/maps 
of information, each of which has its own HTML field. But the names of those 
HTML fields do NOT match the name "RichTextField". Instead, those fields are 
called "Project_RichTextField", "Data_RichTextField" and "StuffRichTextField" 
(with no underscore). Why are all 3 of those fields treated as HTML when the 
field name does not match? My other non-HTML fields, such as "Text", are 
not treated as HTML.

Thanks,

Mark

Original issue reported on code.google.com by mark.sal...@highq.com on 1 Sep 2014 at 12:52

GoogleCodeExporter commented 9 years ago
> The first ordered list does NOT show any numbering at all.  Any idea why?  
> The first unordered list DOES show with numbering instead of bullets. Any idea 
> why?  Then the second unordered list shows with very strange indicators:  a 
> pipe, a box and a bullet.  Again, why?

I developed this feature a long time ago, so I cannot give you a good answer 
today. I need to find time to study it.

> (2) In my java code (HTMLtoDOCXTest), why don't I need to include a line like:

> metadata.addFieldAsList("projects.Text");

metadata.addFieldAsList is used with tables to generate a lazy loop (#foreach 
before the row and #end after the row).

If you write #foreach yourself inside your docx (as you have done), you do not 
need metadata.addFieldAsList.
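
To picture what "lazy loop" means here, the following is only a rough sketch of the idea, not XDocReport's actual implementation. The class LazyLoopSketch, the method wrapRow and the "item_" variable prefix are all hypothetical names chosen for illustration. It wraps the text of one table row in #foreach / #end for the list named before the dot in the field name:

```java
// Hypothetical illustration only: this is NOT XDocReport code.
class LazyLoopSketch {

    // Wraps a row's text in a Velocity-style loop over the list named by the
    // part of fieldName before the first dot. The "item_" prefix is invented.
    static String wrapRow(String row, String fieldName) {
        int dot = fieldName.indexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("expected a name like list.field");
        }
        String list = fieldName.substring(0, dot);   // e.g. "projects"
        String item = "item_" + list;                // loop variable (made up)
        // Point the field reference at the loop variable instead of the list.
        String body = row.replace("$" + fieldName,
                "$" + item + fieldName.substring(dot));
        return "#foreach($" + item + " in $" + list + ")" + body + "#end";
    }

    public static void main(String[] args) {
        System.out.println(wrapRow("<row>$projects.Text</row>", "projects.Text"));
    }
}
```

When the template already contains its own #foreach, no such wrapping is needed, which is consistent with the code working without the addFieldAsList call.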

> Why are all 3 of those fields treated as HTML when the field name does not 
> match? 

Field matching should perhaps be improved. Today, if I remember correctly, I use 
String#indexOf("RichTextField") != -1 to decide whether a field is text styling.
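
As a minimal sketch of that kind of check (the class and method names below are hypothetical, and this is not the actual XDocReport source), a substring test with indexOf accepts any field name that contains the registered name anywhere inside it:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of substring-based field matching.
class FieldMatchSketch {

    // True when templateField contains registeredName anywhere inside it,
    // mirroring an indexOf(...) != -1 test.
    static boolean isTextStyling(String templateField, String registeredName) {
        return templateField.indexOf(registeredName) != -1;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList(
                "Project_RichTextField", "Data_RichTextField",
                "StuffRichTextField", "Text");
        for (String field : fields) {
            System.out.println(field + " -> " + isTextStyling(field, "RichTextField"));
        }
    }
}
```

Under this assumption the three *RichTextField names all match, while "Text" does not, which agrees with the behavior reported in the question.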

Original comment by angelo.z...@gmail.com on 1 Sep 2014 at 2:01

GoogleCodeExporter commented 9 years ago
Thanks for your prompt response. I actually LIKE the fact that you are using 
indexOf, which may help me greatly, so don't fix that!

I now see why I don't need to reference that field as a list.

If you can look into the two bugs for ordered and unordered lists (this one and 
the one I filed yesterday), that would be great.  Likely requires just some 
tweaks to the Word XML.

Original comment by mark.sal...@highq.com on 1 Sep 2014 at 2:03

GoogleCodeExporter commented 9 years ago
> If you can look into the two bugs for ordered and unordered lists (this one 
> and the one I filed yesterday), that would be great.

I don't know when I will be able to do that. I am very busy today.

Original comment by angelo.z...@gmail.com on 1 Sep 2014 at 2:05

GoogleCodeExporter commented 9 years ago
Of course. I didn't mean to imply today. Whenever you can.

Original comment by mark.sal...@highq.com on 1 Sep 2014 at 2:19

GoogleCodeExporter commented 9 years ago
Let me know if this is helpful:  
http://msdn.microsoft.com/en-us/library/office/ee922775(v=office.14).aspx#odc_Office14_ta_WorkingWithNumbering_LvlRestart

It discusses how to restart numbering.

Also:  http://openxmldeveloper.org/discussions/formats/f/13/p/6322/160904.aspx
http://openxmldeveloper.org/discussions/formats/f/13/p/754/1876.aspx

I hope this is helpful.

Original comment by mark.sal...@highq.com on 2 Sep 2014 at 3:22

GoogleCodeExporter commented 9 years ago
I think I figured out why the first list does not show any numbering.  Here is 
what the document.xml for that part looks like:

<w:pPr>
    <w:numPr>
        <w:ilvl w:val="0" />
        <w:numId w:val="0" />
    </w:numPr>
</w:pPr>

The key part is:  <w:numId w:val="0" />.  The @w:val should never be below 1. 
It needs to start at 1.  

If you go here:  
http://msdn.microsoft.com/en-us/library/office/ee922775%28v=office.14%29.aspx, 
you will note this comment:

The w:numId can contain a value of 0, which is a special value that indicates 
that numbering was removed at this level of the style hierarchy. While 
processing this markup, if the w:val='0', the paragraph does not have a list 
item.

When I compare that with a manually created Word document with multiple ordered 
lists, the first one starts with @w:val="1". That also means in the 
numbering.xml file, the first one of these:

<w:num w:numId="1"><w:abstractNumId w:val="###"/></w:num>

should have w:numId="1".

Hopefully this is a simple fix.

That's why the numbering does not start until the second list, which has a 
@w:val="1".
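
To check a generated document.xml for this condition, something like the sketch below could be used. It is only an illustration using the JDK's built-in DOM parser: the class NumIdCheckSketch, its methods and the SAMPLE fragment are hypothetical and not part of XDocReport. It collects the w:val of every w:numId element and counts those equal to 0.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for inspecting numbering references in document.xml.
class NumIdCheckSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Made-up fragment: the first list item uses numId 0, the second numId 1.
    static final String SAMPLE =
            "<w:body xmlns:w=\"" + W_NS + "\">"
            + "<w:p><w:pPr><w:numPr><w:ilvl w:val=\"0\"/>"
            + "<w:numId w:val=\"0\"/></w:numPr></w:pPr></w:p>"
            + "<w:p><w:pPr><w:numPr><w:ilvl w:val=\"0\"/>"
            + "<w:numId w:val=\"1\"/></w:numPr></w:pPr></w:p>"
            + "</w:body>";

    // Returns the w:val of every w:numId element, in document order.
    static List<String> numIdValues(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(W_NS, "numId");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(((Element) nodes.item(i)).getAttributeNS(W_NS, "val"));
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Counts numbering references whose w:val is 0 (numbering removed).
    static int countZeroNumIds(String xml) {
        int count = 0;
        for (String value : numIdValues(xml)) {
            if ("0".equals(value)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("numId values: " + numIdValues(SAMPLE));
        System.out.println("numId=0 count: " + countZeroNumIds(SAMPLE));
    }
}
```

A non-zero count on real output would point at list paragraphs that Word renders without any numbering.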

Original comment by mark.sal...@highq.com on 5 Sep 2014 at 5:50