vsch / flexmark-java

CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.
BSD 2-Clause "Simplified" License

Nested lists export to docx problem #329

Open dmitrymurashenkov opened 5 years ago

dmitrymurashenkov commented 5 years ago

I export nested lists to docx and open them with LibreOffice and Google Docs.

Basic case of nested ordered list:

1. P1            
2. P2            
    1. P2-1            

LibreOffice displays it as a list but does not indent the nested items (Google Docs shows the indents in this case):

1. P1            
2. P2            
1. P2-1            

More complex example:

1. P1            
2. P2            
    1. P2-1

1. P1            
2. P2            
    1. P2-1                            

In LibreOffice (Google Docs indents it correctly):

1. P1
2. P2
1. P2-1
3. P1
4. P2
1. P2-1

But if we add more levels:

1. P1
    1. P11
        1. P111

then LibreOffice starts indenting only from level 2:

1. P1
1. P11
   1. P111

But if I select all the text in LibreOffice and copy-paste it into Notepad, all levels are correctly indented.

So there may be 2 issues here:

  1. The indent in the default styles is not understood by LibreOffice in some cases.
  2. Several consecutive lists get merged into one. This seems to be more of a .docx issue than a flexmark one.
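
One rough way to check the first point is to look at which indents the exported numbering definitions actually declare. The following is only a sketch, not part of flexmark or docx4j: it assumes the export is an ordinary .docx zip with a `word/numbering.xml` part, and the class name `NumberingIndentDump` and its methods are made up for illustration. It lists the `w:left` and `w:hanging` attributes of `w:ind` for every `w:lvl` it finds, using only the JDK.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for inspecting an exported .docx; not a flexmark or docx4j API.
class NumberingIndentDump {
    // WordprocessingML namespace bound to the w: prefix in numbering.xml
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one line per w:lvl element with the w:left / w:hanging values of its w:ind, if present.
    static List<String> describeLevels(InputStream numberingXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(numberingXml);

        List<String> lines = new ArrayList<>();
        NodeList levels = doc.getElementsByTagNameNS(W, "lvl");
        for (int i = 0; i < levels.getLength(); i++) {
            Element lvl = (Element) levels.item(i);
            NodeList inds = lvl.getElementsByTagNameNS(W, "ind");
            Element ind = inds.getLength() > 0 ? (Element) inds.item(0) : null;
            lines.add(lvl.getParentNode().getLocalName()
                    + " ilvl=" + attr(lvl, "ilvl")
                    + " left=" + attr(ind, "left")
                    + " hanging=" + attr(ind, "hanging"));
        }
        return lines;
    }

    // Reads a w:-namespaced attribute, or "(none)" when the element or attribute is missing.
    static String attr(Element e, String name) {
        if (e == null) return "(none)";
        String value = e.getAttributeNS(W, name);
        return value.isEmpty() ? "(none)" : value;
    }

    // Usage: java NumberingIndentDump exported.docx
    public static void main(String[] args) throws Exception {
        try (ZipFile docx = new ZipFile(args[0])) {
            ZipEntry entry = docx.getEntry("word/numbering.xml");
            if (entry == null) {
                System.out.println("no word/numbering.xml in " + args[0]);
                return;
            }
            try (InputStream in = docx.getInputStream(entry)) {
                describeLevels(in).forEach(System.out::println);
            }
        }
    }
}
```

The values are in twentieths of a point, so `left=720` is half an inch. A level that reports `left=(none)` has no explicit indent in the numbering definition, which might be one place where Word and LibreOffice fall back to different defaults.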

Here is an example export:

https://drive.google.com/file/d/1db_Cx1gi-vOGwYhroELcTBA08FSVwnd_/view?usp=sharing

Options used:

MutableDataSet options = new MutableDataSet()
        .set(Parser.EXTENSIONS, Arrays.asList(
                TablesExtension.create(),
                TocExtension.create()))
        .set(DocxRenderer.SUPPRESS_HTML, true)
        .set(DocxRenderer.DOC_RELATIVE_URL, "file://" + imgDir.getAbsolutePath())
        .set(DocxRenderer.DOC_ROOT_URL, "file://" + imgDir.getAbsolutePath());

vsch commented 5 years ago

@dmitrymurashenkov, I suspect that there is some constraint in LibreOffice, or some setting in the numbered list style that LibreOffice expects, which is missing.

If someone has the time and knowledge of DOCX format to point out the problem then I can address it. To me, DOCX is not a format, it is a career decision. On my own, I would only venture into it under duress while wearing a gas mask and hip waders.

I only implemented docx conversion because it was sponsored. Johner Institut needed a way of converting their internal markdown documents to Word for their clients.

I was not familiar with DOCX format and can only claim some familiarity now. I came to the impression that it was a format created by a summer intern not intended as a standard to be independently implemented.

It is a PITA to use, extremely sensitive to interpretation due to lack of documentation and is a nightmare of referential dependencies.

The docx4j library does a good job of simplifying working with it, but it is by no means a cakewalk. I only test the conversion against MS Word because it is the "standard" for the DOCX format.

There are many things in DOCX I still have not figured out how to use properly, and I rely on trial and error to work out what I need. Sometimes the mistake is trivial and sometimes it requires a lot of head scratching.

The job is made much harder by the fact that DOCX is not a standard but a proprietary format. The huge, byzantine XML generated for DOCX makes visual inspection to figure out what is wrong an undertaking of hours.

I think a better solution to DOCX conversion would have been to implement a Markdown to RTF converter. RTF is a better documented standard which MS Word can open.

All that to say that there are probably a ton of differences between how Word and LibreOffice interpret a converted document. If you need Markdown to DOCX conversion then Pandoc would probably be a more stable choice.