Open fokoenecke opened 10 years ago
I think it's an interesting idea. I believe that ReportLab has something analogous for adding content to a PDF document. There they use HTML as I recall.
It would be important for it to be fairly well separated from the main API, and I'm thinking it would be good for it to have something like a plug-in architecture since there's quite a variety of formats that folks might want. reStructuredText and HTML both come to mind in addition to Markdown.
I'm inclined to think something like this could be done with a function call as a start, something like:
append_markdown_content(document, markdown_text)
that would interpret markdown_text into Word objects and add them to document in the order they were encountered.
I don't think this will be a feature that will come soon, given we still have a good deal of core library work to do. So a separate module that implemented append_markdown_content() might be a useful adjunct for a year or two and perhaps beyond.
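To make the function-call idea and the plug-in idea above a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the registry, `register_format`, `append_content` and `RecordingDocument` are hypothetical names, and the Markdown handling is a toy that only understands `#` headings and plain paragraphs. The only python-docx calls it relies on are `Document.add_heading(text, level)` and `Document.add_paragraph(text)`.

```python
"""Toy sketch of a plug-in style content appender (all names hypothetical)."""

# Registry mapping a format name ("markdown", "rst", ...) to an appender function.
_APPENDERS = {}


def register_format(name):
    """Decorator that registers an appender function under a format name."""
    def decorator(func):
        _APPENDERS[name] = func
        return func
    return decorator


@register_format("markdown")
def append_markdown_content(document, markdown_text):
    """Add headings and paragraphs to `document` in the order encountered.

    Toy interpreter: each blank-line-separated block is either a '#' heading
    or a plain paragraph. Nothing else in Markdown is handled.
    """
    for block in markdown_text.split("\n\n"):
        block = " ".join(line.strip() for line in block.splitlines()).strip()
        if not block:
            continue
        if block.startswith("#"):
            level = len(block) - len(block.lstrip("#"))
            document.add_heading(block.lstrip("#").strip(), level=level)
        else:
            document.add_paragraph(block)


def append_content(document, text, fmt="markdown"):
    """Dispatch to whichever appender is registered for `fmt`."""
    try:
        appender = _APPENDERS[fmt]
    except KeyError:
        raise ValueError("no appender registered for format %r" % fmt)
    appender(document, text)


class RecordingDocument:
    """Stand-in exposing the two python-docx Document methods used above."""

    def __init__(self):
        self.calls = []

    def add_heading(self, text="", level=1):
        self.calls.append(("heading", level, text))

    def add_paragraph(self, text="", style=None):
        self.calls.append(("paragraph", text))
```

Passing a real `docx.Document()` instead of `RecordingDocument()` should work the same way, since only those two methods are called. Another format such as reStructuredText would be added by registering one more function, which keeps the converters separate from the main API.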
Today I took a look at Python-Markdown. It parses the Markdown source with regular expressions and converts it into HTML using Python's XML library. It also provides an Extension API for writing your own processors to handle the Markdown elements.
I think I'll give it a try and write an extension that generates OOXML instead of HTML based on your low level API.
That would be great! Even an early prototype I'm sure would flush out a lot of learning about what would be possible and how much work it might take. I'll look forward to hearing what you come up with :)
@fokoenecke I just stumbled across this and am the lead dev for Python-Markdown. If there is anything I can do to help, let me know. BTW, while it is not documented, you can swap out Python-Markdown's serializer for your own. So a library that converted ElementTree HTML objects to docx.Document objects would be where I would start. You could plug that into Python-Markdown as a serializer... or, after parsing any HTML with lxml or html5lib (both output ElementTree objects), use it to convert any HTML document (not just Markdown) to docx.
In fact, the latter is probably preferable. For example, Markdown permits raw HTML. The Python-Markdown parser runs a pre-processor that just stores the raw HTML and inserts placeholders in the document that pass through the parser. A post-processor (run after the serializer) then swaps out the placeholders and reinserts the raw HTML. For complete markdown support, you would need to run through the entire markdown parser and then pass the HTML string into an HTML parser before converting to docx. Of course, that requires getting an HTML parser working first. Swapping out Python-Markdown's serializer might be an easier place to start - even if it doesn't give you the ability to support raw HTML.
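The placeholder round trip described above can be illustrated with a small standard-library sketch. This is not Python-Markdown's actual code: the regular expression, the placeholder format and the function names `stash_raw_html` and `restore_raw_html` are all assumptions made up for the example, and the raw-HTML detection is deliberately naive (one-line blocks only).

```python
import re

# Matches a simple one-line raw HTML block such as "<div>...</div>".
# Deliberately naive; real raw-HTML detection is far more involved.
RAW_HTML = re.compile(r"^<(\w+)[^>]*>.*</\1>$", re.MULTILINE)

# Placeholders are wrapped in control characters unlikely to occur in text.
PLACEHOLDER = re.compile(r"\x02raw:(\d+)\x03")


def stash_raw_html(text):
    """Pre-processing step: replace raw HTML lines with numbered placeholders."""
    stash = []

    def _store(match):
        stash.append(match.group(0))
        return "\x02raw:%d\x03" % (len(stash) - 1)

    return RAW_HTML.sub(_store, text), stash


def restore_raw_html(rendered, stash):
    """Post-processing step: swap each placeholder back for its raw HTML."""
    return PLACEHOLDER.sub(lambda m: stash[int(m.group(1))], rendered)
```

The point for a docx converter is that the raw HTML only reappears as a string after serialization, so a custom serializer alone never sees it as elements. That is why full support would need a second pass through an HTML parser.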
Hmm, I think @waylan's idea for making the 'translator' based on HTML is an idea worth exploring seriously. There are a lot of simplified markup languages out there, but they all have HTML in common. Plus it can be a handy way to build up some rich text in any program, even if you're doing it "manually". If there were an HTML translator then any way we ended up with it would work. Great call @waylan :)
@waylan thanks for your reply! My idea was to build my own processors for Python-Markdown that generate an ElementTree of docx objects directly. But I see that it would be far more flexible to parse HTML and convert those objects to docx.
In fact, I tried that before at a higher level by rendering an HTML page with mako and converting it to docx via pypandoc. I was not satisfied with the results or with the lack of control this solution provides.
I guess I have to take a deeper look at python-docx to find the best place to insert the generated docx. This dictates whether I can use text generated by a serializer or whether I have to get the results as a branch of an ElementTree.
As soon as I make a plan I will inform both of you and ask for advice ;)
I decided to go with @waylan's plan and hand an HTML string to python-docx. That string will be parsed into an ElementTree by lxml. Then I am going to translate the different tags into python-docx function calls.
A very early prototype for this already worked, at least for my requirements.
Glad to hear it worked out @fokoenecke :)
I'll close this for now then since there's no immediate prospect of development on it. Happy if someone wants to reopen it in future.
Oh, I hope you didn't get me wrong on that. I am far from done and I'd love to contribute what I have done afterwards. I just don't know if anyone can use it, because I won't do a full-fledged HTML conversion, only the parts needed to get from Markdown to docx. I will tell you when I am satisfied and the code is tested, so you can decide whether it makes sense.
I opened a branch for it here: https://github.com/fokoenecke/python-docx/tree/html
Ah, ok, no problem, opening it back up :)
Just a little update on what I did so far: I created an API function that takes an HTML string and a docx container to add it to. The function hands the HTML to a builder class that recursively iterates through the HTML elements and tries to find a fitting OOXML object to append to the given container. This works by mapping the HTML tags to various dispatcher objects.
So far my code can just about handle the HTML I get from the Markdown converter. It's far from covering full HTML, but it does all the things I need it to. I can append small Markdown snippets to the document and to paragraphs pretty well.
If no one else is interested in this, it's not likely that I will change much more on it. I will fix problems that arise from unlikely markdown combinations and add missing features as they appear in python-docx.
@fokoenecke: Any progress on this? I still have a use case for this.
@tooh I pulled the functionality out of python-docx and implemented some improvements. Unfortunately it is still far from stable, but it works for my use cases. If you are interested in it, I could upload my current code. The basic idea behind it did not change from what you can see in the fork posted above.
@fokoenecke I would appreciate it if you could upload your current code. I think your solution could work for me as well.
@tooh I pushed my current code to https://github.com/fokoenecke/html_docx. I am using it in a project I'm working on at the moment. Unfortunately the priorities shifted and that part is not as important right now. But I guess I will come back to it next year.
What it does: It takes an element tree representing the HTML structure of your Markdown (I am using https://github.com/waylan/Python-Markdown to convert) and creates the corresponding docx elements inside a paragraph. Obviously it uses python-docx to do that.
paragraph = document.add_paragraph()
html_tree = markdown(markdown_string)
add_html(paragraph, html_tree)
How it works:
It's basically a module you put inside your sources, importing the interface method `add_html`. This method takes your container element (paragraph) and recursively steps through the HTML element tree. This happens in `_append_docx_elements` inside the `DocxBuilder` class. It maps the HTML tag to a dispatcher class from the tag_dispatchers directory, hands over the tag content, recursively calls its children and hands over the tag tail after doing so.
The dispatcher classes call the python-docx functions to create the appropriate objects. They share some helper methods which figure out which kind of container we are in and provide fitting paragraph objects.
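The traversal described above (tag content first, then children, then the tail) can be sketched roughly as follows. This is not the code from html_docx: `INLINE_FORMATTERS`, `append_inline` and the `Fake*` classes are hypothetical names, a plain function table stands in for the dispatcher classes, and only inline bold/italic tags are handled. The python-docx calls it assumes are `paragraph.add_run(text)` and the `bold` / `italic` attributes on the returned run.

```python
import xml.etree.ElementTree as ET


def _bold(run):
    run.bold = True


def _italic(run):
    run.italic = True


# Hypothetical tag-to-formatter table, standing in for the dispatcher classes.
INLINE_FORMATTERS = {"strong": _bold, "b": _bold, "em": _italic, "i": _italic}


def _add_run(paragraph, text, formatters):
    """Add one run and apply every formatter that is currently active."""
    run = paragraph.add_run(text)
    for apply_format in formatters:
        apply_format(run)


def append_inline(paragraph, element, active=()):
    """Recursively add runs for an element's text, its children and their tails."""
    formatter = INLINE_FORMATTERS.get(element.tag)
    if formatter is not None:
        active = active + (formatter,)
    if element.text:
        _add_run(paragraph, element.text, active)
    for child in element:
        append_inline(paragraph, child, active)
        # The tail is the text after the child's closing tag, so it takes
        # this element's formatting, not the child's.
        if child.tail:
            _add_run(paragraph, child.tail, active)


class FakeRun:
    """Stand-in for a python-docx run, used so the sketch runs without it."""

    def __init__(self, text):
        self.text, self.bold, self.italic = text, None, None


class FakeParagraph:
    """Stand-in for a python-docx paragraph exposing only add_run()."""

    def __init__(self):
        self.runs = []

    def add_run(self, text=None, style=None):
        run = FakeRun(text)
        self.runs.append(run)
        return run


paragraph = FakeParagraph()
root = ET.fromstring("<p>plain <strong>bold <em>both</em></strong> tail</p>")
append_inline(paragraph, root)
```

Handling the tail in the parent's loop is the detail that is easy to get wrong with ElementTree: text that follows a closing tag is stored on the child element, even though it logically belongs to the parent.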
Like I said, this won't be able to handle all kinds of HTML input, but the basic output of Markdown maps pretty well to docx.
@fokoenecke Thx for uploading. I have a basic sample running now. Headings are created correctly from # and ## and so on. I had to change Heading to Kop in heading.py, because I run a Dutch version of Word. This is the same problem I had to fix in python-docx. However, I have a problem with bullets. The code generates list items and applies the style BulletList. I assume I have a localization problem here as well. I have not figured out how to handle this.
Example code
import markdown
from docx import Document
# "html" here is the local html_docx module folder, not the standard library package
from html import add_html
document = Document("template.docx")
markdown_string = "# heading 1 \n## heading 2 \n - bullet \n - bullet"
paragraph = document.add_paragraph()
html_tree = markdown.markdown(markdown_string)
print(html_tree)
add_html(paragraph, html_tree)
document.save('demo.docx')
The example code works, but in my real project I encounter the following error when calling it like this:
html = markdown(story['detail'])
add_html(paragraph_detail, html)
File "/Scripts/html/tag_dispatchers/__init__.py", line 41, in get_new_paragraph
new_paragraph = container._parent.add_paragraph()
AttributeError: '_Cell' object has no attribute 'add_paragraph'
paragraph_detail is a paragraph within a table cell.
@fokoenecke Would appreciate it if you can give me a hint.
I agree that these style problems originate from localization. It depends on your template.docx and the styles saved in it. If there is no `ListBullet` or `ListNumbers`, the styles will break. It took me a while to get my templates and styles working, but the docs helped a lot; it's more of an MS Word problem. I suggest trying the default template included in python-docx. If it works there, it's a style-naming problem.
The `'_Cell' object has no attribute 'add_paragraph'` error could be a bug in my code, but it's more likely that you haven't installed the latest python-docx version. `_Cell.add_paragraph` was implemented in 0.7.4 (July). I suggest updating the package (`pip install -U python-docx`).
I hope this helps! :)
@fokoenecke Wow, spot on. Normally I install updates when they are released, but I missed this one. Updating solved the '_Cell' object has no attribute 'add_paragraph' error. Thx.
Next to solve the styling problem. To be continued.
Can't say I completely understand why the bullets are working now as well. What I did:
_list_style = dict(
ol='ListNumber',
ul='Bulletedlist',
)
and I changed
style = _list_style.get(element.getparent().tag, 'Bulletedlist')
I leave it up to @scanny if this item should be closed or not. For my use case I'm happy now.
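Since the style names depend on the template and on the Word localization, one option is to pick the first name that actually exists instead of hard-coding a single one. The sketch below is only an illustration of that idea: `pick_style` and `LIST_STYLE_CANDIDATES` are hypothetical names, and the candidate lists contain just the names mentioned in this thread plus the `List Bullet` / `List Number` names from the python-docx default template, so they are not complete.

```python
def pick_style(available_names, candidates, fallback=None):
    """Return the first candidate present in `available_names`, else `fallback`."""
    available = set(available_names)
    for name in candidates:
        if name in available:
            return name
    return fallback


# Candidate style names per HTML list tag. Illustrative only: extend these
# with whatever names your own template really contains.
LIST_STYLE_CANDIDATES = {
    "ul": ["ListBullet", "List Bullet", "Bulletedlist"],
    "ol": ["ListNumber", "List Number"],
}
```

With python-docx, the available names could be collected from the loaded template with `[s.name for s in document.styles]` and passed in as `available_names`.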
No idea if its code can be useful for this issue, but I found this project:
Any advances on this? My use case:
What would you suggest for this last step? I looked into @fokoenecke's html_docx repo, but it looks pretty focused on a particular configuration and difficult to reuse. There's a comment by @scanny about an rst converter with a "Service class that knows how to render a RestructuredText string to a python-docx Document object". So I wonder if that could be an easier way to go (i.e., learning some reStructuredText syntax, which is new to me, and changing my python-docx-embedded code to generate rst strings instead of HTML strings).
@scanny would you mind sharing a simple example of how to use your class to inject some rst content into an existing .docx document?
Thanks @abubelinha
Hi! Thanks for this great library. After looking at some of the XML docs, I really see the pain in creating it ;)
I am developing a web application that takes user input from a form and generates a docx file from it. Among other things, some fields are formatted in Markdown. I am planning to take the Markdown fields, convert them to XML (with pandoc or Python-Markdown) and put the result into the document via your low-level API.
Is there a better or easier way to do this, or are there any plans for implementing Markdown directly in python-docx?
Greetings