haesleinhuepf / git-bob-playground

BSD 3-Clause "New" or "Revised" License

[Programming question]: docx2md #106

Closed haesleinhuepf closed 5 days ago

haesleinhuepf commented 6 days ago

I need a Python function that converts a docx document to Markdown and a Python function that converts a Markdown file to docx.

git-bob think about this

github-actions[bot] commented 6 days ago

This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

Hi! I'm git-bob, happy to help with document conversion. Here's a notebook showing how to convert between docx and markdown using python-docx and pypandoc:

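import jupytext

# Notebook source in py:percent format. The outer string uses ''' so that
# the docstrings inside can keep their """ quotes without ending the string.
notebook_source = '''
# Document conversion functions

import pypandoc
from docx import Document
import os

def docx_to_markdown(docx_path, output_path):
    """Convert docx to markdown"""
    pypandoc.convert_file(docx_path, 'md', outputfile=output_path)

def markdown_to_docx(markdown_path, output_path):
    """Convert markdown to docx"""
    pypandoc.convert_file(markdown_path, 'docx', outputfile=output_path)

# Example usage:
if __name__ == "__main__":
    # Convert docx to markdown
    docx_to_markdown("example.docx", "output.md")

    # Convert markdown to docx
    markdown_to_docx("example.md", "output.docx")
'''

# create notebook: jupytext.reads parses text into a notebook object
# (jupytext.writes goes the other way and expects a notebook, not a string)
notebook = jupytext.reads(notebook_source, fmt='py:percent')

# print the notebook serialized as ipynb JSON
print(jupytext.writes(notebook, fmt='ipynb'))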

This requires installing:

pip install python-docx pypandoc

You'll also need to have pandoc installed on your system. On Ubuntu/Debian:

sudo apt-get install pandoc
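
Since the conversion functions depend on a system-wide pandoc, it can help to check for it before converting. The following is a minimal sketch using only the standard library; the helper name `pandoc_available` is hypothetical, and it only checks the PATH, whereas pypandoc may locate pandoc by other means:

```python
import shutil


def pandoc_available(executable="pandoc"):
    """Return True if the given executable is found on the PATH."""
    return shutil.which(executable) is not None


if not pandoc_available():
    # Assumption: pandoc is expected on the PATH; other install locations are not checked.
    print("pandoc was not found on the PATH; install it before converting.")
```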

Let me know if you need any adjustments!

haesleinhuepf commented 6 days ago

git-bob implement this as an actual Jupyter notebook, without the jupytext stuff around it. Use the file "test.docx" as example input and save the result as "test.md".

haesleinhuepf commented 5 days ago

For the record: I implemented this without pandoc, and without reusing pre-existing libs to have a permissively licensed, python-only implementation for simple docx files: https://github.com/haesleinhuepf/docx2markdown/
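
To illustrate how a pandoc-free, Python-only conversion can work for simple files: a .docx file is a ZIP archive whose main text is stored as XML in `word/document.xml`. The sketch below relies on that and handles only headings and plain paragraphs. It is not the code from the linked repository; the function names are hypothetical, and it assumes headings use Word's default style ids ("Heading1", "Heading2", ...).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def document_xml_to_markdown(xml_text):
    """Turn the XML of word/document.xml into Markdown (headings and plain paragraphs only)."""
    root = ET.fromstring(xml_text)
    blocks = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
        if not text.strip():
            continue  # skip empty paragraphs
        style = paragraph.find(W + "pPr/" + W + "pStyle")
        style_name = style.get(W + "val", "") if style is not None else ""
        # Assumption: heading styles are named "Heading1", "Heading2", ...
        if style_name.startswith("Heading") and style_name[7:].isdigit():
            level = min(int(style_name[7:]), 6)  # Markdown has at most 6 heading levels
            text = "#" * level + " " + text
        blocks.append(text)
    return "\n\n".join(blocks) + "\n"


def docx_to_markdown(docx_path, output_path):
    """Read word/document.xml from the docx archive and write the Markdown file."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_text = archive.read("word/document.xml")
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(document_xml_to_markdown(xml_text))
```

With the file names from this thread, the call would be `docx_to_markdown("test.docx", "test.md")`. Lists, tables, images, and inline formatting such as bold or italics are ignored in this sketch.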