python-openxml / python-docx

Create and modify Word documents with Python
MIT License
4.64k stars 1.13k forks source link

add_paragraph() performance drops as document length increases #408

Open ConnorKrammer opened 7 years ago

ConnorKrammer commented 7 years ago

Recently, I used python-docx to combine several hundred blog posts into a single Word document for editing. While doing this, I was surprised to note that with every paragraph added using Document.add_paragraph, the time required to add the next paragraph increased.

Fortunately, python-docx offers an alternate method of adding paragraphs: Paragraph.insert_paragraph_before. I used this to compare performance, and indeed, the difference is huge. I suspect that somewhere in the code for add_paragraph, every single paragraph is being needlessly iterated over—this has the whiff of Shlemiel the painter’s algorithm at work.

Here's a test program I wrote in Python 3.4.2 and the resulting performance numbers:

from docx import Document
from timeit import default_timer as timer

def p_add(doc, leader):
    doc.add_paragraph()

def p_insert(doc, leader):
    leader.insert_paragraph_before()

def time_func(func, iterations, steps):
    doc = Document()
    leader = doc.add_paragraph()

    for i in range(steps):
        start = timer()
        for n in range(round(iterations / steps)):
            func(doc, leader)
        yield timer() - start

def print_times(func, iterations, steps):
    total = 0
    for step, time in enumerate(time_func(func, iterations, steps)):
        total += time
        print('  {0}: {1}'.format(step, time))
    print('  ---------------------')
    print(' ', total)

# -----------------------------------------

NUM_PARAGRAPHS = 50000
STEPS = 10

print('Paragraph count:', NUM_PARAGRAPHS)
print('Step count:', STEPS)

print('\nAdd method:')
print_times(p_add, NUM_PARAGRAPHS, STEPS)

print('\nInsert method:')
print_times(p_insert, NUM_PARAGRAPHS, STEPS)

Output:

Paragraph count: 50000
Step count: 10

Add method:
  0: 0.347790239300581
  1: 0.6364703527799802
  2: 0.9327317248817353
  3: 1.1597683542322563
  4: 1.422699215445788
  5: 1.6892252730288
  6: 1.871260153318639
  7: 2.1984375628100956
  8: 2.4622134469837906
  9: 2.8904271214737616
  ---------------------
  15.611023444255427

Insert method:
  0: 0.07433165782728324
  1: 0.07363759291617633
  2: 0.07403786694993819
  3: 0.07365683686010804
  4: 0.07468232525002172
  5: 0.07722509170808323
  6: 0.07526477528632824
  7: 0.07458995431915483
  8: 0.07346354213440165
  9: 0.07398697740932292
  ---------------------
  0.7448766206608184
scanny commented 7 years ago

@ConnorKrammer The code for .add_paragraph() does indeed iterate over the elements as it uses a general-purpose facility for adding XML elements in the order prescribed by the XML Schema.

This could be overridden with a more clever method on docx.oxml.document.CT_Body having the name .add_p().

Part of the challenge is that the new <w:p> element can't simply be appended, it must appear before an optional w:sectPr element at the end of the w:body element. But I expect some clever XPath work could improve the timings markedly. I'll be willing to consider a pull request to that end.