sergiocorreia / panflute

A Pythonic alternative to John MacFarlane's pandocfilters, with extra helper functions
http://scorreia.com/software/panflute/
BSD 3-Clause "New" or "Revised" License
493 stars 60 forks

Navigation with next and prev unstable #223

Closed erbsenmann closed 1 year ago

erbsenmann commented 1 year ago

I ran into endless loops when using next and prev on elements while running filters on my document. For background, I am trying to merge several blocks that have the same style in a Word export, since pandoc creates a paragraph for each line.
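For illustration only, the merging step can be sketched in a single pass over a list, without navigating via next and prev. The (style, text) tuples and the merge_adjacent helper are hypothetical stand-ins for this sketch, not panflute objects or API:

```python
from itertools import groupby


def merge_adjacent(blocks):
    """Merge runs of consecutive (style, text) pairs that share a style.

    `blocks` is a hypothetical list of (style, text) tuples standing in
    for the exported paragraphs; it is not a panflute structure.
    """
    merged = []
    # groupby only groups *consecutive* items with the same key, so
    # paragraphs separated by a different style are not merged.
    for style, group in groupby(blocks, key=lambda block: block[0]):
        merged.append((style, " ".join(text for _, text in group)))
    return merged


blocks = [
    ("Code", "line 1"),
    ("Code", "line 2"),
    ("Body", "para"),
    ("Code", "line 3"),
]
result = merge_adjacent(blocks)
```

Because the pass walks the list by position, it never needs to ask an element for its own index.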

Looking closer, I discovered that the index values of some elements I used next on were out of sequence. The Element.index property uses container.index to determine the element's position in its parent. Unfortunately, that returns the position of the first matching element, so if two elements with exactly the same content appear in the same container, the later one gets the wrong index value.
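The effect can be reproduced in plain Python, assuming the elements compare equal by content (which the failing assertion further down suggests). The Para class here is a hypothetical stand-in written for this sketch, not panflute's own class:

```python
class Para:
    """Hypothetical stand-in for an element that compares by content."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return isinstance(other, Para) and self.text == other.text


blocks = [Para("Hello"), Para("World"), Para("Hello"), Para("Europe")]
third = blocks[2]

# list.index compares with ==, so it stops at the first *equal* item.
equality_index = blocks.index(third)

# Comparing with `is` finds the position of this exact object instead.
identity_index = next(i for i, block in enumerate(blocks) if block is third)
```

Here equality_index points at the first "Hello" paragraph, while identity_index points at the third block that was actually asked about.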

To reproduce this, I created a Markdown file with 4 paragraphs:

Hello

World

Hello

Europe

I exported this to JSON with pandoc and used it here:

import io
import panflute as pf

data = io.StringIO(
    '{ "pandoc-api-version": [1, 22, 2, 1], "meta": { }, "blocks": [{ "t": "Para", "c": [{ "t": "Str", "c": "Hello" }] }, { "t": "Para", "c": [{ "t": "Str", "c": "World" }] }, { "t": "Para", "c": [{ "t": "Str", "c": "Hello" }] }, { "t": "Para", "c": [{ "t": "Str", "c": "Europe" }] }] }'
)

doc = pf.load(data)

for idx, element in enumerate(doc.content):
    assert element.index == idx, "Invalid index: expected {} found {}".format(
        idx, element.index
    )

This produced the following output, because the third block has the same content as the first:

Traceback (most recent call last):
  File "K:\Data\private\pandoc\bug.py", line 22, in <module>
    assert element.index == idx, "Invalid index: expected {} found {}".format(
AssertionError: Invalid index: expected 2 found 0

Thanks for this great project, and sorry if I missed something; this is my first bug report ever.

lewer commented 1 year ago

Thanks! This helped me, because with this PR Element.index is much faster. Before, on a paragraph of 1000 words, it took approx. 1 ms on my machine to compute e.index for an element e at the end of the paragraph. Now it takes only 1 μs, and I'm able to run my filter in 4 s instead of 6 s.

sergiocorreia commented 1 year ago

BTW I'm pretty sure there's a lot of low hanging fruit in terms of speedups; after all 4s on a multicore gigahertz computer is not super fast, even within Python :)

lewer commented 1 year ago

lol you don't even know what my filter does! 4s is not so bad in my case. But yes, could probably be improved. As far as multicore is concerned, it's not obvious how panflute could use it... I think pandoc runs on single core.