Have you tried using --verbose to see the timing of each filter? That may give you more information on how pandoc is spending its time.
Hi Adam,
I haven't delved much into optimizations because for my workflows panflute has always been a fraction of the time (e.g. if building a PDF takes 62 seconds, 60 is spent on the latex step, 1.5 is spent on pandoc, and 0.5 on panflute).
Can you give me a MWE of the script + a markdown file I can run, so I can use it as a benchmark?
Also, a quick scan of urltitle suggests that you are not just parsing the title but actually fetching each of the websites? If that's the case, one option is to do it in three steps: first get the list of URLs, then run them through a parallelized requests library (haven't used one, but this might help), and at the end replace each URL with a title based on what you fetched.
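For illustration, here is a minimal sketch of the parallel fetch step using the standard library's thread pool. The `URLTitleReader` usage follows urltitle's README; whether a single reader can safely be shared across threads is an assumption, and `fetch_title`/`fetch_titles` are hypothetical helper names:

```python
from concurrent.futures import ThreadPoolExecutor

from urltitle import URLTitleReader

reader = URLTitleReader()  # assumption: one reader can be shared across threads

def fetch_title(url):
    """Return the page title for a URL, falling back to the URL itself."""
    try:
        return reader.title(url)
    except Exception:
        return url

def fetch_titles(urls, max_workers=8):
    """Fetch titles for many URLs concurrently; returns a url -> title dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each url with its title
        return dict(zip(urls, pool.map(fetch_title, urls)))
```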
> Can you give me a MWE of the script + a markdown file I can run, so I can use it as a benchmark?
Sure, I've added this as a todo for the weekend.
I don't think async is a good idea. You cannot guarantee that a filter's actions (and a filter can be anything) are independent of execution order. (Please prove me wrong if I'm mistaken.)
A better way of writing a filter like this is to run it in two passes: in the first pass, gather the URLs, then do your own requests (e.g. a multithreaded map, or fancier things such as encode/httpx) and cache the results, say in a dict; in the second pass, modify your content using the cached results, so that the per-element action is now instantaneous. A sketch of this structure follows.
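A minimal sketch of the two-pass structure with panflute. The bare-link test (`pf.stringify(elem) == elem.url`) is simplified, and `fetch_titles` is a placeholder standing in for the parallel fetcher sketched above:

```python
import panflute as pf

def fetch_titles(urls):
    """Placeholder for the parallel fetcher sketched above (url -> title)."""
    return {url: url for url in urls}

def collect_urls(elem, doc):
    """Pass 1: record every bare link (text identical to its target)."""
    if isinstance(elem, pf.Link) and pf.stringify(elem) == elem.url:
        doc.urls.add(elem.url)

def replace_urls(elem, doc):
    """Pass 2: rewrite each bare link using the cached titles."""
    if isinstance(elem, pf.Link) and pf.stringify(elem) == elem.url:
        title = doc.titles.get(elem.url, elem.url)
        return pf.Link(pf.Str(title), url=elem.url)

def main():
    doc = pf.load()                              # read the AST from stdin once
    doc.urls = set()
    doc.walk(collect_urls)                       # first pass: gather URLs
    doc.titles = fetch_titles(sorted(doc.urls))  # one parallel fetch in between
    doc = doc.walk(replace_urls)                 # second pass: dict lookups only
    pf.dump(doc)

if __name__ == '__main__':
    main()
```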
For now I will close this, since I don't think async will work for general filters, and I think the suggestions given here are enough to solve the problem at hand.
Feel free to reopen if what I said turns out not to be true.
A few extra thoughts: the URL fetching could also be done once up front and cached on the document itself (e.g. in doc.results). Then, the filter can just fetch from doc.results. This puts the onus of dealing with multithreading/multiprocessing/async on the filter itself (which is fine).
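For concreteness, a minimal sketch of that pattern using panflute's prepare hook. Here doc.results is just an ad-hoc attribute name (panflute does not define it), and the import of `fetch_titles` is hypothetical, standing in for the parallel fetcher sketched earlier:

```python
import panflute as pf

from titles import fetch_titles  # hypothetical module with the fetcher above

def prepare(doc):
    """Runs once before the filter: gather URLs, fetch all titles in parallel."""
    urls = set()

    def collect(elem, doc):
        if isinstance(elem, pf.Link):
            urls.add(elem.url)

    doc.walk(collect)
    doc.results = fetch_titles(sorted(urls))  # cache: url -> title

def action(elem, doc):
    """Per-element work is now just a dict lookup."""
    if isinstance(elem, pf.Link) and pf.stringify(elem) == elem.url:
        return pf.Link(pf.Str(doc.results.get(elem.url, elem.url)), url=elem.url)

def main(doc=None):
    return pf.run_filter(action, prepare=prepare, doc=doc)

if __name__ == '__main__':
    main()
```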
Thanks for the great program, @sergiocorreia!
I wrote a filter that replaces bare URLs with a markdown-formatted link, which requires requesting each URL and parsing the title using https://github.com/impredicative/urltitle.
It works great, but it's pretty slow on a very large file. For instance, it takes 22 seconds to process a file containing 45 URLs. The same file takes only 5.5 seconds to process on the command line using
… | parallel url-to-title
Is there a standard way to execute filters in parallel, or is this generally not needed?