Have you tried using --verbose to see the timing of each filter? That may give you more information on how pandoc is spending its time.
Hi Adam,
I haven't delved much into optimizations because for my workflows panflute has always been a fraction of the time (e.g. if building a PDF takes 62 seconds, 60 is spent on the latex step, 1.5 is spent on pandoc, and 0.5 on panflute).
Can you give me a MWE of the script + a markdown file I can run, so I can use it as a benchmark?
Also, a quick scan of urltitle suggests that you are not just parsing the title but actually fetching each of the websites? If that's the case, one option is to do it in three steps: first get the list of URLs, then run them through a parallelized requests library (haven't used one, but this might help), and at the end replace each URL with a title based on what you fetched.
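For illustration, here is a minimal sketch of the parallel fetch step using the standard library's thread pool. The `URLTitleReader` usage follows urltitle's README; whether a single reader can safely be shared across threads is an assumption, and `fetch_title`/`fetch_titles` are hypothetical helper names:

```python
from concurrent.futures import ThreadPoolExecutor

from urltitle import URLTitleReader

reader = URLTitleReader()  # assumption: one reader can be shared across threads

def fetch_title(url):
    """Return the page title for a URL, falling back to the URL itself."""
    try:
        return reader.title(url)
    except Exception:
        return url

def fetch_titles(urls, max_workers=8):
    """Fetch titles for many URLs concurrently; returns a url -> title dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each url with its title
        return dict(zip(urls, pool.map(fetch_title, urls)))
```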
> Can you give me a MWE of the script + a markdown file I can run, so I can use it as a benchmark?
Sure, I've added this as a todo for the weekend.
I don't think async is a good idea. You cannot guarantee that a filter's actions (and a filter can be anything) are independent of execution order. (Please prove me wrong if I'm mistaken.)
A better way of writing a filter like this is to run it in two passes: in the first pass, gather the URLs, then do your own requests (e.g. a multithreaded map, or fancier things such as encode/httpx) and cache the results, say in a dict; in the second pass, modify your content using the cached results, so that the per-element action is now instantaneous. A sketch of this structure follows.
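A minimal sketch of the two-pass structure with panflute. The bare-link test (`pf.stringify(elem) == elem.url`) is simplified, and `fetch_titles` is a placeholder standing in for the parallel fetcher sketched above:

```python
import panflute as pf

def fetch_titles(urls):
    """Placeholder for the parallel fetcher sketched above (url -> title)."""
    return {url: url for url in urls}

def collect_urls(elem, doc):
    """Pass 1: record every bare link (text identical to its target)."""
    if isinstance(elem, pf.Link) and pf.stringify(elem) == elem.url:
        doc.urls.add(elem.url)

def replace_urls(elem, doc):
    """Pass 2: rewrite each bare link using the cached titles."""
    if isinstance(elem, pf.Link) and pf.stringify(elem) == elem.url:
        title = doc.titles.get(elem.url, elem.url)
        return pf.Link(pf.Str(title), url=elem.url)

def main():
    doc = pf.load()                              # read the AST from stdin once
    doc.urls = set()
    doc.walk(collect_urls)                       # first pass: gather URLs
    doc.titles = fetch_titles(sorted(doc.urls))  # one parallel fetch in between
    doc = doc.walk(replace_urls)                 # second pass: dict lookups only
    pf.dump(doc)

if __name__ == '__main__':
    main()
```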
For now I will close this, since I don't think async will work for general filters, and I think the suggestions given here are enough to solve the problem at hand.
Feel free to reopen if what I said turns out not to be true.
A few extra thoughts: the URL fetching could also be done once up front and cached on the document itself (e.g. in doc.results). Then, the filter can just fetch from doc.results. This puts the onus of dealing with multithreading/multiprocessing/async on the filter itself (which is fine).
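For concreteness, a minimal sketch of that pattern using panflute's prepare hook. Here doc.results is just an ad-hoc attribute name (panflute does not define it), and the import of `fetch_titles` is hypothetical, standing in for the parallel fetcher sketched earlier:

```python
import panflute as pf

from titles import fetch_titles  # hypothetical module with the fetcher above

def prepare(doc):
    """Runs once before the filter: gather URLs, fetch all titles in parallel."""
    urls = set()

    def collect(elem, doc):
        if isinstance(elem, pf.Link):
            urls.add(elem.url)

    doc.walk(collect)
    doc.results = fetch_titles(sorted(urls))  # cache: url -> title

def action(elem, doc):
    """Per-element work is now just a dict lookup."""
    if isinstance(elem, pf.Link) and pf.stringify(elem) == elem.url:
        return pf.Link(pf.Str(doc.results.get(elem.url, elem.url)), url=elem.url)

def main(doc=None):
    return pf.run_filter(action, prepare=prepare, doc=doc)

if __name__ == '__main__':
    main()
```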
Thanks for the great program, @sergiocorreia!
I wrote a filter that replaces bare URLs with a markdown-formatted link, which requires requesting each URL and parsing the title using https://github.com/impredicative/urltitle.
It works great, but it's pretty slow on a very large file. For instance, it takes 22 seconds to process a file containing 45 URLs. The same file takes only 5.5 seconds to process on the command line using
… | parallel url-to-title
Is there a standard way to execute filters in parallel, or is this generally not needed?