pelican-plugins / pandoc-reader

Pandoc Reader is a Pelican plugin that processes Markdown content via Pandoc

Migrating to and processing speed #11

Closed reagle closed 3 years ago

reagle commented 3 years ago

Hi, I believe I've experimented with this plugin in the past, but I see its configuration commented out with the annotation that it was too slow -- over 10 minutes.

Trying the version on PyPI, it took 77 seconds across my ~300 files. However, this was skipping each one because the metadata isn't correct.

ERROR: Skipping /Users/reagle/data/2web/reagle.org/joseph/content/social/chain-letters.md: could not find information about 'title'

I'd like to give this an honest go, before that:

  1. Does anyone have a utility that converts pelican metadata to pandoc YAML?
  2. Is the version on PyPI likely to get faster? I fear that once this plugin doesn't skip my ~300 files (including entries that use a ~6MB YAML bibliography) it will be onerously slow.

nandac commented 3 years ago

@reagle I do not know of any ready-made utility that converts Pelican metadata to YAML. However, let us say you have a file called test.md with the following contents:

Title: My super title
Date: 2010-12-03 10:20
Modified: 2010-12-05 19:30
Category: Python
Tags: pelican, publishing
Slug: my-super-post
Authors: Alexis Metaireau, Conan Doyle
Summary: Short version for index and feeds

My first sentence.

You could use the script below to modify the frontmatter into the format expected by Pandoc.

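#!/usr/bin/env python3
"""Convert Pelican-style metadata in test.md to a Pandoc YAML block in test2.md."""
import re

# A metadata line looks like "Key: value".
PATTERN = r"^.*?:\s*?.*$"

def main():
    with open("test.md", "r") as file_handle:
        lines = file_handle.readlines()

    # Collect the leading metadata lines, lowercasing the first letter of each key.
    body_start = len(lines)
    new_lines = []
    for index, line in enumerate(lines):
        if re.match(PATTERN, line):
            # Strings are immutable, so build the new line instead of calling replace().
            new_lines.append(line[0].lower() + line[1:])
        else:
            body_start = index
            break

    # Wrap the metadata in YAML delimiters, then append the rest of the file.
    new_lines.insert(0, "---\n")
    new_lines.append("---\n")
    new_lines.extend(lines[body_start:])

    with open("test2.md", "w") as file_handle:
        file_handle.writelines(new_lines)

if __name__ == "__main__":
    main()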

This writes the following lines to a file called test2.md:

---
title: My super title
date: 2010-12-03 10:20
modified: 2010-12-05 19:30
category: Python
tags: pelican, publishing
slug: my-super-post
authors: Alexis Metaireau, Conan Doyle
summary: Short version for index and feeds
---

My first sentence.

The script above is not the most efficient or robust but it gets the job done. You will have to modify it slightly to fit your environment.

We are in the process of releasing a new version of the plugin which may solve your speed issues.

If you would like to try out the code before we release it, you may download the directory pelican/plugins/pandoc-reader and place it in your plugins directory. You will also need the following dependencies installed locally:

Of course, you will also need Pandoc 2.11 or higher installed.

Hope that helps.

reagle commented 3 years ago

@nandac, thank you for the suggested script, but that would cause me tons of headaches over hundreds of entries. A more robust version would have to isolate its changes to the first double CR/LF delimited chunk of text, and then be able to handle quotes and colons in the title (e.g., title: "I am a quote": an entry about a quote). Given I fear this will be unusably slow anyway, I was hesitant to invest time in developing something more robust -- which is why I asked.
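The two requirements named above (touch only the first blank-line-delimited chunk, and survive quotes and colons in a value) can be sketched roughly as follows. This is a minimal illustration, not the plugin's code and not any existing utility: the function name convert_front_matter is hypothetical, Unix line endings are assumed, and quoting every value as a string (dates and tags included) is a simplifying assumption that may not suit every field.

```python
import json


def convert_front_matter(text: str) -> str:
    """Rewrite a leading Pelican 'Key: value' block as a Pandoc YAML block."""
    # Only the chunk before the first blank line is treated as metadata.
    header, separator, body = text.partition("\n\n")
    if not header.strip():
        return text
    yaml_lines = ["---"]
    for line in header.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        key, colon, value = line.partition(":")
        if not colon or not key.strip():
            # Not a Pelican metadata block after all: leave the text untouched.
            return text
        # json.dumps produces a double-quoted, escaped string, which is also
        # a valid YAML double-quoted scalar.
        quoted = json.dumps(value.strip(), ensure_ascii=False)
        yaml_lines.append(f"{key.strip().lower()}: {quoted}")
    yaml_lines.append("---")
    return "\n".join(yaml_lines) + separator + body


sample = (
    'Title: "I am a quote": an entry about a quote\n'
    "Date: 2010-12-03 10:20\n"
    "\n"
    "My first sentence.\n"
)
print(convert_front_matter(sample))
```

Text that already starts with a "---" line is returned unchanged, because that line contains no colon, so running the sketch twice over the same entry should be harmless.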

That said, a possible approach would be to simply use pandoc only if the markdown file is pandoc-YAML markdown (e.g., as a hack, test if the first three non-whitespace characters are "---") and otherwise use the native parsing. That way, folks don't need to convert their old entries, they can retain the benefit of faster/simpler parsing on simpler entries, and use more sophisticated pandoc markdown when needed.
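The proposed check is small enough to sketch. The function names and the "pandoc"/"native" labels below are hypothetical and only illustrate the dispatch idea; nothing here is part of the plugin.

```python
def looks_like_pandoc_yaml(text: str) -> bool:
    """Return True if the first three non-whitespace characters are '---'."""
    return text.lstrip().startswith("---")


def choose_parser(text: str) -> str:
    """Pick a (hypothetical) parser label for a Markdown source string."""
    return "pandoc" if looks_like_pandoc_yaml(text) else "native"


pelican_style = "Title: My super title\nDate: 2010-12-03 10:20\n\nMy first sentence.\n"
pandoc_style = "\n---\ntitle: My super title\n---\n\nMy first sentence.\n"

print(choose_parser(pelican_style))  # native
print(choose_parser(pandoc_style))   # pandoc
```

Because the test reads only the start of the file, it adds very little cost per entry compared with running either parser.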

nandac commented 3 years ago

@reagle I understand how tedious it would be to change hundreds of files.

In terms of speed, on a website I am developing that uses citations for some blog posts, it took approximately 0.5 seconds per Markdown document. Therefore, 300 files would take approximately 2.5 minutes to process.

Although this version of the plugin does check for a valid Pandoc YAML block, it does not have the capability to fall back to native parsing.

@justinmayer is there a way to fall back to native parsing if the expected YAML block is not found?

kdeldycke commented 3 years ago

Thanks @nandac for the script! I did that kind of transformation some years ago with a couple of one liners. You can find them here: https://kevin.deldycke.com/2006/12/text-date-document-processing-commands/#replace

nandac commented 3 years ago

@kdeldycke Thank you for the link to your blog; it looks great.

My sed skills are rudimentary, but sed is the best tool for this sort of job.

reagle commented 3 years ago

@nandac I assume the caching mechanism still works, so if the ingress can still quickly skip untouched files, process only touched files, and quickly distinguish between native and pandoc markdown, it might be worthwhile.

Presently, I do sometimes write entries with bibliographies and I then have to use pandoc to convert them to pelican markdown (with changes to the header). If I then edit, I have to do it again, which is a nuisance, which is why I'm interested in this plugin.

nandac commented 3 years ago

@reagle The plugin does not support Pelican's markdown metadata and strictly only looks for and supports Pandoc's YAML metadata. The first time the plugin is used it will run through every file to convert it to HTML5, so those files that do not follow the YAML metadata format will fail processing and an error will be reported.

Secondly, the plugin does a soup-to-nuts conversion of Pandoc's Markdown to HTML5, so bibliographies are supported as stated in the README.

If you are worried about processing speed, the best thing would be to test out the plugin, by downloading the code base as I stated above, and try it out on one of your files, and see how long it takes.

I may not be understanding your use case in full so perhaps you could give me a minimal example of the types of files you are dealing with and we can take it from there.

nandac commented 3 years ago

@reagle The new version of the plugin has been released on PyPI: https://pypi.org/project/pelican-pandoc-reader/

reagle commented 3 years ago

After installing 1.0, it becomes the parser even though I haven't enabled/specified the plugin in pelicanconf.py. I have a few blogs, and I might not want this parser to be required on all of them.

nandac commented 3 years ago

@reagle I think the behavior you see is how plugins work in Pelican 4.5 according to the documentation. Merely installing it enables it.

I think you can disable/enable the plugin by tweaking the PLUGINS setting in pelicanconf.py. However, I am not sure how you would conditionally disable/enable the plugin for a subset of blogs.

You might want to get in touch with the Pelican team on IRC for workarounds.

justinmayer commented 3 years ago

As @nandac accurately mentioned, as of Pelican 4.5, namespace plugin registration occurs automatically upon installation. You can use the PLUGINS setting to instead explicitly list the plugins to be enabled.
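As a rough illustration of that setting, a pelicanconf.py fragment might look like the one below. The short name "pandoc_reader" is an assumption about how this plugin is named under the pelican.plugins namespace; check the plugin's README for the exact name. The setting governs which plugins are registered.

```python
# pelicanconf.py (fragment)
# Providing an explicit list turns off auto-discovery of namespace plugins,
# so every plugin you want registered has to be named here.
# "pandoc_reader" is an assumed short name, not confirmed in this thread.
PLUGINS = ["pandoc_reader"]
```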

reagle commented 3 years ago

"If, on the other hand, you specify a PLUGINS setting as a list of plugins, this auto-discovery will be disabled. At that point, only the plugins you specify will be registered, and you must explicitly list any namespace plugins as well."

Thank you for the pointer, but how do you disable auto-discovery when you have a single plugin installed for a different blog, but don't want to use it on the current one? The following configuration still runs the plugin: PLUGINS = []. I tried specifying a non-existent plugin too, to no effect: PLUGINS = ["Null"].

avaris commented 3 years ago

Ah... Readers are a bit magical. Merely importing it causes them to be activated (kind of), without actually registering them[*]. pelican currently imports all namespace plugins as part of discovery: to know what is available and register the necessary ones.
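One common way that kind of import-time activation arises is subclass discovery: a framework looks up every subclass of a base class, so simply defining (importing) a subclass makes it visible with no register call. The sketch below shows that general mechanism only. All class and function names are hypothetical, and it is not a copy of Pelican's code.

```python
class BaseReader:
    """Hypothetical stand-in for a framework's reader base class."""
    file_extensions = []


class NativeMarkdownReader(BaseReader):
    file_extensions = ["md"]


def build_reader_table():
    """Map each file extension to the most recently defined reader class."""
    table = {}
    # __subclasses__() lists direct subclasses in definition order,
    # so a later definition overrides an earlier one for the same extension.
    for cls in BaseReader.__subclasses__():
        for ext in cls.file_extensions:
            table[ext] = cls
    return table


before = build_reader_table()["md"].__name__


# Stands in for importing a plugin module: the class statement runs,
# and the subclass becomes discoverable without any explicit registration.
class PandocReader(BaseReader):
    file_extensions = ["md"]


after = build_reader_table()["md"].__name__
print(before, "->", after)
```

Under a scheme like this, leaving a plugin out of a settings list does not help once its module has been imported, which is consistent with the suggestion to isolate installations per blog.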

If you have multiple blogs with different requirements, I'd suggest using virtualenvs and different installations of pelican (+plugins).

[*] That behavior may be changed.

reagle commented 3 years ago

I've converted my two blogs to use this plugin and am so pleased. Thank you! pel2pan.py is a fairly robust script for converting the metadata.

reagle commented 3 years ago

BTW: If you wanted to include pel2pan.py in this repo, feel free.

nandac commented 3 years ago

@reagle I was thinking of linking to your code from the README, as it is not really part of the plugin code. That way users can use it separately from the plugin code.

Your thoughts?

reagle commented 3 years ago

It might be easier to take issues for and maintain it here, but either way is fine. If you want it here, just let me know, I'm happy to switch the license.