mondeja / mkdocs-include-markdown-plugin

Mkdocs Markdown includer plugin
Apache License 2.0
111 stars 20 forks source link

Add new argument `recursive` to `include` directive #207

Closed carlocastoldi closed 6 months ago

carlocastoldi commented 6 months ago

If the any of the markdown—both the ones checked for tags and the imported ones—have very long line of text (i.e. <100_000 characters), the compiled regex takes too much time (minutes) to check the file resulting in the build process halting.

This happens, in my case, when embedding in my markdowns some plotly html figures (saved with full_html=False, include_plotlyjs='cdn').

However, this can be easily tested with the following code as well:

import re
import string

DOUBLE_QUOTED_STR_RE = r'([^"]|(?<=\\)["])+'
SINGLE_QUOTED_STR_RE = r"([^']|(?<=\\)['])+"

# In the following regular expression, the substrings "$OPENING_TAG"
# and "$CLOSING_TAG" will be replaced by the effective opening and
# closing tags in the `on_config` plugin event.
INCLUDE_TAG_RE = rf'''
    (?P<_includer_indent>[ \t\f\v\w{re.escape(string.punctuation)}]*?)$OPENING_TAG
    \s*
    include
    \s+
    (?:"(?P<double_quoted_filename>{DOUBLE_QUOTED_STR_RE})")?(?:'(?P<single_quoted_filename>{SINGLE_QUOTED_STR_RE})')?
    (?P<arguments>.*?)
    \s*
    $CLOSING_TAG
'''  # noqa: E501

def create_include_tag(
        opening_tag: str, closing_tag: str, tag: str = 'include',
        flags: re.RegexFlag = re.VERBOSE | re.DOTALL
) -> re.Pattern[str]:
    """Create a regex pattern to match an inclusion tag directive.

    Replaces the substrings '$OPENING_TAG' and '$CLOSING_TAG' from
    INCLUDE_TAG_REGEX by the effective tag.
    """
    return re.compile(
        INCLUDE_TAG_RE.replace(' include', f' {tag}').replace(
            '$OPENING_TAG', re.escape(opening_tag),
        ).replace('$CLOSING_TAG', re.escape(closing_tag)),
        flags=re.VERBOSE | re.DOTALL,
    )

pattern = create_include_tag("{%", "%}", tag='include-markdown')
long_as_s__t = "a"*200_000
pattern.search(long_as_s__t[:10_000])    # fine
pattern.search(long_as_s__t[:100_000])   # NOT fine!

Temporary solution

As a temporary solution i modified locally event.py/get_file_content() as such to just exclude lines longer than 10_000 characters. Obviously this is a bad implementation, but it could serve as a blueprint for a new option for the plugin!

def get_file_content([...]) -> str:
    """Return the content of the file to include."""
    def found_include_tag([...]) ->str:
    [...]
    def found_include_markdown_tag([...]) ->str:
    [...]

    markdown_lines = markdown.splitlines()
    long_lines_indices = [(i,line) for i,line in enumerate(markdown_lines) if len(line) > 10_000]
    for i,_ in sorted(long_lines_indices, reverse=True, key=lambda t: t[0]):
        markdown_lines[i] = "~~~REMOVED LINE~~~"
    markdown = "\n".join(markdown_lines)

    markdown = tags['include'].sub(
        found_include_tag,
        markdown,
    )
    markdown = tags['include-markdown'].sub(
        found_include_markdown_tag,
        markdown,
    )

    markdown_lines = markdown.splitlines()
    for i in range(len(markdown_lines)):
        line = markdown_lines[i]
        if line == "~~~REMOVED LINE~~~":
            _,markdown_lines[i] = long_lines_indices.pop(0)
    return "\n".join(markdown_lines)
mondeja commented 6 months ago

This happens, in my case, when embedding in my markdowns some plotly html figures...

What about including these figures with the include directive instead of embedding them in Markdown files? As far as I remember, it will cause regular expressions not to be executed against that content.

carlocastoldi commented 6 months ago

mmmh no, that doesn't appear to have any positive effects :\

mondeja commented 6 months ago

What about a recursive argument to limit the number of calls to the internal recursivity? That would allow to call {% include 'your-long-file.ext' recursive=false %} and will stop trying to search for more inclusions in nested files, so the regex will not be applied to the included content.

carlocastoldi commented 6 months ago

Mmmh no.

Not even swapping the substitution so that it first does found_include_markdown_tag (resulting in a markdown with no long lines since there are no include-markdown tags) and then found_include_tag.

Are you sure that this recursive flag exists in the latest release? I can't find any occurrency in the files 🤔

mondeja commented 6 months ago

Sorry for the misconfusion, I'm wondering if something like implementing a recursive argument for directives would solve this issue. Currently, does not exists.

carlocastoldi commented 6 months ago

Oh sorry, i misread! But yeah. That sounds like a perfect idea. Just be sure to allow it only for one of the two tags, and apply it as the last substitution. Otherwise it would get stuck anyway on the second one

carlocastoldi commented 6 months ago

I implemented it, but now it seems to conflict with mkdocs-jupyter. Do you have any idea why it didn't previously and now it does?

I never import a .ipynb with include-markdown, and yet this change seems to affect it :o

EDIT: you can check it here https://github.com/mondeja/mkdocs-include-markdown-plugin/commit/004871ed2b745d775e41d13ef02122c7d30f3041

mondeja commented 6 months ago

I implemented it, but now it seems to conflict with mkdocs-jupyter. Do you have any idea why it didn't previously and now it does?

Looks good. What do you mean with "conflict with mkdocs-jupyter"?

carlocastoldi commented 6 months ago

as in, with my previous dumb implementation (https://github.com/mondeja/mkdocs-include-markdown-plugin/compare/master...carlocastoldi:mkdocs-include-markdown-plugin:longass_lines_fix) i could have mkdocs-jupyter installed and include in my docs/ folder an .ipynb with no issues and have it as a page on the right menu.

With the recurse fix, instead, mkdocs serve successfully processes and produces the markdown page with the included Plotly figure, but ends up getting stuck on INFO - Processing page "notebook.ipynb"...

carlocastoldi commented 6 months ago

could it be that, for some reason, mkdocs-jupyter triggers include-markdown again on the newly substituted markdown text, thus forcing it to process again a very long text anyway?

mondeja commented 6 months ago

Not sure. You can add logger statements to see if mkdocs-include-markdown-plugin is including again.

carlocastoldi commented 6 months ago

Yup.

include-markdown processes also the notebook. However if the notebook has saved (as often is wanted) a plotly figure, the output lines are too long and its back at the same problem. However this time it's directly on a ~primary page and not on a linked page

Do you see a possibility where an user of mkdocs-jupyter might want to use include-markdown to modify a jupyter markdown block? Could we start by avoiding processing .ipynb files all together?

mondeja commented 6 months ago

Released on v6.1.0. Thanks for your work, let me know if you still experiment these kind of problems.

carlocastoldi commented 6 months ago

I'm sorry, but i don't really understand why you reverted this change in 2582dd534d273653706b5524fbe4f635959bb25c

The whole reason I contributed was so that i could fix this issue for everybody else as well and not just me. If you keep the order of execution of substitutions, however, the error persists as well explained in https://github.com/mondeja/mkdocs-include-markdown-plugin/pull/208#discussion_r1619477020

Thank you for adding tests and translations, tho!

mondeja commented 6 months ago

Ah, OK. I was doubting why that change was kept, I believed that you forgot to revert it because that problem would be fixed by using exclude. Sorry for my confusion, I'll release a patch ASAP.

carlocastoldi commented 6 months ago

It's okay. Thanks a lot for this amazing work you're doing, because it's such a handy tool you've built and maintained!