vsch / flexmark-java

CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.
BSD 2-Clause "Simplified" License
2.29k stars 271 forks source link

Jekyll-style includes completely broken #404

Open nanodeath opened 4 years ago

nanodeath commented 4 years ago

Describe the bug As far as I can tell, Jekyll includes ({% include foo %}) aren't tested and don't work actually work. transferReferences work, but the content isn't actually moved over.

To Reproduce

The provided sample file here reproduces the issue.

  1. Options used to configure the parser, renderer, formatter, etc. Please provide concise code when you can.

Per the example, this configuration should be sufficient:

options.set(Parser.EXTENSIONS, Arrays.asList(JekyllTagExtension.create()));

Expected behavior

When I {%include%} another file, I expect all the content from that file to be inlined into the include site in the output.

In the case of the example, the word "Included" should show up somewhere in the output.

Resulting Output

<p>http://github.com/vsch/flexmark-java</p>
<p><a href="http://example.com">ref</a></p>

Additional context

I saw this issue originally in my own code using "com.vladsch.flexmark:flexmark-all:0.61.16", and also the latest in the repo, by downloading the entire thing as a zip, at https://github.com/vsch/flexmark-java/commit/bdf33bb8570e08b9a014184ebdfc00058dfe093c.

vsch commented 4 years ago

@nanodeath, I forgot to update the JekyllIncludeFileSample.java to reflect the new implementation of Jekyll include tag processing.

Now embedding of content is done during parsing by IncludeNodePostProcessor.java. This means that include content map must be available before the document is parsed.

Effectively, if the embedding is done via a map which is computed from the document AST then the document needs to be parsed twice. Once to compute the map, then again with the defined map to embed the content.

If the include processing is done via a LinkResolver(s) which map include paths to absolute file://... URLs then the file content will be embedded from the file system, without the need to parse twice.

As an alternative, content embedding can be done without the double parsing by duplicating the processing in the IncludeNodePostProcessor.java which appends new child nodes to the JekyllTagBlock to hold the embedded content.

The change in processing was done to allow other (non-HTML) renderers to easily handle embedded content, since the original mechanism only supported embedding HTML.

vsch commented 4 years ago

@nanodeath, JekyllIncludeFileSample.java above is an updated working example which uses html content map.

vsch commented 4 years ago

@nanodeath, I am considering adding another extension point API: UriContentProvider to provide content for URLs. At the moment file:// URI is handled by code which needs content.

DocxRenderer uses image content for embedding and conversion, JekyllTagExtension uses it for embedding.

This is not reusable and hard-coded. Having this extension point will make it easy to handle any type of content including dynamic.

nanodeath commented 4 years ago

Sorry, commented from my work account, so I deleted that. I'll repeat my reply to your original comment here for some semblance of continuity:


This means that include content map must be available before the document is parsed.

I actually saw this and was poking through the source code trying to get this approach to work, but ended up giving up -- I could see the IncludeNodePostProcessor was getting picked up, but then never actually invoked.

Can you give me a simple example that should work?

nanodeath commented 4 years ago

The end goal here is to replicate Jekyll-style "layouts" where there's a common header/footer used across all pages that include in different content depending on the article being rendered, if that helps any :)

vsch commented 4 years ago

@nanodeath, If you create an HTML content map without examining what files the document actually includes then you can create the content map and set the options before creating the parser.

In your case, all included files are located in the _includes directory.

Since the include tag does not use a path, references from all markdown documents do not depend on the relative path of the document.

Compute the map of file name to rendered file content in for all files in the _includes directory. You might have to create aliases with and without an extension (not sure since I do not work with Jekyll and my knowledge is based on what I read in the documentation).

Then set options used to create the parser, includeHtmlMap is Map<String, String> from file name in the include tag to the HTML content of the file:

 options.set(JekyllTagExtension.INCLUDED_HTML, includeHtmlMap);
 options.set(JekyllTagExtension.EMBED_INCLUDED_CONTENT, true);
vsch commented 4 years ago

@nanodeath, I added a modified JekyllIncludeFileSample2.java which computes the html content map in advance and only parses the document once.

nanodeath commented 4 years ago

Thanks @vsch, I'll give it a try. I think in my most recent experimenting I missed the EMBED_INCLUDED_CONTENT option.

nanodeath commented 4 years ago

Eyyy, I finally got it working. Your JekyllIncludeFileSample2 example worked fine for me, and eventually I got my code working too.

Besides EMBED_INCLUDED_CONTENT, I was also missing extra newlines around the include -- looks like it requires at least two newlines before and after. I'm guessing this is necessary for it to be parsed as a JekyllTagBlock/JekyllTag? It's a bit surprising behavior that without the newlines, the include is silently omitted from the output -- no error, no {%include%}, no included content.

Anyway, thanks for your help and quick turnaround.

vsch commented 4 years ago

@nanodeath, I will take a look at why inline include tags are silently ignored. They should either be left as text or handled as includes. Could be a bug.

vsch commented 4 years ago

@nanodeath, can confirm that inline include tag implementation was broken. Now fixed.

For inline include tags, If included file contains a single paragraph then the paragraph is unwrapped so only the text is included. Otherwise, the included file AST is included as is, which may result in invalid HTML if included file has block elements since these will be included in the p tag of including element.

:information_source: when include tag embedding is enabled, the included file elements will be added as child nodes of the JekyllTag node, not JekyllTagBlock node as was done in previous versions.

Fix for this is available. Repo updated, maven updated but may take a while to show up in maven central.