vsch / flexmark-java

CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.
BSD 2-Clause "Simplified" License

How to self modify parse method when htmltomarkdown #353

Open gaofeiseu opened 5 years ago

gaofeiseu commented 5 years ago

Is your feature request related to a problem? Please describe. Hi, I am from China. flexmark is a really good tool, but during development I ran into a problem. I need to convert HTML to Markdown, but some tags in the HTML have an unusual src, like this: <img src="//img.alicdn.com/tfscom/TB1mR4xPpXXXXXvapXXXXXXXXXX.jpg" >. Such a src is not converted to Markdown that behaves correctly.

Describe the solution you'd like How can I change how the src of an img tag is parsed, through extension options? I would like <img src="//abc.com/cde/efg.jpg" > to be converted to ![](https://abc.com/cde/efg.jpg)

Describe alternatives you've considered Are there extension options for this, or existing options that I have overlooked?

Additional context

vsch commented 5 years ago

@gaofeiseu, I tried the HTML you gave and the converted Markdown seems to be correct. In the sample below, the Markdown is the first line and the HTML is the last:

![](//img.alicdn.com/tfscom/TB1mR4xPpXXXXXvapXXXXXXXXXX.jpg)
.
<img src="//img.alicdn.com/tfscom/TB1mR4xPpXXXXXvapXXXXXXXXXX.jpg" >

Can you create a small test with options which does not work for you?

You can use the sample as a starting point and add the configuration you use in your code:

HtmlToMarkdownSample.java

vsch commented 5 years ago

@gaofeiseu, sorry, I just realized what you really wanted was to add "https:" prefix to the image URL if it is missing.

The easiest way to do this in the current implementation is to use the standard HTML parser to get the Markdown, then parse the Markdown and replace the URLs in the AST with what you want before passing the AST document node to formatter, which will output the changed Markdown.

The sample FormatterWithMods.java shows how to change the URLs in the AST so that the formatted Markdown has replaced URLs.

All you need to do is replace the logic in FormatterWithMods.java: Lines 68-71 with:

            if (node.getPageRef().startsWith("/")) {
                node.setUrlChars(PrefixedSubSequence.of("https:", node.getPageRef()));
                node.setChars(SegmentedSequence.of(Arrays.asList(node.getSegmentsForChars())));
            }

This prefixes every URL that starts with / with https:
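Note that a check for a single leading / also matches site-relative paths such as /images/a.png, not only protocol-relative URLs such as //abc.com/cde/efg.jpg. As a minimal sketch in plain Java (no flexmark types involved), the decision could be narrowed as follows. The class and method names are hypothetical and are not part of flexmark or of the sample:

```java
// Hypothetical helper, not part of flexmark: decides whether a URL needs a scheme.
final class ProtocolRelativeUrls {
    private ProtocolRelativeUrls() {}

    /**
     * Returns the URL with "https:" prepended when it is protocol-relative
     * ("//host/path"); any other URL, including "/path", is returned unchanged.
     */
    static String withHttps(String url) {
        if (url != null && url.startsWith("//")) {
            return "https:" + url;
        }
        return url;
    }
}
```

The same startsWith("//") test could be used in place of startsWith("/") in the snippet above if only protocol-relative URLs should be rewritten.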

gaofeiseu commented 5 years ago

@vsch, thanks a lot for your patience! Using the standard HTML parser to get Markdown from the HTML content is what I have already done. Do you mean that I need to go on to parse the Markdown to HTML and replace the URLs with a method similar to the one in your demo code, FormatterWithMods.java, and then parse the resulting HTML back to Markdown? I agree this would be a solution, but as you can see, it requires too many conversions between HTML and Markdown. Is there a more lightweight solution with fewer conversions, going directly from HTML to Markdown?

vsch commented 5 years ago

@gaofeiseu, what you need to do is simply convert the HTML to Markdown, parse the Markdown to an AST, replace the URLs in the AST, and render the AST as Markdown using the formatter. This combines the two samples I mentioned into a single process; there is no conversion back to HTML.

The modified FormatterWithMods sample shows the needed steps: FormatterWithMods2.java

The current version of the HTML to Markdown implementation is not extensible, so there is no easy way to modify the Markdown it generates. I am working on a new version that supports extensions, similar to the HTML Renderer and Markdown Formatter. It will allow some customization of the generated Markdown without re-parsing it, but it is not yet available.
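If re-parsing is too heavy for a particular use, one lighter option is a plain text pass over the generated Markdown. This is not the AST-based approach described above, only a rough regex sketch that uses nothing but the Java standard library. It assumes the URLs appear as inline link or image destinations of the form ](//host/path), and it will also rewrite that pattern inside code spans or code blocks. The class and method names are hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of flexmark.
final class MarkdownUrlFixer {
    // Matches the start of an inline link or image destination that is
    // protocol-relative, i.e. the literal text "](//".
    private static final Pattern PROTOCOL_RELATIVE = Pattern.compile("\\]\\(//");

    private MarkdownUrlFixer() {}

    /** Rewrites "](//host/path)" to "](https://host/path)" throughout the text. */
    static String addHttps(String markdown) {
        return PROTOCOL_RELATIVE.matcher(markdown).replaceAll("](https://");
    }
}
```

Calling MarkdownUrlFixer.addHttps on the Markdown string returned by the HTML converter leaves site-relative and absolute URLs unchanged. Reference-style links are not covered, which is one reason the AST route is more robust.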

vsch commented 5 years ago

@gaofeiseu, a new module with an extension API for HTML to Markdown conversion has been implemented.

See #313, last comment has a link to a sample which modifies some link URLs during conversion.