markdown-it-rust / markdown-it

markdown-it js library rewritten in rust

Rewrite image links in markdown #36

Closed: Nutomic closed this issue 10 months ago

Nutomic commented 10 months ago

Hi, I'm the maintainer of Lemmy, which is a web server for social media. In this project users can submit markdown posts through the API, and also retrieve posts in markdown format. This includes support for embedded images.

Now we need a way to rewrite markdown image links so that they go through a local proxy, which avoids users connecting directly to external servers, for privacy reasons. Essentially it requires a way to parse markdown, alter image links and then write to markdown again.

Is such functionality possible with markdown-it? Or is there another library that is preferable for this use case?

rlidwka commented 10 months ago

> Essentially it requires a way to parse markdown, alter image links and then write to markdown again.

At this point, we can't write back to markdown, sorry. See https://github.com/markdown-it-rust/markdown-it/issues/30 for details.

But you don't need to write back to markdown. You only need to replace links. That's a much easier problem.

You can take any library that supports char-by-char source maps (this one does, as well as others), then take the original text and replace each link at its reported position.

A quick and dirty example:

use markdown_it::parser::inline::Text;
use markdown_it::plugins::cmark::inline::image::Image;

fn main() {
    let mut src = String::from("any markdown ![link](url) here, and ![another](url) here");
    let md = &mut markdown_it::MarkdownIt::new();
    markdown_it::plugins::cmark::add(md);

    let ast = md.parse(&src);
    let mut links = vec![];

    ast.walk(|node, _depth| {
        // this is wrong, but it's just a demo
        // unhandled case 1: children are empty, ![](link)
        // unhandled case 2: many children, ![*hello* bar](link)
        if node.is::<Image>() && node.children.len() == 1 && node.children[0].is::<Text>() {
            let outer_srcmap = node.srcmap.unwrap();
            let inner_srcmap = node.children[0].srcmap.unwrap();

            links.push((inner_srcmap.get_byte_offsets().1 + 2, outer_srcmap.get_byte_offsets().1 - 1));
        }
    });

    while let Some((start, end)) = links.pop() {
        src.replace_range(start..end, "XXX");
    }

    println!("{}", src);
    // any markdown ![link](XXX) here, and ![another](XXX) here
}
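
The demo replaces links with `links.pop()`, so it works from the last link back to the first. That order matters: an edit near the end of the string does not shift the byte offsets of links that come before it. Below is a minimal sketch of that step alone, using only the standard library. The helper name `replace_ranges` is hypothetical and not part of markdown-it. It assumes the ranges do not overlap and fall on UTF-8 character boundaries.

```rust
/// Replace every byte range in `src` with `replacement`.
///
/// Assumptions (not checked here): ranges do not overlap and each bound
/// lies on a UTF-8 character boundary, otherwise `replace_range` panics.
fn replace_ranges(src: &str, ranges: &[(usize, usize)], replacement: &str) -> String {
    let mut out = String::from(src);

    // Sort by start offset so the caller may pass ranges in any order.
    let mut sorted = ranges.to_vec();
    sorted.sort_unstable();

    // Walk from the last range to the first: changing the tail of the
    // string leaves the offsets of all earlier ranges untouched.
    for &(start, end) in sorted.iter().rev() {
        out.replace_range(start..end, replacement);
    }
    out
}
```

Going front to back would also work, but then every offset after an edit has to be adjusted by the difference in length between the old and the new URL.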

Be aware that markdown is not a strict standard. The upstream parser (the one that converts to HTML) might have slightly different behavior or more extensions. Thus, there is a risk of carefully crafted image links slipping through unreplaced. For security-critical cases, this replacement has to be done at the convert-to-HTML level.

Nutomic commented 10 months ago

Thank you, that example is very helpful. I changed it slightly to use Image.url for calculating offsets. That way it seems to work very reliably based on the test cases. You can see the code here; I would appreciate it if you could have a look. Specifically, I would like to know if there are more cases to test for, and in which cases the Node.srcmap can be None.
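
One way to calculate offsets from the parsed URL could look like the sketch below. It is not the code linked above, and `ImageSpan` and `url_range` are hypothetical names: the struct only stands in for the outer source range of an image node plus its parsed URL. The idea is to search for the URL text inside the image's own source slice, so the alt text no longer needs to be a single text child.

```rust
/// Hypothetical stand-in for what the parser reports about one image:
/// the byte range of the whole `![alt](url)` construct and the parsed URL.
struct ImageSpan {
    start: usize,
    end: usize,
    url: String,
}

/// Find the byte range of the URL inside the image's source text.
///
/// Returns `None` if the span is invalid, the URL is empty, or the parsed
/// URL does not appear verbatim in the source (for example, if the parser
/// normalized or unescaped it).
fn url_range(src: &str, img: &ImageSpan) -> Option<(usize, usize)> {
    if img.url.is_empty() {
        return None;
    }
    let slice = src.get(img.start..img.end)?;

    // Search from the right, because the alt text comes first and may
    // contain the same string as the URL.
    let rel = slice.rfind(img.url.as_str())?;
    let start = img.start + rel;
    Some((start, start + img.url.len()))
}
```

This sketch still has a gap: if the image has a title that repeats the URL, as in `![a](u "u")`, the search from the right finds the title instead. A `None` result is best treated as "could not rewrite safely" rather than ignored.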

rlidwka commented 10 months ago

> and in which cases the Node.srcmap can be None

For the image rule, never (in fact, there's a test that checks for the presence of source maps in all tests).

In the general case, source maps won't make sense for nodes that are created out of thin air after parsing is complete. For example, an auto-generated list of footnotes at the bottom doesn't have any corresponding source text. None of these are implemented currently.
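
If such generated nodes do appear later, code that calls `unwrap()` on a source map would panic. A defensive variant could skip nodes that have no source map. The sketch below uses a hypothetical, simplified `Node` type rather than the real markdown-it node, purely to show the pattern.

```rust
/// Hypothetical, simplified node. `srcmap` is `None` for nodes that have
/// no corresponding source text (e.g. content generated after parsing).
struct Node {
    srcmap: Option<(usize, usize)>,
}

/// Collect the byte ranges of all nodes that carry a source map and
/// silently skip the ones that do not, instead of panicking on `unwrap()`.
fn mapped_ranges(nodes: &[Node]) -> Vec<(usize, usize)> {
    nodes.iter().filter_map(|node| node.srcmap).collect()
}
```

Whether skipping is the right choice depends on the use case. For a privacy proxy, an image that cannot be located in the source is arguably worth logging or rejecting rather than leaving it untouched.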

Nutomic commented 10 months ago

Great, thanks!