leeoniya / reMarked.js

client-side HTML > markdown
http://leeoniya.github.io/reMarked.js/

Add support for html tags if possible. #26

Closed ghost closed 11 years ago

ghost commented 11 years ago

I tested inline divs and iframes, and remarked does not support them. GitHub Flavored Markdown allows inline html tags to be used, and it would be nice if remarked supported them as well, seeing as it already supports gfm tables.

leeoniya commented 11 years ago

this is an interesting issue. i agree that it would be good to just output the html for unsupported tags in some cases, especially short, inline ones like canvas, video and audio, which are currently excluded here: https://github.com/leeoniya/reMarked.js/blob/master/reMarked.js#L158

"inline divs" makes no sense. divs are block level elements, they are not inline by definition. the situation with block-level elements is more complicated since they can be nested and are often long. how would you propose to render this markup:

<div class="abc" data-foo="def">
    <p>the quick brown fox jumps over the lazy dog</p>
</div>

or

<div class="abc" data-foo="def">
    some text
    <p>the quick brown fox jumps over the lazy dog</p>
    some other text
</div>

does this mean the paragraph tags will get rendered as HTML because it's inside a <div>? i am opposed to analyzing the contents of unsupported tags to render portions as markdown within wrapped html tags. the whole thing would turn into an improperly indented, misaligned mess.

keep in mind, markdown is designed for fairly flat documents. it is not meant to replace complex html. while all markdown can be converted to valid html, not all html produces good, readable markdown; the whole point of ditching html for md is readability, which would rapidly degrade if all unrecognized innerHTML was dumped into the output.

leeoniya commented 11 years ago

can you provide a full example of your html and what you expect the output to be?

ghost commented 11 years ago

"i am opposed to analyzing the contents of unsupported tags to render portions as markdown within wrapped html tags"

Exactly. This is just what I am saying. For instance consider this tag

<iframe width="640" height="480" src="//www.youtube.com/embed/8YRdxHHFKvQ" frameborder="0" allowfullscreen>

</iframe>

Markdown does not support iframes. So skip analyzing it and render it as it is. Same for the divs.

What I am saying is whenever you encounter an unsupported item stop analyzing any further and just give back the content.

But things can get more complicated. For example

<blockquote>

<p> I am a harmless paragraph</p>

<div>I am an unsupported division. <p>I am another harmless paragraph</p></div>

</blockquote>

Now remarked can keep the same logic for this tag. It will analyze the blockquote, see the supported paragraph and render it, then find that the div is unsupported and leave it as it is. So the final output could be

>  I am a harmless paragraph. <div>I am an unsupported division. <p>I am another harmless paragraph</p></div>

I think that any user who is using html inside markdown does not expect it to be converted anyway. I don't know if this stuff can be easily implemented or not though.
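The pass-through rule proposed above can be sketched roughly as follows. This is only an illustration and not reMarked's code: the node shape (`{ tag, children }` with plain strings for text), the `SUPPORTED` set, and the `toHtml` and `render` helpers are all assumptions made up for this example, and attributes are left out for brevity.

```javascript
// Hypothetical node shape: a string is text; otherwise { tag, children }.
const SUPPORTED = new Set(["p", "blockquote", "em"]);

// Serialize a node back to HTML unchanged (attributes omitted for brevity).
function toHtml(node) {
  if (typeof node === "string") return node;
  return "<" + node.tag + ">" + node.children.map(toHtml).join("") + "</" + node.tag + ">";
}

// Convert supported tags to markdown; at the first unsupported tag,
// stop analyzing and hand back its HTML as-is.
function render(node) {
  if (typeof node === "string") return node;
  if (!SUPPORTED.has(node.tag)) return toHtml(node);
  const separator = node.tag === "blockquote" ? "\n\n" : "";
  const inner = node.children.map(render).join(separator);
  switch (node.tag) {
    case "em":
      return "*" + inner + "*";
    case "blockquote":
      // Prefix every line, using a bare ">" for the blank separator lines.
      return inner.split("\n").map(line => (line ? "> " + line : ">")).join("\n");
    default: // "p"
      return inner;
  }
}

const tree = {
  tag: "blockquote",
  children: [
    { tag: "p", children: ["I am a harmless paragraph"] },
    {
      tag: "div",
      children: [
        "I am an unsupported division. ",
        { tag: "p", children: ["I am another harmless paragraph"] },
      ],
    },
  ],
};

const markdown = render(tree);
console.log(markdown);
```

In this sketch the paragraph nested inside the div stays as `<p>...</p>`, because nothing below an unsupported tag is analyzed.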

ghost commented 11 years ago

By the way, thank you so much for this library. It is by far the best reverse markdown library that I have encountered. I am using it extensively in my own project, and what you wrote was a life saver.

leeoniya commented 11 years ago

glad you're finding it useful. just outputting the innerHTML of unsupported tags is easy to add but as long as we're discussing this issue, i'd prefer to make changes that are more tweakable and less prescriptivist. here are some ideas that came to mind.

i would prefer to have childNodes of unsupported tags still get parsed and converted to markdown. so...

<div class="styled">
    <p>the quick brown fox jumps over the lazy dog</p>
    <p>the quick brown fox jumps over the lazy dog</p>
</div>

would become something like

<div class="styled">
    the quick brown fox jumps over the lazy dog

    the quick brown fox jumps over the lazy dog
</div>

and

<span class="styled">the <em>quick</em> brown fox jumps over the lazy dog</span>

to

<span class="styled">the *quick* brown fox jumps over the lazy dog</span>
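A minimal sketch of this "parse the kids, keep the wrapper" idea, under stated assumptions: the node shape (`{ tag, attrs, children }`) and the `openTag` and `renderKeepWrapper` helpers are invented for illustration and are not taken from reMarked, and only `em` and `strong` are treated as supported tags here.

```javascript
// Hypothetical node shape: a string is text; otherwise { tag, attrs, children }.

// Rebuild the opening tag, including its attributes.
function openTag(node) {
  const attrs = Object.entries(node.attrs || {})
    .map(([name, value]) => " " + name + '="' + value + '"')
    .join("");
  return "<" + node.tag + attrs + ">";
}

// Children are always converted; an unsupported tag keeps its own
// opening and closing tags around the converted children.
function renderKeepWrapper(node) {
  if (typeof node === "string") return node;
  const inner = node.children.map(renderKeepWrapper).join("");
  if (node.tag === "em") return "*" + inner + "*";
  if (node.tag === "strong") return "**" + inner + "**";
  return openTag(node) + inner + "</" + node.tag + ">";
}

const span = {
  tag: "span",
  attrs: { class: "styled" },
  children: ["the ", { tag: "em", children: ["quick"] }, " brown fox jumps over the lazy dog"],
};

const out = renderKeepWrapper(span);
console.log(out);
```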

but also i want to leave it up to the user, so i'm thinking of adding something like this to the config:


/* handling of unsupported tags
    0 - ignore, no output
    1 - output full innerHTML
    2 - assert inl/blk, parse kids, retain own tags/attr
    3 - assert inl/blk, parse kids
    "p","tblk","inl"...etc - remap to internally defined type
*/
unsup_tags: {
    "*":      2,    // default
    "script": 0,    // unscripted
    "style":  0,    // no style
},

// use getComputedStyle instead of hardcoded tag list to discern block/inline
comp_style: false,

i've been toying with using getComputedStyle to assert inline or block rather than the current hardcoded taglist in a regex. this has the benefit of handling all current and future html tags as they are displayed, accounting for any user css adjustments.

but it requires a deeper DOM dependency with a layout engine (which might be bad for node) and certainly has worse performance (though it's likely not anything noticeable). also, chrome needs you to physically insert the nodes into the document for this to work, which adds overhead if the input is a string of html.
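The getComputedStyle approach described above might look roughly like the sketch below. The function name `isBlockLevel` and the choice to pass in a Window-like `win` object are assumptions made for this example, and the sketch further assumes that the top of a detached tree is an ordinary element rather than a document fragment or shadow root.

```javascript
// Sketch: decide block vs. inline from the computed `display` value
// instead of a hardcoded tag list. `win` is a Window-like object,
// for example the browser's `window`.
function isBlockLevel(el, win) {
  // Walk up to the top of the tree that contains `el`.
  let top = el;
  while (top.parentNode) top = top.parentNode;

  // Detached nodes may report an empty computed style, so attach the
  // detached tree for the duration of the measurement.
  const detached = top !== win.document;
  if (detached) win.document.body.appendChild(top);
  try {
    const display = win.getComputedStyle(el).display;
    // Anything that is not inline, inline-block, etc. is treated as block here.
    return !display.startsWith("inline");
  } finally {
    if (detached) win.document.body.removeChild(top);
  }
}

// Browser-only usage; skipped where no `window` exists (for example in node).
if (typeof window !== "undefined") {
  console.log(isBlockLevel(document.createElement("div"), window));
}
```

The temporary attach and detach step is the overhead mentioned above for input that arrives as a string of html.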

ghost commented 11 years ago

Well, it would be nice to be able to tweak the parsing of unsupported tags. But it should not come at the cost of making things murky in the code. If you have a simple, straightforward way to implement this, then this feature would be wonderful.

I also wanted to bring your attention to the to-markdown library. It uses jquery as a dependency, and I think using a dom manipulation library could save you a lot of headache.

I say go for the simplest solution first. If it is simpler to just throw away html do it. Once that is done and tested think about tweaking the behavior and providing options.

leeoniya commented 11 years ago

i've seen to-markdown but haven't had a chance to test how it performs compared to reMarked. it's certainly much smaller if you disregard the elephant in the room (jquery). looking at the source, it uses a lot of long regexes to detect nested matching tags, which is always trouble. from the demo page it performs reasonably well on simple cases, but trips up on more complex examples.

there's absolutely no reason to have jquery as a reMarked.js dependency. the DOM walking that's currently done takes up 5 lines of loop code (< 100 bytes) and performs faster than jquery.
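For a sense of scale, a plain DOM walk without any library can be sketched in a handful of lines. This is not the loop from reMarked.js, only a generic depth-first walk over `firstChild` and `nextSibling`; the `el` helper is a hypothetical stand-in that links plain objects so the sketch also runs outside a browser.

```javascript
// Depth-first walk over a DOM-like tree using only firstChild / nextSibling.
function walk(node, visit) {
  visit(node);
  for (let child = node.firstChild; child; child = child.nextSibling) {
    walk(child, visit);
  }
}

// Hypothetical helper: link plain objects so they expose the same two
// properties a real DOM node would.
function el(nodeName, ...children) {
  const node = { nodeName, firstChild: children[0] || null, nextSibling: null };
  children.forEach((child, i) => {
    child.nextSibling = children[i + 1] || null;
  });
  return node;
}

const tree = el("DIV", el("P", el("#text")), el("EM"));
const seen = [];
walk(tree, node => seen.push(node.nodeName));
console.log(seen.join(" "));
```

In a browser the same `walk` function can be handed a real element, since DOM nodes expose `firstChild` and `nextSibling` directly.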

i'll look into implementing unsupported tags behavior in the next few days.

ghost commented 11 years ago

Yea to-markdown also does not support github flavored markdown. Thanks a lot for considering this request.

leeoniya commented 11 years ago

so this should now be possible, though the default config may not always give you what you're looking for. you'll notice additional options for how unsupported tags are handled:

// handling of unsupported tags, defined in terms of desired output style. if not listed, output = outerHTML
unsup_tags: {
    // no output
    ignore: "script style noscript",
    // eg: "<tag>some content</tag>"
    inline: "span sup sub i u b center big",
    // eg: "\n\n<tag>\n\tsome content\n</tag>"
    block2: "div form fieldset dl header footer address article aside figure hgroup section",
    // eg: "\n<tag>some content</tag>"
    block1c: "dt dd caption legend figcaption output",
    // eg: "\n\n<tag>some content</tag>"
    block2c: "canvas audio video iframe",
}

all tags you see listed (except those in ignore) will parse and attempt to convert childNodes to markdown and render them as their grouping indicates. tags that are not listed here (and also not supported by markdown) are output as outerHTML without parsing children. these unknown elements will be assumed to be block-level and get rendered on their own lines.
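One way to picture the grouping rule is a lookup from tag name to group, with unlisted tags falling through to outerHTML. The sketch below copies the group strings from the config above; `buildLookup`, `groupFor`, and the `"outerHTML"` label are names invented for this illustration and say nothing about how reMarked stores the groups internally.

```javascript
// Group strings copied from the config shown above.
const unsupTags = {
  ignore: "script style noscript",
  inline: "span sup sub i u b center big",
  block2: "div form fieldset dl header footer address article aside figure hgroup section",
  block1c: "dt dd caption legend figcaption output",
  block2c: "canvas audio video iframe",
};

// Invert the config into a tag -> group lookup.
function buildLookup(config) {
  const lookup = new Map();
  for (const [group, tags] of Object.entries(config)) {
    // An empty string yields no tags for that group.
    for (const tag of tags.split(/\s+/).filter(Boolean)) {
      lookup.set(tag, group);
    }
  }
  return lookup;
}

// Tags that appear in no group fall back to the "outerHTML" label.
function groupFor(tag, lookup) {
  return lookup.get(tag.toLowerCase()) || "outerHTML";
}

const lookup = buildLookup(unsupTags);
for (const tag of ["iframe", "DIV", "script", "marquee"]) {
  console.log(tag + " -> " + groupFor(tag, lookup));
}
```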

if you prefer for all unsupported tags to remain unprocessed, simply pass an override in the config:

unsup_tags: {
    ignore: "script style noscript",
    inline: "",
    block2: "",
    block1c: "",
    block2c: "",
}

this will output outerHTML of every unknown tag on a new line.

ghost commented 11 years ago

Thank you. So I am closing this issue.