mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

Entities in non-trivial contexts #5

Closed by mity 7 years ago

mity commented 7 years ago

Entities which are not inside a normal text flow are not translated.

This includes these situations:

It is responsible for the following CommonMark 0.27 test failures.

mity commented 7 years ago

This issue is also partially related to the discussion in jgm/CommonMark#442.

mity commented 7 years ago

Possible solutions:

Option 1: Change API

Change the API of MD4C so that the renderer provides an entity translation (not rendering) callback which can be called at any time and which returns the translated characters to the parser, which may then pass them back to the renderer through the MD_RENDERER::text() callback as well as in any string in a detail structure such as MD_SPAN_A_DETAIL::title.
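
A rough sketch of what such a callback might look like (the type name and the exact signature below are purely illustrative, not an existing part of the MD4C API):

/* Hypothetical callback provided by the application/renderer. The parser
 * would invoke it whenever it recognizes "&name;" or "&#123;", in normal
 * text as well as inside strings like a link title, and the application
 * would write the translated characters into the output buffer. */
typedef int (*MD_TRANSLATE_ENTITY_FN)(const MD_CHAR* entity, MD_SIZE entity_size,
                                      MD_CHAR* output, MD_SIZE* output_size,
                                      void* userdata);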

Advantages:

Disadvantages:

Option 2: Ignore the issue

Rationale: I don't currently think there are many people or documents that really use entities inside those contexts.

Advantages:

Disadvantages:

Option 3: Make the renderer do all the work

Advantages:

tin-pot commented 7 years ago

I tend to think that option 2 is actually the best - and not (only ;-) because it is the simplest to implement.

  1. Entity (and character) references in attribute value literals inside HTML tags are untouched by the processor anyway, according to the spec (as they should be, IMO). Thus any subsequent processor of CommonMark's output, like a browser, must be prepared to deal with them in any case. For text that ends up in such attribute value literals, like the link title, I thus see not much value in replacing these references.

  2. What the info string is and means, and how to interpret it, is highly dependent on the local setup. I doubt that fancier characters (not available on a local keyboard) are of much use there, at least for typical uses like indicating a programming language to guide syntax highlighting, but I could be wrong of course. In any case, the info string would probably end up in a CDATA attribute as well, so the same interpretation rules apply here; this could be a reason to protect & (and, for XML, < too) when placing the info string in an output value literal, i.e., treat the info string like a code span.

  3. On the issue of URL strings I'm not sure: there's already the URL-escaping mechanism in place, a backslash is treated as non-significant (Ex. 471), percent signs are treated specially (Ex. 472), and so on. Bringing reference substitution into this game seems to add more complexity (for users and implementors) than it satisfies real needs. Again, I may be wrong; what would be a good use case for it? Note also that URLs frequently contain syntactically important & characters right in front of letters, which the user might want to type as-is.

    But, and that's probably the most common use case: if the user wants to copy-paste a URL into the CommonMark text, rules applicable to link destination text should not hinder this, I think.

mity commented 7 years ago

Well, I think I have almost made up my mind about this. But let's see if anyone has some more comments or arguments about it.

I generally agree with @tin-pot: this is quite a strange property of CommonMark, and I almost consider it a mis-feature of the standard; especially the argument about copy&pasting a URL is very convincing to me.

Additionally, I do not like making the MD4C interface way more complicated because of a corner case like this, as that would make both MD4C and all renderers more complicated even if they don't want to handle it.

On the other side, standards are standards, and being able to pass the standard's test suite has a value of its own.

Therefore I think I will implement it in the md2html utility alone, without changing the parser. I even think this might not be the default behavior of md2html; it would have to be turned on with a new command line option (e.g. --fpedantic), and that option would then be used from scripts/run-tests.sh.

Opinions?

tin-pot commented 7 years ago

On the other side, standards are standards, and being able to pass the standard's test suite has a value of its own.

On the other hand … No, just kidding. But the discussion about entity references in CommonMark is not settled yet, so maybe it's not yet time to pull one's hair out about conformance in this regard (unless these are the only remaining conformance test failures?).

I find the partition of responsibility between the MD4C parser itself and the md2html utility quite reasonable as it is right now.

If I understand you right, you intend to do scanning for and replacement of entity (and character?) references inside md2html, for certain strings (only) that arrive from the parser, like MD_SPAN_A_DETAIL::title? That should not be too hard to do.

Or the parser could delimit references in these strings not with the usual ERO (&), CRO (&#), REFC (;) delimiters, but with, say, some single-character delimiters from the Unicode PUA set? (Or even control characters from the C0 set like U+000E SHIFT OUT/U+000F SHIFT IN.) This way, no special "escapes" like &amp; would be needed, and md2html could just use strstr() (resp. wcschr()) to find these delimiters. I used similar conventions in my cm2doc tool and found it quite practical.
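
To make the idea concrete, a scan on the md2html side could look roughly like the sketch below (the delimiter choice and the callbacks are made up for illustration; this is not actual MD4C or md2html code):

#include <string.h>

#define REF_START '\x0E'   /* U+000E SHIFT OUT: hypothetical "reference starts here" marker */
#define REF_END   '\x0F'   /* U+000F SHIFT IN:  hypothetical "reference ends here" marker */

/* Walk a NUL-terminated string in which the references have been pre-delimited,
 * calling on_text() for plain pieces and on_ref() for the reference names
 * found between the two markers. */
static void scan_delimited(const char* s,
                           void (*on_text)(const char*, size_t),
                           void (*on_ref)(const char*, size_t))
{
    while (*s != '\0') {
        const char* start = strchr(s, REF_START);
        const char* end;

        if (start == NULL) {                        /* no more references */
            on_text(s, strlen(s));
            break;
        }
        if (start > s)                              /* plain text before the marker */
            on_text(s, (size_t)(start - s));
        end = strchr(start + 1, REF_END);
        if (end == NULL) {                          /* malformed; emit the rest as text */
            on_text(start, strlen(start));
            break;
        }
        on_ref(start + 1, (size_t)(end - start - 1));
        s = end + 1;
    }
}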

mity commented 7 years ago

unless these are the only remaining conformance test failures?

Currently, there are just 6 failures. 5 of them are this issue; the last one is covered by issue #6. That one is also controversial and IMHO also quite a corner case.

If I understand you right, you intend to do scanning for and replacement of entity (and character?) references inside md2html, for certain strings (only) that arrive from the parser, like MD_SPAN_A_DETAIL::title? That should not be too hard to do.

Exactly. Ditto for MD_SPAN_A_DETAIL::href, MD_SPAN_IMG_DETAIL::src, MD_SPAN_IMG_DETAIL::title and MD_BLOCK_CODE_DETAIL::info (or rather its ::lang part, as ::info itself is not used anywhere in md2html).
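
The scanning step itself would be roughly the following (an illustrative sketch only, not md2html code; a real implementation would additionally validate the name against the known entity list):

#include <ctype.h>
#include <stddef.h>

/* Return the length of a candidate reference "&name;" or "&#digits;" starting
 * at s[0], or 0 if the text does not look like a reference at all. */
static size_t reference_length(const char* s, size_t size)
{
    size_t i, start;

    if (size < 3 || s[0] != '&')
        return 0;
    i = start = (s[1] == '#') ? 2 : 1;
    while (i < size && (s[1] == '#' ? isdigit((unsigned char) s[i])
                                    : isalnum((unsigned char) s[i])))
        i++;
    if (i == start || i >= size || s[i] != ';')
        return 0;
    return i + 1;
}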

mity commented 7 years ago

Or the parser could delimit references in these strings not with the usual ERO (&), CRO (&#), REFC (;) delimiters, but with, say, some single-character delimiters from the Unicode PUA set? (Or even control characters from the C0 set like U+000E SHIFT OUT/U+000F SHIFT IN.) This way, no special "escapes" like &amp; would be needed, and md2html could just use strstr() (resp. wcschr()) to find these delimiters. I used similar conventions in my cm2doc tool and found it quite practical.

I don't think so. A renderer which does not want to do the translation in those contexts should not get the string modified in such a way, and MD4C should not ask the renderer what it wants to do with the string.

mity commented 7 years ago

But your comments make me wonder whether the outlined solution is really reasonable. The renderer does not know whether it can replace any &...; because it does not know whether the & was or was not escaped, as MD4C resolves escape sequences in at least some of those contexts...

tin-pot commented 7 years ago

A renderer which does not want to do the translation in those contexts should not get the string modified in such a way, and MD4C should not ask the renderer what it wants to do with the string.

Fair enough. I thought of it as part of the interface contract, just like it is now that references in content are passed to the application using the (*text) callback with the MD_TEXT_ENTITY enumerator. A renderer would have to decorate this with & and ; in any case (or something equivalent), if it does not want to do the translation: those references are not just the same as text. And IMO the same holds for references inside text that is passed as MD_CHAR* in one piece.


[Quoting @mity's next comment:]

The renderer does not know whether it can replace any &...; because it does not know whether the & was or was not escaped, as MD4C resolves escape sequences in at least some of those contexts...

If I understand this remark correctly, this was my point above: deciding which occurrences of & in which strings actually do start a reference is harder than just finding a dedicated character.

But by "context" you seem to mean a whole string: for each of the various MD_CHAR * items that a renderer may evaluate, it should somehow, somewhere be specified whether or not that string could contain references that should be replaced (by the renderer or whatever tool comes later, like the browser).

And if no replacement should be performed for some of these strings (maybe for URLs), a renderer would still have to "HTML-escape" those strings before placing them into attribute value literals, like the one for the href or src attributes, right?

mity commented 7 years ago

The renderer does not know whether it can replace any &...; because it does not know whether the & was or was not escaped, as MD4C resolves escape sequences in at least some of those contexts...

If I understand this remark correctly, this was my point above: deciding which occurrences of & in which strings actually do start a reference is harder than just finding a dedicated character

Maybe a little bit hidden as implicit in your comment, but yes, I guess so.

But by "context" you seem to mean a whole string: for each of the various MD_CHAR * items that a renderer may evaluate, it should somehow, somewhere be specified whether or not that string could contain references that should be replaced (by the renderer or whatever tool comes later, like the browser).

By "context", I mean both (1) where the entity is in the input with respect to various Markdown syntax constructs and (2) how it is propagated to the renderer. But note these map to each other 1:1.

But that brings me to a thought that we actually have two closely related, but still distinct issues here:

  1. The API is not ready to propagate an entity (or information about it) in any of those non-normal-text situations.
  2. The application/renderer may or may not want to perform the translation in some of the contexts, and may behave differently in other contexts.

For example, I can very well imagine a renderer which may not want to do it in the case of a URL (given your argument about copy&pasting it into a Markdown document) but at the same time wants to do it in other situations, e.g. in the case of the link title. (Why should a Markdown document author be forbidden to use an entity there?)

And if no replacement should be performed for some of these strings (maybe for URLs), a renderer would still have to "HTML-escape" those strings before placing them into attribute value literals, like the one for the href or src attributes, right?

Yes, but I see this as unrelated. HTML-escaping is always the responsibility of a renderer (one which produces HTML output). Similarly, a renderer targeting another format may need to escape other character sequences colliding with its syntax. This is purely on the output/renderer side.
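
For concreteness, the output-side escaping meant here is roughly the following (a minimal sketch; the function name is made up and this is not md2html's actual code):

#include <stdio.h>

/* Escape the characters which are significant inside an HTML attribute value
 * literal (or element content) before writing the string to the output. */
static void write_html_escaped(FILE* out, const char* s, size_t size)
{
    size_t i;

    for (i = 0; i < size; i++) {
        switch (s[i]) {
            case '&':  fputs("&amp;", out);  break;
            case '<':  fputs("&lt;", out);   break;
            case '>':  fputs("&gt;", out);   break;
            case '"':  fputs("&quot;", out); break;
            default:   fputc(s[i], out);     break;
        }
    }
}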

In contrast, the entity, as defined by CommonMark, is on the input side and must therefore be in some way understood and handled by the parser (while ideally giving a renderer enough flexibility and liberty to deal with it as it wishes).

EDIT: By "understood and handled" I mean "recognized and the info about it propagated to the renderer". The parser should IMHO not do the translation itself.

tin-pot commented 7 years ago

But that brings me to a thought that we actually have two closely related, but still distinct issues here:

  1. The API is not ready to propagate an entity (or information about it) in any of those non-normal-text situations.
  2. The application/renderer may or may not want to perform the translation in some of the contexts, and may behave differently in other contexts.

That's exactly the heart of this matter: References may occur in "normal text" (where they are transmitted using a single call to (*text) with the MD_TEXT_ENTITY enumerator to indicate "here's the name of a referenced entity"), while in other texts (transmitted as a MD_CHAR[] string) they must be delimited in some other way and detected "actively" by the renderer.

However, in the latter case the text is otherwise "flat"—it is just a sequence of characters with the occasional reference in it. That's why I find it appropriate to send the whole text string in one go (with a single callback), instead of having multiple callback functions called sequentially.

Another API design could accompany the MD_CHAR[] string with a second array size_t[], which holds pairs of (offset, length) values to point out where the entities are in the text string (in ascending offset order). In this design, a renderer not interested in replacing or otherwise processing these references could just ignore the size_t[] array. On the other hand, actually visiting and processing the references would be very easy and fast if the renderer wishes to do so.
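
A minimal sketch of that alternative layout (all names below are made up for illustration only):

/* The string plus a flat array of (offset, length) pairs pointing out where
 * the entity references are inside it, in ascending offset order. */
typedef struct {
    const MD_CHAR* text;
    MD_SIZE        size;
    const size_t*  ref_spans;   /* 2 * n_refs values: offset0, length0, offset1, length1, ... */
    size_t         n_refs;      /* number of (offset, length) pairs */
} ENTITY_SPANS;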

mity commented 7 years ago

Ok. Going to think more about it.

But if we change the API in a similar way, it should IMHO provide more flexibility/extendability.

Imagine e.g. possible extensions which might want to perform a similar kind of translation for other kinds of things. Consider e.g. SmartyPants, some kinds of emoticon/emoji support, etc.

mity commented 7 years ago

Also, I tend to think we should then remove MD_TEXT_ENTITY and treat it the same way everywhere, so the API should provide a reasonable and easy-to-understand interface for it.

tin-pot commented 7 years ago

Ok. Going to think more about it.

But if we change the API in a similar way, it should IMHO provide more flexibility/extendability.

Just to give some context (this is where I come from): the same issue comes up with regard to the API or serialization of parsed SGML documents: the parser must transmit CDATA attribute values in such a way that the application can see what was once an SDATA entity reference:

For CDATA attributes, references to SDATA entities in attribute value literals are resolved. The replacement text is distinguished from the surrounding text and identified as an individual SDATA entity.

That closely mirrors our issue here (an SDATA entity is just a special kind of general entity). And, for example, the YASP parser API uses a construct similar to what I proposed above (I just looked it up a bit later): basically, together with a character buffer (the MD_CHAR[] array in our case), a "descriptor" array is provided which consists of (tag, length) pairs. A NULL tag value (the tags are actually pointers) means: length ordinary, plain-text data characters. Other tag values indicate (in YASP) replacement text from CDATA and SDATA entities.

In a CommonMark interface, a construct like this could distinguish portions in a MD_CHAR[] string as (unresolved) entity references, character references, "SmartyPants" items, emojis, etc. Basically any partition of the whole string into tagged, non-overlapping, sequential segments.

mity commented 7 years ago

@tin-pot Thanks for the discussion. It helped me to arrange my thoughts about the issue a lot.

I'm going to implement it by introducing new type into the API:

/* String attribute. 
 *
 * This wraps strings which are outside of a normal text flow, but which still
 * may contain string portions of different types like e.g. entities.
 */
typedef struct MD_ATTRIBUTE_tag MD_ATTRIBUTE;
struct MD_ATTRIBUTE_tag {
    const MD_CHAR* text;
    MD_SIZE size;

    /* Sequence of text types covering the whole string size.
     * Note that these invariants always hold:
     *  -- substr_offsets[0] == 0
     *  -- substr_offsets[LAST+1] == size
     *
     * So, for example, let's consider an image whose title attribute string
     * is set to "foo &quot; bar". (Note the string size is 14.)
     *
     * Then:
     *  -- [0]: "foo "   (substr_types[0] == MD_TEXT_NORMAL; substr_offsets[0] == 0)
     *  -- [1]: "&quot;" (substr_types[1] == MD_TEXT_ENTITY; substr_offsets[1] == 4)
     *  -- [2]: " bar"   (substr_types[2] == MD_TEXT_NORMAL; substr_offsets[2] == 10)
     *  -- [3]: (n/a)    (n/a                              ; substr_offsets[3] == 14)
     */
    const MD_TEXTTYPE* substr_types;
    const MD_OFFSET* substr_offsets;
};

And all the relevant detail structures MD_BLOCK_xxx_DETAIL and MD_SPAN_xxx_DETAIL shall use that instead of e.g. the ::href + ::href_size pair.
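
For example, MD_SPAN_A_DETAIL would then look roughly like this (a sketch of the intended shape, not necessarily the final declaration):

typedef struct MD_SPAN_A_DETAIL_tag MD_SPAN_A_DETAIL;
struct MD_SPAN_A_DETAIL_tag {
    MD_ATTRIBUTE href;
    MD_ATTRIBUTE title;
};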

This gives a renderer all the info and the liberty to easily use the string as a whole (without the translation), or to iterate over the substrings of different types. It is also easily extensible if/when we implement new things like SmartyPants or emoji (as optional extensions, of course, as they are not part of the CommonMark spec).

tin-pot commented 7 years ago

This looks alright to me. According to your description, a renderer could process all the portions (or chunks, or segments, ...) in the text like this, right?

/*
 * Process one "chunk" of text, given its type.
 */
void process(MD_TEXTTYPE, const MD_CHAR[], MD_SIZE);

const MD_ATTRIBUTE *pattr; /* Passed in from the parser */

/*
 * Loop over chunks in `*pattr`, and process them.
 */
MD_OFFSET ofs;
size_t    k;

for (k = 0; (ofs = pattr->substr_offsets[k]) < pattr->size; ++k) {
    const MD_CHAR *const chunkp  = pattr->text + ofs;
    const MD_SIZE        chunksz = pattr->substr_offsets[k+1] - ofs;
    const MD_TEXTTYPE    chunktp = pattr->substr_types[k];

    process(chunktp, chunkp, chunksz);
}

Using an array of sizes instead of offsets would lead to a loop construct like this. I'm not sure which is better; there's probably not that much difference …

MD_OFFSET ofs;
MD_SIZE   sz;
size_t    k;

for (k = 0, ofs = 0; ofs < pattr->size; ofs += sz, ++k) {
    const MD_CHAR *const chunkp  = pattr->text + ofs;
    const MD_SIZE        chunksz = (sz = pattr->substr_sizes[k]);
    const MD_TEXTTYPE    chunktp = pattr->substr_types[k];

    process(chunktp, chunkp, chunksz);
}

[ Edit: Well, on second thought, the first version, using offsets, looks a bit easier to write and understand. But I would be fine with either one of these API constructs. ]

mity commented 7 years ago

According to your description, a renderer could process all the portions (or chunks, or segments, ...) in the text like this, right? (skipped the code)

Yes, exactly.

Using an array of sizes instead of offsets would lead to a loop construct like this. I'm not sure which is better; there's probably not that much difference … (skipped the code)

I prefer the offsets, as the app then does not need the accumulator. But that's just a matter of personal preference, and in this it's impossible to satisfy everybody...

tin-pot commented 7 years ago

Yes, accumulating seems less straightforward than just "computing" the size as ofs[k+1]-ofs[k].

I think I'd prefer that too; I just wanted a quick "test ride" with both API variants ;-)