jgm / djot

A light markup language
https://djot.net
MIT License

Intraword emphasis #101

Open matklad opened 1 year ago

matklad commented 1 year ago

See discussion at https://github.com/jgm/djot/discussions/94#discussioncomment-4127713

TL;DR: it's not clear if fan_tas_tic should contain an em. It does now.

The problem with such implicit emphasis inside words is that it tends to be greedy and worms its way into text which wasn't intended to be emphasized, like file_nam.es or large numbers 1_000_000 or urls http://use_dashes_in_urls.com.

An "obvious" fix is to require explicit syntax for the intraword case: fan{_tas_}tic. The problem is that it's unclear what constitutes a word. We don't want to use the unicode definition of a word, as it is our general principle that we don't want to bring in unicode tables. If we define a word as a run of characters other than ascii whitespace, some examples stop working, like *_nested_* or (*parens*) or ¡*Hola*!.

matklad commented 1 year ago

I think the fix for nested case is easy: we can require delimiter to be next to whitespace or a delimiter, recognized by djot. So, things like [_link text_](http://...) or "_quoted emph_" would work, and potentially (*parens*) (potentially because I don't think djot is recognizing () by itself, but it's natural to say that it does).

The Hola example won't work, and neither will things like **Really**???, but that seems niche (I'd expect adjacent punctuation to be inside emphasis in practice).
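A minimal Lua sketch of the rule proposed above may make the trade-off easier to see. It assumes that "a delimiter recognized by djot" means a small ASCII set; the names `FLANK`, `is_boundary`, `can_open` and `can_close`, and the exact membership of the set, are made up for illustration and are not taken from djot's source.

```lua
-- Hypothetical flanking set: ASCII whitespace plus a few ASCII delimiters.
-- The exact membership is an assumption made for this sketch.
local FLANK = "^[%s%*_%[%]%(%){}\"']"

-- True when `pos` lies outside the text or holds a character from FLANK.
local function is_boundary(s, pos)
  return pos < 1 or pos > #s or s:find(FLANK, pos) ~= nil
end

-- An opener needs a boundary before it and non-whitespace after it.
local function can_open(s, pos)
  return is_boundary(s, pos - 1) and s:find("^%S", pos + 1) ~= nil
end

-- A closer needs non-whitespace before it and a boundary after it.
local function can_close(s, pos)
  return pos > 1 and s:find("^%S", pos - 1) ~= nil and is_boundary(s, pos + 1)
end

print(can_open("(*parens*)", 2), can_close("(*parens*)", 9))  -- true  true
print(can_open("fan_tas_tic", 4))                             -- false
-- "\194\161" is the UTF-8 encoding of the inverted exclamation mark.
print(can_open("\194\161*Hola*!", 3))                         -- false
```

Under these assumptions the parenthesized case opens and closes, the intraword underscore does not open, and the Hola case fails because the byte before the asterisk is not in the ASCII set.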

Note that we have precedent for treating punctuation like words in attribute syntax: ¡Hola!{.emph}

bpj commented 1 year ago

See https://github.com/jgm/djot/discussions/94#discussioncomment-4138785

jgm commented 1 year ago

As @bpj noted in the linked thread, I think it's good practice simply to include identifiers or filenames containing underscores in verbatim backticks, which makes the problem go away.

> I think the fix for nested case is easy: we can require delimiter to be next to whitespace or a delimiter, recognized by djot.

That would mean that, for example “_Dog_” would fail to be emphasized (note the curly quotes). I worry that this sort of half-solution is going to be too fragile and lead to too many confused users.

matklad commented 1 year ago

> I worry that this sort of half-solution is going to be too fragile and lead to too many confused users.

I think there are two kinds of confusion: emphasis you wanted that doesn't appear, and emphasis you didn't want that does.

The first kind seems less perplexing than the second: you are doing emphasis, so, if it doesn't work for whatever reason, you can use the full syntax. In any case, emphasis is explicitly what you are trying to do.

The second kind is more confusing, because you are doing something completely unrelated, and now your code is in italics, and you need to make a mental jump from whatever you were doing to "oh, this is because I have some unrelated underscores".

That is, false positives seem worse here than false negatives.

With the current rules, in my blog porting exercise I've hit false positives several times and no false negatives (e.g., grep finds 10 \_ and zero {_).
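The same tally could be produced with a few lines of Lua instead of grep. This is only a sketch: `count_plain` and `count_markers` are hypothetical helpers, and they count raw occurrences without skipping verbatim spans or code blocks.

```lua
-- Count non-overlapping occurrences of the literal string `needle` in `text`.
local function count_plain(text, needle)
  local count, init = 0, 1
  while true do
    -- The fourth argument `true` turns off pattern matching.
    local s, e = text:find(needle, init, true)
    if not s then return count end
    count = count + 1
    init = e + 1
  end
end

-- Tally escaped underscores (fixed false positives) and explicit
-- openers (fixed false negatives).
local function count_markers(text)
  return count_plain(text, "\\_"), count_plain(text, "{_")
end

local escaped, explicit = count_markers("snake\\_case and 1\\_000 but no braces")
print(escaped, explicit)
```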

jgm commented 1 year ago

You may be right that the first kind of confusion is more common, but I don't know if it's harder to fix. Figuring out why "_Dog_" is emphasized but “_Dog_” isn't might be very confusing for users. Explaining to them that djot treats ascii punctuation differently from non-ascii punctuation might be pretty confusing. Adding a backslash escape isn't hard and always works for the first case.

matklad commented 1 year ago

Yeah, I don't think it's harder to fix; I think it might be harder to realize that a fix is needed.

In the first case, you try to create an emphasis, type “_Dog_”, it doesn't work, you try with explicit syntax.

In the second case, you are writing a nice number, like 33_550_336, glance at the output, and see 33550336 with the middle group in italics, which looks subtly wrong, but it's unclear why -- you are just inputting a number, right?

But that's 100% me trying to role-play; I don't have a good feeling for how it will all play out in practice.

jgm commented 1 year ago

The number case is a good one -- unlike filenames and identifiers, that's probably not something you'd put in verbatim backticks. (In what contexts are numbers written this way?)

matklad commented 1 year ago

> (In what contexts are numbers written this way?)

That’s my idiosyncratic way to say “a lot” basically, don’t think it’s a common convention or anything, an example would be

which gives us 307_200_000 ray-triangle intersection tests

bpj commented 1 year ago

> The number case is a good one -- unlike filenames and identifiers, that's probably not something you'd put in verbatim backticks. (In what contexts are numbers written this way?)

In some programming languages (I am familiar with it from Perl) it is allowed to write numbers this way for readability (i.e. underscores inside numbers are ignored). Since this is a programming usage, I'd expect it to be between backticks, and otherwise it is probably niche enough to expect those who use it to escape the underscores.

Either way I think the best "fix" is to say explicitly in the syntax specification that

  1. Intra-word underscores do count as emphasis delimiters according to the usual rules;
  2. therefore literal intra-word underscores should be typed as \_ (abc\_def\_ghi), but
  3. for clarity authors should use {_..._} (abc{_def_}ghi) for intra-word emphasis.
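Rule 2 is mechanical enough to automate when porting existing text. The sketch below escapes underscores that sit between two ASCII alphanumerics. `escape_intraword` is a hypothetical helper, not part of djot; it relies on the `%f` frontier pattern (documented from Lua 5.2 on), it does not recognize non-ASCII letters as word characters, and it makes no attempt to skip verbatim spans.

```lua
local function escape_intraword(text)
  -- (%w)_ matches an alphanumeric followed by an underscore;
  -- %f[%w] then requires the next character to be alphanumeric
  -- without consuming it, so runs like a_b_c are handled in one pass.
  local escaped = text:gsub("(%w)_%f[%w]", "%1\\_")
  return escaped
end

print(escape_intraword("abc_def_ghi"))       -- abc\_def\_ghi
print(escape_intraword("_emph_ and 1_000"))  -- _emph_ and 1\_000
```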

Conceivably it might, since Lua is unusual among modern programming languages in being Unicode-challenged, be declared to be up to implementations how they wish to treat intra-word underscores, which in practice would mean that both \_ and {_..._} would be de rigueur for source which wishes to be portable between implementations.

matklad commented 1 year ago

> Conceivably it might, since Lua is unusual among modern programming languages in being Unicode-challenged,

The way I understand this, our reluctance to rely on unicode character classes is not due to limitations of a particular implementation, but a design decision to better support unicode documents. Saying "[-_*\[\](){} \t\r\n] are the only symbols recognized by djot, and everything else gets pasted into the output verbatim" is a sure way to make sure we don't mess up the user's unicode input, whatever that might be.

In particular, this makes us resilient to future changes in unicode. If, for example, we used unicode's conception of a "word", then each version of unicode would formally need to cause a new version of djot, as unicode properties are not fixed in time.

Not sure this is exactly what jgm has in mind, but that's my wishful reading :) I do think that that's a great approach to tackle complexity here, and I'd love this to be a requirement for all impls (I think djot, unlike markdown, should explicitly discourage creation of flavors)
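A small Lua sketch of why the ASCII-only approach is safe for UTF-8 input: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a scan for ASCII specials can never land inside a non-ASCII character. The character class below transcribes the set quoted above into Lua pattern syntax; `SPECIAL` and `special_positions` are illustrative names, not djot internals.

```lua
-- The ASCII characters from the set quoted above, as a Lua character class.
local SPECIAL = "[%-_%*%[%]%(%){} \t\r\n]"

-- Return the byte positions of special characters in `text`.
-- Bytes of multi-byte UTF-8 sequences are all >= 0x80, so they never
-- match this ASCII-only class.
local function special_positions(text)
  local positions = {}
  local init = 1
  while true do
    local pos = text:find(SPECIAL, init)
    if not pos then return positions end
    positions[#positions + 1] = pos
    init = pos + 1
  end
end

-- "\194\161" is the UTF-8 encoding of the inverted exclamation mark.
local hits = special_positions("\194\161*Hola*!")
print(table.concat(hits, ","))  -- 3,8
```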

matklad commented 1 year ago

> Figuring out why "_Dog_" is emphasized but “_Dog_” isn't might be very confusing for users.

Interestingly, we have a similar problem with attribute attachment:

"good boy"{.ascii}
“good boy”{.unicode}

jgm commented 1 year ago

That's right -- this is another place where we use Lua's pattern-matching to find "the last word" to which to attach an attribute. We could always drop this convenience altogether and just force people to use explicit spans, to avoid this sort of confusion.

jgm commented 1 year ago

Note that in

"good boy"{.ascii}

you're actually getting the attributes on the double_quoted node:

    double_quoted class="ascii"
      str text="good boy"

So here there is actually a container; it's not a bare word.

jgm commented 1 year ago

We could make the following change to can_open/can_close calculations for inline containers:

diff --git a/djot/inline.lua b/djot/inline.lua
index da9e8cd..8fdc052 100644
--- a/djot/inline.lua
+++ b/djot/inline.lua
@@ -98,8 +98,10 @@ function Tokenizer.between_matched(c, annotation, defaultmatch, opentest)
   return function(self, pos)
     local defaultmatch = defaultmatch or "str"
     local subject = self.subject
-    local can_open = find(subject, "^%S", pos + 1)
-    local can_close = find(subject, "^%S", pos - 1)
+    local can_open = find(subject, "^%S", pos + 1) and
+                       (pos == 0 or find(subject, "^[%s%p]", pos - 1))
+    local can_close = pos > 0 and find(subject, "^%S", pos - 1) and
+                        find(subject, "^[%s%p]", pos + 1)
     local has_open_marker = matches_pattern(self.matches[pos - 1], "^open%_marker")
     local has_close_marker = byte(subject, pos + 1) == 125 -- }
     local endcloser = pos

This has the following effects on tests, some of which are fine and some undesirable:

FAILED at attributes.test line 105
--- INPUT -------------------------------------
{#id .cla*ss*
--- EXPECTED ----------------------------------
<p>{#id .cla<strong>ss</strong></p>
--- GOT ---------------------------------------
<p>{#id .cla*ss*</p>
-----------------------------------------------

FAILED at super_subscript.test line 1
--- INPUT -------------------------------------
H~2~O
--- EXPECTED ----------------------------------
<p>H<sub>2</sub>O</p>
--- GOT ---------------------------------------
<p>H~2~O</p>
-----------------------------------------------

FAILED at super_subscript.test line 7
--- INPUT -------------------------------------
mc^2^
--- EXPECTED ----------------------------------
<p>mc<sup>2</sup></p>
--- GOT ---------------------------------------
<p>mc^2^</p>
-----------------------------------------------

FAILED at super_subscript.test line 13
--- INPUT -------------------------------------
test^of superscript ~with subscript~^
--- EXPECTED ----------------------------------
<p>test<sup>of superscript <sub>with subscript</sub></sup></p>
--- GOT ---------------------------------------
<p>test^of superscript <sub>with subscript</sub>^</p>
-----------------------------------------------

FAILED at emphasis.test line 29
--- INPUT -------------------------------------
foo*bar*baz
--- EXPECTED ----------------------------------
<p>foo<strong>bar</strong>baz</p>
--- GOT ---------------------------------------
<p>foo*bar*baz</p>
-----------------------------------------------

FAILED at emphasis.test line 63
--- INPUT -------------------------------------
foo_bar_baz
--- EXPECTED ----------------------------------
<p>foo<em>bar</em>baz</p>
--- GOT ---------------------------------------
<p>foo_bar_baz</p>
-----------------------------------------------

FAILED at emphasis.test line 69
--- INPUT -------------------------------------
aa_"bb"_cc
--- EXPECTED ----------------------------------
<p>aa<em>&ldquo;bb&rdquo;</em>cc</p>
--- GOT ---------------------------------------
<p>aa_&ldquo;bb&rdquo;_cc</p>
-----------------------------------------------

FAILED at emphasis.test line 107
--- INPUT -------------------------------------
_(_foo_)_
--- EXPECTED ----------------------------------
<p><em>(</em>foo<em>)</em></p>
--- GOT ---------------------------------------
<p><em>(<em>foo</em>)</em></p>
-----------------------------------------------

238 tests completed in 0.012 s
PASSED:  230
FAILED:    8
ERRORS:    0

Of course, one could decide to treat the _ delimiter specially -- which is what a lot of markdown implementations do. Or perhaps all emphasis/strong emphasis delimiters, but not delimiters for things like superscript. But I think there's some value in keeping things uniform and predictable.
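To see why these particular tests change, the two conditions from the diff can be tried in isolation. This is a standalone sketch rather than the tokenizer code: the function names are illustrative, positions are 1-based, and start and end of input are handled explicitly instead of relying on the surrounding tokenizer.

```lua
-- Opener: next char is non-space, and the previous char is whitespace
-- or ASCII punctuation (or there is no previous char).
local function can_open(subject, pos)
  return subject:find("^%S", pos + 1) ~= nil
    and (pos == 1 or subject:find("^[%s%p]", pos - 1) ~= nil)
end

-- Closer: previous char is non-space, and the next char is whitespace
-- or ASCII punctuation (or there is no next char).
local function can_close(subject, pos)
  return pos > 1 and subject:find("^%S", pos - 1) ~= nil
    and (pos == #subject or subject:find("^[%s%p]", pos + 1) ~= nil)
end

print(can_open("H~2~O", 2))        -- false: "H" precedes the opener
print(can_open("foo_bar_baz", 4))  -- false: "o" precedes the opener
print(can_open("_(_foo_)_", 3))    -- true: "(" precedes the opener
```

The last case matches the nested output in the emphasis.test line 107 failure: once punctuation counts as a boundary, the inner underscores become delimiters too.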

jgm commented 1 year ago

One approach could be to just bite the bullet and require H{~2~}O and e=mc{^2^}.

Another possibility would be to give a radically different treatment to super and subscript. Instead of behaving like the other "surround" syntax, it could work this way:

    H~2O = subscripted 2 (next single character) (note, with the change to inline emphasis, this wouldn't be confused with emphasis marking)
    H{~22~}O = subscripted 22
    mc^2 = superscripted 2
    mc{^the first even prime^}

On balance though I think I prefer uniformity and `H{~2~}O`.
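For comparison, the alternative "next single character" treatment could be sketched in Lua as below. This only illustrates the idea floated above and is not proposed djot behavior: `render_sub` is a hypothetical function, it handles subscripts only (superscripts would mirror it), and it ignores escaping and nesting.

```lua
-- One UTF-8 encoded character: a lead byte plus any continuation bytes.
local CHAR = "[\1-\127\194-\253][\128-\191]*"

-- Render only the two subscript forms to HTML.
local function render_sub(text)
  -- Explicit form first, so its tildes are gone before the short form runs.
  text = text:gsub("{~(.-)~}", "<sub>%1</sub>")
  -- Short form: a tilde applies to the next single character.
  text = text:gsub("~(" .. CHAR .. ")", "<sub>%1</sub>")
  return text
end

print(render_sub("H~2O"))      -- H<sub>2</sub>O
print(render_sub("H{~22~}O"))  -- H<sub>22</sub>O
```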

matklad commented 1 year ago

Another bad example by @jamii:

_[foo](foo_bar)_

catwell commented 1 year ago

I am migrating my own blog (from lunamark) and hit this issue in URLs (typically Wikipedia URLs).

I am unsure how to fix it, for now I will probably just encode the URLs.
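One way to "just encode the URLs" during a migration is to percent-encode the characters that interact with inline parsing. The Lua sketch below is an assumption about how that could be done, not something djot provides: `encode_destination` is a hypothetical helper, and the choice to encode underscores and parentheses is specific to this workaround. Percent-encoded forms are normally decoded by the server, but that is worth checking for the sites being linked.

```lua
-- Percent-encode underscores and parentheses in a link destination.
-- Which characters to encode is a choice made for this sketch.
local function encode_destination(url)
  local encoded = url:gsub("[_%(%)]", function(c)
    return string.format("%%%02X", c:byte())
  end)
  return encoded
end

print(encode_destination("https://en.wikipedia.org/wiki/Paxos_(computer_science)"))
-- https://en.wikipedia.org/wiki/Paxos%5F%28computer%5Fscience%29
```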

catwell commented 1 year ago

I thought about that a bit more, and I feel like the parser should just ignore all openers and closers except parens when it is processing the location part of a link. That will not fix other instances of this but they can be solved with the curly bracket syntax.

Do you agree? If so I can try to fix it.

dumblob commented 1 year ago

@catwell in CommonMark the solution is to add a space before the closing parenthesis in the source text. I have used it this way for at least 10 years now and have never had an issue with URLs since, in any system.

andersk commented 1 year ago

@dumblob In @catwell’s example [link](foo_(bar_qux)), the first closing paren is supposed to be part of the URL. Spaces don’t help. CommonMark has a balanced parenthesis rule to deal with this. Djot evidently requires the parens to be escaped: [link](foo_\(bar_qux\)) or [link](foo_%28bar_qux%29). In any case, this is a separate issue from intraword emphasis, and it’s already been filed:

catwell commented 1 year ago

@andersk I do think it's related to intraword emphasis. It only happens due to the interaction of emphasis and parens in the URL.

For instance, your example breaks:

> input = "[Paxos](https://en.wikipedia.org/wiki/Paxos_(computer_science))"
> djot.render_ast_pretty(djot.parse(input))
doc
  para
    link destination="https://en.wikipedia.org/wiki/Paxos_(computer_science"
      str text="Paxos"
    str text=")"

But if you get rid of an underscore it does not:

> input = "[Paxos](https://en.wikipedia.org/wiki/Paxos_(computer-science))"
> djot.render_ast_pretty(djot.parse(input))
doc
  para
    link destination="https://en.wikipedia.org/wiki/Paxos_(computer-science)"
      str text="Paxos"

Still, it's good that you filed it separately.

mikekasprzak commented 6 months ago

I'm undecided on this, but I don't think this is a deal breaker for me, just not ideal. I too like to throw un_der_scores in the middle of a word for fun, or S_P_E_L_L out a word with them. I just wanted to mention that you can encounter similar output mischief writing math using * as multiplication.

C=65*8+1_000_000*8

That said it's fair to expect math to be backticked.

As I ponder it, I think I would prefer intraword underscores to just work and not be emphasized, but I haven't thought through all of the consequences. Personally I'm okay with "_Dog_" working while “_Dog_” doesn't, but I will admit having the option to differentiate between more types of punctuation and whitespace (making both work) would be ideal.

Some user confusion might be solvable with UX.

Just thinking out loud.