jgm / djot

A light markup language
https://djot.net
MIT License

attributes on bare words are English-biased #66

Closed matklad closed 1 year ago

matklad commented 1 year ago
hello{.en}
привет{.ru}

doc
  para
    str s="hello" class="en"
    softbreak
    str s="привет"
references = {
}
footnotes = {
}

I think both cases should behave the same way: either both apply the attribute or neither does.

jgm commented 1 year ago

Agreed that it's a bug. This one is due to the fact that Lua's built-in pattern syntax is not Unicode-aware. We use something like

                local lastwordpos = string.find(prevnode.s, "%w+$")

to figure out where the last word starts, but this breaks on the Cyrillic characters. It's possible to do better here; it will just add some complexity to this part of the code.
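To make the failure concrete, here is a minimal sketch in Python rather than Lua. It is not taken from the djot source. It assumes that Lua's `%w` behaves as an ASCII-only alphanumeric class (which is what happens in the default C locale), and the names `ASCII_WORD_AT_END`, `UNICODE_WORD_AT_END`, and `last_word_start` are made up for illustration.

```python
import re

# ASCII-only stand-in for the Lua pattern "%w+$" (assumption: C locale,
# where %w matches only ASCII letters and digits).
ASCII_WORD_AT_END = re.compile(r"[A-Za-z0-9]+$")

# One possible Unicode-aware alternative: [^\W_] is Python's \w minus
# the underscore, so it also matches Cyrillic letters.
UNICODE_WORD_AT_END = re.compile(r"[^\W_]+$")


def last_word_start(text, pattern):
    """Return the index where the trailing word starts, or None if there is none."""
    match = pattern.search(text)
    return match.start() if match else None


print(last_word_start("hello", ASCII_WORD_AT_END))     # 0
print(last_word_start("привет", ASCII_WORD_AT_END))    # None
print(last_word_start("привет", UNICODE_WORD_AT_END))  # 0
```

With the ASCII-only class there is no "last word" to find in `привет`, which matches the AST in the report: the `{.ru}` attribute has nothing to attach to.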

jgm commented 1 year ago

By the way, an easier behavior to implement would be to put attributes on the previous node. This would mean that in

*six* blue dogs{.foo}

the foo class gets attached to " blue dogs" rather than just "dogs". This change would remove the need to detect "words," which is actually quite hard without proper unicode libraries. However, it is much more intuitive to the reader/writer that the {.foo} would just attach to "dogs".
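The two behaviors can be compared with a rough sketch. It models inline nodes as plain tuples, which is an assumption for illustration and not djot's actual AST representation; `attach_to_previous_node` and `attach_to_last_word` are hypothetical helpers.

```python
def attach_to_previous_node(nodes, cls):
    """Put the class on the whole node that precedes the attribute."""
    *rest, (kind, text) = nodes
    return rest + [(kind, text, cls)]


def attach_to_last_word(nodes, cls):
    """Split the preceding node at its last ASCII space and put the class
    on the final word only (a simplification of real word detection)."""
    *rest, (kind, text) = nodes
    head, sep, word = text.rpartition(" ")
    prefix = [(kind, head + sep)] if head or sep else []
    return rest + prefix + [(kind, word, cls)]


# Inline nodes for: *six* blue dogs{.foo}
nodes = [("emph", "six"), ("str", " blue dogs")]

print(attach_to_previous_node(nodes, "foo"))
# [('emph', 'six'), ('str', ' blue dogs', 'foo')]
print(attach_to_last_word(nodes, "foo"))
# [('emph', 'six'), ('str', ' blue '), ('str', 'dogs', 'foo')]
```

The first variant never has to decide what a word is, which is why it is easier to implement. The second gives the result most writers would expect.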

jgm commented 1 year ago

A simpler idea would be that the attribute applies to the last run of consecutive non-space characters (where non-space is defined in the ASCII way).
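A minimal sketch of that rule in Python, for illustration only. The exact set of characters counted as ASCII whitespace (space, tab, newline, carriage return, vertical tab, form feed) is an assumption, and `attribute_target` is a hypothetical name.

```python
import re

# Trailing run of characters that are not ASCII whitespace.
# Assumed whitespace set: space, \t, \n, \r, vertical tab, form feed.
NON_SPACE_RUN_AT_END = re.compile(r"[^ \t\n\r\x0b\x0c]+$")


def attribute_target(text):
    """Return the text the attribute would apply to, or None if the
    string ends in whitespace (or is empty)."""
    match = NON_SPACE_RUN_AT_END.search(text)
    return match.group(0) if match else None


print(attribute_target(" blue dogs"))  # dogs
print(attribute_target("привет"))      # привет
print(attribute_target("dogs "))       # None
```

Because the rule only asks "is this byte ASCII whitespace?", it needs no Unicode tables and treats Latin and Cyrillic words alike.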

matklad commented 1 year ago

Yeah, but then ?{.heh} would be valid, and somewhat surprising, syntax. There's also a pitfall around ASCII whitespace, in that not all whitespace is ASCII: https://doc.rust-lang.org/reference/whitespace.html

In particular, ltr / rtl marks feel like something which might have some interactions here, but maybe not.
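As a rough illustration of that pitfall, the sketch below compares an ASCII-only whitespace set with the Pattern_White_Space code points listed on the Rust reference page linked above. The helper `trailing_run` is hypothetical, and nothing here reflects how djot actually behaves.

```python
# ASCII whitespace: tab, LF, VT, FF, CR, space.
ASCII_WHITE_SPACE = "\t\n\x0b\x0c\r "

# Pattern_White_Space adds NEL, the LTR and RTL marks, and the line and
# paragraph separators (per the Rust reference page linked above).
PATTERN_WHITE_SPACE = ASCII_WHITE_SPACE + "\u0085\u200e\u200f\u2028\u2029"


def trailing_run(text, whitespace):
    """Return the trailing run of characters that are not in `whitespace`."""
    start = len(text)
    while start > 0 and text[start - 1] not in whitespace:
        start -= 1
    return text[start:]


# A LINE SEPARATOR (U+2028) between the words is invisible to the ASCII rule.
text = "blue\u2028dogs"
print(trailing_run(text, ASCII_WHITE_SPACE) == text)      # True
print(trailing_run(text, PATTERN_WHITE_SPACE) == "dogs")  # True

# A RIGHT-TO-LEFT MARK (U+200F) just before the attribute changes the outcome too.
marked = "dogs\u200f"
print(trailing_run(marked, ASCII_WHITE_SPACE) == marked)  # True
print(trailing_run(marked, PATTERN_WHITE_SPACE) == "")    # True
```

Under the ASCII definition the directional mark becomes part of the attributed text. Under the broader definition the attribute would follow whitespace and have no word to attach to.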

Overall, my gut feeling is that attributes on words is a rather niche use-case, and asking the user to type [dogs]{.foo} isn't much of a burden.

jgm commented 1 year ago

What would be wrong with ?{.heh}?

I know that there is non-ASCII whitespace, but we could ignore that for purposes of this feature.

I'm not sure how this would work with RTL languages.

I guess you're suggesting removing attributes on bare words. But then, what should happen when someone writes foo{.bar}?

matklad commented 1 year ago

> What would be wrong with ?{.heh}?

Nothing, really, but it does look like an oddity. I can only object to this on aesthetic grounds :) I guess what makes me uneasy is that {} becomes a greedy syntax which always applies, which makes it harder to detect typos and invalid syntax.

> I guess you're suggesting removing attributes on bare words

Yup. As I mentioned in the other issue, I think this would also perhaps allow us to get rid of a leading . for attrs.

> But then, what should happen when someone writes foo{.bar}?

The same as for a space here-> {.bar}. What exactly that should be, I am unsure. It looks like today the parser basically eats the attribute. I think it's better to interpret it just as text?

matklad commented 1 year ago

closed by https://github.com/jgm/djot/commit/ecbaf9d5dce2485544fd349c8e095e28dcaa01c7