Tracking context inside a grammar

joshgoebel commented 4 years ago

 * Limitations: There is no way (afaik) in highlight.js to assign classes based
 * on position. As a result, a TripleConstraint with a predicate and a datatype
 * will have the same class applied to both.

Could you provide a quick example of what you're talking about here? While it may be true that it's not (currently) easy to do, you should be able to define a sub-mode that derives from the parent mode but is different... so something nested inside a block could be highlighted differently than the same thing would be outside the block.

But perhaps that's not what you're getting at here at all.

ericprud commented 4 years ago

In the following example, the prefixed names schema:name and xsd:string get marked up the same way:

<PersonShape> {
  schema:name xsd:string ;
}

The production for everything inside the '{}'s (tripleExpression) contains the production for prefixedName

  productions.tripleExpression = {
    begin: common.iris_RE,
    end: common.EndOfDocument,
    returnBegin: true,
    endsWithParent: true,
    keywords: shapeExpression_keywords,
    contains: [productions.IRIREF, productions.prefixedName].concat(shapeExprContentModel),
    relevance: 0
  }

so if it matches twice, e.g. on "schema:name" and "xsd:string", it gets the same annotation.

The highlighter could help the user by marking them differently but I haven't figured out how to say "after matching IRIREF or prefixedName once, switch to production X" (which can match them again, but with different classNames). In the ace ShExC mode (see demo), an IRI in a tripleExpression dives into a nested production which (with some digging) annotates an IRI with datatype. The demo only bolds datatypes, but that just reflects my lack of CSS creativity.

joshgoebel commented 4 years ago

Yeah, this isn't intuitive... this is the kind of thing callbacks would help with. But you can do it now, it's just messy.

You have to chain rules. You can do that either with parent/child relationships and playing with the rule terminations, or (probably simpler) with starts.

So, to match, say, two terms "abc xyz" with a rule that treats the 2nd one differently:

// rough pseudo-code
{ // rule "one"
  begin: /\w{3}/,
  className: "termOne",
  // since there is no end, it will end immediately and starts will be triggered
  // (or you can eat extra spacing, etc. with end and excludeEnd)
  starts: {
    contains: [{
      // our second term
      begin: /\w{3}/,
      className: "termTwo"
      // this rule also ends immediately; control returns to its parent (the starts mode),
      // which immediately ends, and then control returns to the parent of rule "one"
      // actually you might need `endsParent` here to prevent matching more than one termTwo
    }]
  }
}

Hopefully that gives you the idea. If you had to do this all the time you'd write a helper for it.

I.e., the only way to track grammar context right now is the mode tree (and only very simple context at that).

ericprud commented 4 years ago

Edit: updated code below to make it work, look for end: /\B\b/.

i tried this doc in extra:

<html>
  <head>
    <title>context-sensitive grammar in highlightjs</title>
    <link rel="stylesheet" href="../../build/styles/default.min.css"/>
    <style>
.hljs-termOne { color: red; }
.hljs-termTwo { color: blue; }
    </style>
  </head>
  <body>
    <pre><code class="toy">
      one two one two one two
    </code></pre>
    <script src="../../src/highlight.js"></script>
    <script>
hljs.registerLanguage("toy", function () {
  return function (hljs, options = {}) {
    return {
      contains: [
        {
          className: "termOne",
          begin: /\w{3}/,
          starts: {
            end: /\B\b/, // Added following @yyyc514's advice below
            contains: [{
              className: "termTwo",
              begin: /\w{3}/,
              endsParent: true
            }]
          }
        }
      ]
    }
  }
}());
hljs.initHighlightingOnLoad();
    </script>
  </body>
</html>

but only got termOnes (red):

<code class="toy hljs">
      <span class="hljs-termOne">one</span> <span class="hljs-termOne">two</span> <span class="hljs-termOne">one</span> <span class="hljs-termOne">two</span> <span class="hljs-termOne">one</span> <span class="hljs-termOne">two</span>
    </code>

Any advice?

Edit: now works.

joshgoebel commented 4 years ago

Child modes MUST match something or they will end (since you didn't specify end anywhere it'll always try to end FAST). Your starts mode isn't matching the space, so it ends because the second term doesn't immediately follow the first.

You need a rule to eat spaces. Or you could change the end rule of your starts block to be a non-match... i.e. a regex that can't possibly match anything. I looked that up the other day but can't recall it off the top of my head.

Of course then you need to be SURE a 2nd term is following or you'll get stuck. So spaces might be a better approach.
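
For anyone following along, here is a rough sketch of the two options. The never-matching regex is the `/\B\b/` that ends up being used elsewhere in this thread; the shape of the space-eating rule (`eatSpaces`) is only an assumption about what such a rule could look like, not something confirmed here. These are plain mode objects, not a registered language.

```javascript
// \B asserts "not at a word boundary" and \b asserts "at a word boundary"
// at the same position, so together they can never match.
const NEVER_MATCH = /\B\b/;

const samples = ["", " ", "one two", "a", "foo-bar baz", "\n"];
const anyMatch = samples.some((s) => NEVER_MATCH.test(s));
console.log(anyMatch); // false

// The second term; endsParent closes the surrounding starts mode after one match.
const termTwo = { className: "termTwo", begin: /\w{3}/, endsParent: true };

// Option 1: the starts mode never ends on its own, only via termTwo.
// Caveat from above: if no second term follows, the mode stays open.
const startsNeverEnds = { end: NEVER_MATCH, contains: [termTwo] };

// Option 2 (assumed shape): a child rule that consumes whitespace, so the
// starts mode has something to match before the second term begins.
const eatSpaces = { begin: /\s+/, relevance: 0 };
const startsEatsSpaces = { contains: [eatSpaces, termTwo] };
```

Either object would go in the `starts` slot of the first term's rule.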

ericprud commented 4 years ago

added end: /\B\b/ to make it work, tx!

Will keep the issue open until i apply this tech to the ShExC and maybe SPARQL and Turtle grammars in the multi-lang branch.

joshgoebel commented 4 years ago

Then if you really want to get fancy you can whip up a helper to build those objects for you (if you needed the pattern often):

requireSequence([
  { className: "termOne", begin: /\w{3}/ },
  { className: "termTwo", begin: /\w{3}/ }
])

I guess you'd have to think of a way to encode the "allow spaces or not" type info...
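
A minimal sketch of what such a helper might look like. To be clear, `requireSequence` is hypothetical and not part of highlight.js, and the `allowGap` option is just one made-up way to encode the "allow spaces or not" info. It only builds the nested `starts` objects; the two-term shape is the one shown working in this thread, and longer chains are untested.

```javascript
// Hypothetical helper, NOT part of highlight.js: chains mode objects via
// `starts` so each term is only matched after the previous one.
const NEVER_MATCH = /\B\b/; // can never match, so the starts mode waits for its child

function requireSequence(terms, { allowGap = true } = {}) {
  if (terms.length === 0) {
    throw new Error("requireSequence needs at least one term");
  }
  // Fold from the last term backwards: each earlier term wraps the rest.
  return terms.reduceRight((next, term) => {
    if (next === null) return { ...term };
    const starts = { contains: [{ ...next, endsParent: true }] };
    // With allowGap, intervening text (such as spaces) is tolerated
    // because the starts mode cannot end until the next term matches.
    if (allowGap) starts.end = NEVER_MATCH;
    return { ...term, starts };
  }, null);
}

const sequence = requireSequence([
  { className: "termOne", begin: /\w{3}/ },
  { className: "termTwo", begin: /\w{3}/ }
]);
console.log(sequence.starts.contains[0].className); // prints termTwo
```

The result has the same shape as the hand-written termOne/termTwo rule, so it could be dropped into a language's `contains` array.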

I'd really love to allow naming of sub-match expressions (like you see in Textmate grammars and such) but JS has no way to pull location data from them...

ericprud commented 4 years ago

i'm trying to make a production that's indirectly called by starts consume until it hits a closing delimiter, but it seems to return as soon as any match is made. Any tricks to get around that?

https://github.com/highlightjs/highlightjs-shexc/commit/595f7fdec64ca288418acc7c060ca79d9e9625ea

joshgoebel commented 4 years ago

I'm not sure I follow what you're trying to do, and I'm not sure self works with starts; that seems strange to me. If [] is recursive then self belongs inside contains, not starts.

So once value opens it shouldn't close until ] is found... and if it's closing prematurely due to a SECOND set of brackets then typically you handle that with contains: [self].
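
Roughly, as a sketch only: the class name and the bracket delimiters below are placeholders rather than the actual ShExC production, and the mode is a bare object rather than a full language definition.

```javascript
// A bracketed value that may nest, e.g. [ a [ b ] c ].
// "self" inside contains lets the mode contain itself, so an inner "[" opens
// a nested copy and the first "]" closes only that inner copy.
const bracketedValue = {
  className: "value", // placeholder class name
  begin: /\[/,
  end: /\]/,
  contains: ["self"]
};
```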

joshgoebel commented 4 years ago

You can add ADDITIONAL end matches inside contains with endsParent...
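
Something like this, as a sketch: the semicolon terminator is purely illustrative (an assumption, not taken from the ShExC grammar), and the class name is a placeholder.

```javascript
// A value that normally ends at "]" but can also be ended early by a child rule.
const valueWithExtraEnd = {
  className: "value", // placeholder class name
  begin: /\[/,
  end: /\]/,
  contains: [
    // Illustrative extra terminator: when this child matches, endsParent
    // closes the surrounding value mode as well.
    { begin: /;/, endsParent: true }
  ]
};
```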

joshgoebel commented 4 years ago

When trying to build up something complex, the best thing to do is get it working with the simplest possible case, commit that, then slowly add additional test cases one by one and expand the matches each time.

joshgoebel commented 4 years ago

It might be easier to see a failing markup test of what you're trying to accomplish.

highlightjs / highlightjs-shexc

Tracking context inside a grammar #4