Option to switch priority between user-defined tokens and built-in tokens

mqudsi commented 6 years ago

Hi there,

Awesome project. I love the strict approach taken to validation, and the owl utility is really nifty: it makes writing a definition for an existing language a breeze.

However, I'm running into an issue that I think can be solved fairly easily in the library: the current token classes are not comprehensive, in that some symbols don't fall into any of the categories number | string | identifier.

For context, I'm trying to use owl to model a shell scripting language, where two things don't play well with owl's current architecture. First, tokens that don't match a reserved pattern should be treated as strings ("literal data"). Second, line breaks are significant characters that shouldn't be treated as generic whitespace (I'll file a separate issue for that).

This first issue should be solvable by introducing a final token class, "other" or "unmatched", which accepts literally anything so long as no keyword is expected at that point. It could then be used to accept unquoted content wherever that does not introduce ambiguities into the parser.
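A rough illustration of the idea, written in Python rather than against owl itself. Everything here is hypothetical: the function names, the two stand-in built-in patterns, and especially the assumption that a token simply ends at the next whitespace (quoted strings are left out for brevity). It only shows the intended fallback order, not how owl tokenizes.

```python
import re

# Stand-ins for built-in token classes (hypothetical, not owl's real patterns).
BUILTIN_PATTERNS = [
    ("number", re.compile(r"\d+(\.\d+)?")),
    ("identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
]


def classify(word, expected_keywords=frozenset()):
    """Classify one whitespace-delimited word.

    expected_keywords is the set of keywords the parser would accept at
    this position; it is assumed to be supplied by the parser.
    """
    if word in expected_keywords:
        return "keyword"
    for name, pattern in BUILTIN_PATTERNS:
        if pattern.fullmatch(word):
            return name
    # Nothing else matched: treat the word as literal data.
    return "other"


def tokenize(line, expected_keywords=frozenset()):
    """Split on whitespace and classify each word (same keyword set throughout)."""
    return [(classify(word, expected_keywords), word) for word in line.split()]
```

With these assumptions, `tokenize("cp ./a.txt /tmp/b-2 42")` classifies `cp` as an identifier, the two paths as "other", and `42` as a number, while `classify("if", {"if"})` yields a keyword only because the caller said a keyword is expected there.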

ianh commented 6 years ago

I agree this would be useful; the tough part is defining exactly when the "other" token would end.

I'm planning to add custom token support soon, which may take care of this. Do you think an API like the one in https://github.com/ianh/owl/issues/2#issuecomment-409324925 would be enough to solve the problem? Then you could define your own other token class and match it using a callback.

mqudsi commented 6 years ago

Sorry for the late response.

I'm not sure that feature would tackle this issue, however. The application would have to determine whether a keyword is actually acting as a token at its current position, which means it would need to implement its own parser just to find out.

mqudsi commented 6 years ago

So I gave this another gander and I looked at the recently implemented user-defined tokens (thanks for that!). The primary blocker is this:

"If there's a conflict between a user-defined token match and another token match, the longest match will be chosen, with ties going to keywords first, then user-defined token types, then other token types (like identifier and so on)."

Having an option to flip the order of the last two would take care of this, as it would allow user-defined tokens to only be considered after all others have failed to match. (Whether an attempt to match a user-defined token is made before all other options are first exhausted would be just a question of optimization.)
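A minimal sketch of that tie-breaking rule and the proposed flip, in Python and independent of owl's actual implementation. The `Match` type, the `pick` function, and the two order tuples are hypothetical names made up for this example; the only thing taken from the rule quoted above is "longest match first, then keywords, user-defined tokens, built-ins".

```python
from typing import NamedTuple


class Match(NamedTuple):
    kind: str    # "keyword", "user", or "builtin"
    name: str    # token name, e.g. "identifier"
    length: int  # number of characters matched


# Current behavior as described: keywords, then user-defined, then built-ins.
DEFAULT_ORDER = ("keyword", "user", "builtin")
# Proposed option: user-defined tokens lose ties to built-ins.
FLIPPED_ORDER = ("keyword", "builtin", "user")


def pick(matches, order=DEFAULT_ORDER):
    """Longest match wins; ties go to the kind listed first in `order`."""
    return min(matches, key=lambda m: (-m.length, order.index(m.kind)))


tie = [Match("user", "other", 3), Match("builtin", "identifier", 3)]
print(pick(tie).name)                 # default order
print(pick(tie, FLIPPED_ORDER).name)  # flipped order
```

Under this reading the flip only changes how ties are resolved: a user-defined match that is strictly longer than every built-in match would still win. Whether the option should also override the longest-match rule is a separate design question.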

ianh commented 6 years ago

Would you mind giving a few examples of strings and how you'd want them to be tokenized? The reason conflicts are resolved in this order is to allow you to override the built-ins—it might make sense to consider everything as a user-defined token and ignore the built-in types here.

Option to switch priority between user-defined tokens and built-in tokens #3