Documentation on Custom Tokenization

TheThirdOne commented 6 years ago

I can't find any resources on how to define a custom tokenizer other than the hex-color example. That example seems considerably simpler than the custom tokenizers owl can currently support (for example, it doesn't use the tokenization info).

Additionally, knowing more details about how the default tokenizer works would be good as well, particularly how whitespace is handled. I would have no idea how to make a whitespace parser from looking at the current examples and readme.

ianh commented 6 years ago

For more info about custom tokens, check out the user-defined tokens section of generated-parser.md in the docs.

The full details of how lexing works are in src/x-tokenize.h. Here's where whitespace is handled in the default lexing loop, for example. There has been some discussion of customizing whitespace handling (see #4), but I'd like to see more examples of what people want to do with it before committing to a design. Are you looking to implement Python-style indentation-based syntax or something else?

ianh commented 6 years ago

I just realized that the documentation doesn't mention the info parameter explicitly. I added some more detail in beef920.

TheThirdOne commented 6 years ago

I hadn't seen that section on user-defined tokens in the docs.

> The full details of how lexing works are in src/x-tokenize.h. Here's where whitespace is handled in the default lexing loop,

I definitely could read over the code carefully to learn what I need to, but having it explained in the docs would be nice.

> There has been some discussion of customizing whitespace handling (see #4)

I had seen that, and at this point I could write a tokenizer that does any of the whitespace-related things that had popped into my head.

> Are you looking to implement Python-style indentation-based syntax or something else?

I have a language in mind that occasionally needs explicit newlines for parsing, and I was curious how comments were coded without a newline token (I now know that is hardcoded).

It seems that the default tokenizer is pretty different from what I would need for the type of parsing I want to do. I guess the one question left on my mind is how to override it completely so that only a custom tokenizer is active.

Do lines like this get added by simply not referencing the token in the grammar?

#define IF_NUMBER_TOKEN(...) if (0) { /* no number tokens */  }

Is it even possible for a custom tokenizer to prevent the whitespace slurping here?

ianh commented 6 years ago

> Is it even possible for a custom tokenizer to prevent the whitespace slurping here?

No, not at the moment. As you can maybe guess from its name, I originally wanted to allow overriding owl_default_tokenizer_advance entirely, but I couldn't find a good way to expose the information you'd need to do this cleanly.

Would it be enough to specify the whitespace characters explicitly? Something like this:

# only treat tabs and spaces as whitespace characters
.whitespace '\t' ' '

Then you could do whatever you want with the newline characters in the tokenizer function.
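To make the idea concrete, here is a minimal C sketch of skipping only a configured set of whitespace characters. The name `skip_whitespace` and its parameters are hypothetical and are not taken from Owl's generated code; the sketch only illustrates that, with tabs and spaces as the whitespace set, a newline is left in place for the tokenizer function to see.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not Owl's API): starting at `offset`, advance past
   every character that appears in `whitespace_set` and return the new
   offset. `offset` must not exceed the length of `text`. */
static size_t skip_whitespace(const char *text, size_t offset,
                              const char *whitespace_set)
{
    assert(text != NULL && whitespace_set != NULL);
    /* strchr also matches the terminating NUL, so test for it first. */
    while (text[offset] != '\0' &&
           strchr(whitespace_set, text[offset]) != NULL)
        offset++;
    return offset;
}
```

Calling `skip_whitespace(" \t\nfoo", 0, " \t")` stops at the newline, whereas passing `" \t\n"` as the set would skip it as well.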

TheThirdOne commented 6 years ago

> No, not at the moment. As you can maybe guess from its name, I originally wanted to allow overriding owl_default_tokenizer_advance entirely, but I couldn't find a good way to expose the information you'd need to do this cleanly.

Maybe just have another function that can be used instead (selected by a boolean option), with whitespace handling and all other default tokenization removed. Perhaps with a few additional things the custom tokenizer can return to update the offset and such without making a token.

Something like specifying the whitespace characters could work. I think in non-extreme cases, that would work pretty well if whitespace actually has meaning. If nothing were specified, would that allow the tokenizer to specify all whitespace as special and then just handle it in the grammar?

ianh commented 6 years ago

Yeah, though lacking whitespace could make ambiguity reporting unreliable. For example, a grammar like:

x = y y
y = 'a' | 'a' 'a' | 'aa'

Would be reported as ambiguous ('a' 'a' 'a' -> "aaa"), but due to longest-match tokenization, "aaa" is not a real ambiguity (it's always parsed as "aa" "a"). With whitespace, Owl could report the ambiguity as "a a a", which is legitimately ambiguous. It's possible to check whether this can happen, but I'm leaning toward just giving a warning when reporting ambiguities without any whitespace specified.
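The longest-match behavior can be illustrated with a small standalone C sketch. This is not Owl's tokenizer: the function names are made up for the example, the keyword list is hardcoded to 'a' and 'aa', and whitespace is not handled at all. It only shows why "aaa" has a single tokenization under longest match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length of the longest keyword that is a prefix of `text`,
   or 0 if no keyword matches. */
static size_t longest_match(const char *text, const char *const *keywords,
                            size_t keyword_count)
{
    size_t best = 0;
    assert(text != NULL && keywords != NULL);
    for (size_t i = 0; i < keyword_count; i++) {
        size_t len = strlen(keywords[i]);
        if (len > best && strncmp(text, keywords[i], len) == 0)
            best = len;
    }
    return best;
}

/* Split `text` by repeatedly taking the longest match. Stores up to
   `max_tokens` token lengths in `lengths` and returns how many were
   stored; stops early at the first character no keyword matches. */
static size_t tokenize_lengths(const char *text, size_t *lengths,
                               size_t max_tokens)
{
    static const char *const keywords[] = { "a", "aa" };
    size_t count = 0;
    assert(text != NULL && lengths != NULL);
    while (*text != '\0' && count < max_tokens) {
        size_t len = longest_match(text, keywords, 2);
        if (len == 0)
            break;
        lengths[count++] = len;
        text += len;
    }
    return count;
}
```

For the input "aaa" this yields two tokens of lengths 2 and 1, matching the "aa" "a" reading described above.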

ianh commented 6 years ago

I just pushed a bunch of changes to make .whitespace do what we discussed here. Custom tokenizer functions can also now return tokens with type OWL_WHITESPACE to treat a length of text as whitespace. The documentation should be up-to-date with these changes. Let me know if you have any feedback!
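As a rough illustration of the idea, the following C sketch shows a tokenizer function that reports a line comment as whitespace and a newline as a real token. Everything here is a stand-in: `sketch_token`, `sketch_tokenize`, the `SKETCH_*` constants, and the choice of ';' as a comment character are hypothetical, and Owl's generated header defines its own token struct, function signature, and constants (including OWL_WHITESPACE). See the user-defined tokens section of the docs for the real interface.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in declarations for illustration only; these are not Owl's. */
enum sketch_token_type {
    SKETCH_NO_MATCH,
    SKETCH_WHITESPACE, /* plays the role of OWL_WHITESPACE */
    SKETCH_NEWLINE     /* a user-defined token for significant newlines */
};

struct sketch_token {
    enum sketch_token_type type;
    size_t length; /* number of characters consumed */
};

/* Treat a ';' line comment as whitespace (up to, but not including, the
   newline) and report the newline itself as a token. Anything else is
   left to the default tokenization by returning "no match". */
static struct sketch_token sketch_tokenize(const char *text)
{
    struct sketch_token token = { SKETCH_NO_MATCH, 0 };
    assert(text != NULL);
    if (text[0] == '\n') {
        token.type = SKETCH_NEWLINE;
        token.length = 1;
    } else if (text[0] == ';') {
        size_t n = 1;
        while (text[n] != '\0' && text[n] != '\n')
            n++;
        token.type = SKETCH_WHITESPACE;
        token.length = n;
    }
    return token;
}
```

The design point is that the function decides, per match, whether a span of text becomes a token the grammar sees or is simply skipped as whitespace.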

ianh commented 6 years ago

Closing this issue as resolved. Feel free to open another issue for any problems or questions.

ianh / owl

Documentation on Custom Tokenization #11