Our current GitHub reference detection uses regular expressions to
extract known reference patterns from a block of text. It is possible
(if not easy) to reason about the regular expressions used to match a
particular kind of reference. But it is much harder to reason about how
these interact. For instance, in the current system, a string consisting
of a URL whose fragment is a username mention parses to two references
(!): one URL reference and one username reference. This is because a URL
is allowed to end in a fragment specifier `#`, which is not a word
character (`[A-Za-z0-9_]`), and a username reference is required to be
preceded by a non-word character.

Because the regular expressions used to match the different reference
types all execute independently, it is hard to reason about scenarios
like this one. The reader must consider the cross product of “the end of
one regular expression” and “the start of another”, for instance. Of
course, attempting to combine all the different reference types into one
giant regular expression would likely be a bad idea, especially
considering that paired-with references work by computing a difference
of regular expression matches.

However, suppose that we reconsider the premise of using a regular
expression to find references globally in a string. Instead, we may
convert the input to a stream of tokens, and then make a single pass
over that token stream, identifying all kinds of references at once. We
can use similar, but simpler, regular expressions to check whether a
particular token is a particular kind of reference. Then, these
expressions wouldn’t need to worry about adding non-capturing groups
`(?:\W|^)` and lookaheads `(?=\W|$)` for detecting word boundaries (and
the resulting subtleties that this entails).
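
To illustrate, here is a minimal sketch of what a per-token check might
look like. (The patterns and reference shapes are hypothetical stand-ins,
not the ones actually used in `parseReferences`; the point is only that
anchoring with `^` and `$` on an isolated token replaces the
word-boundary machinery.)

```js
// Hypothetical per-token patterns: since each token is already isolated,
// we can anchor with ^ and $ instead of (?:\W|^) and (?=\W|$).
const USERNAME_PATTERN = /^@([A-Za-z0-9-]+)$/;
const ISSUE_PATTERN = /^#(\d+)$/;

function classifyToken(token) {
  const username = token.match(USERNAME_PATTERN);
  if (username != null) {
    return {type: "USERNAME", username: username[1]};
  }
  const issue = token.match(ISSUE_PATTERN);
  if (issue != null) {
    return {type: "ISSUE", number: +issue[1]};
  }
  return null; // this token is not a reference
}

// classifyToken("@decentralion") => {type: "USERNAME", username: "decentralion"}
// classifyToken("#123")          => {type: "ISSUE", number: 123}
// classifyToken("hello")         => null
```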
We would also gain the ability to reason more clearly about interactions
by modifying state as we traverse the token stream: for instance, the
rule “if you see the token `Paired` followed by the token `with`, then
the next token is considered to be a paired-with reference instead of a
username reference” is easy to express, and eliminates the need for the
funky multiset-difference in the current implementation. For parsing URL
references, we would gain the ability to defer to an actual URL-parsing
library to handle all the special cases that we’re likely to forget.

In summary, I suspect that the resulting system would be a simple state
machine that is easy to understand.
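
As a rough sketch of what that state machine might look like (again with
hypothetical token shapes, and with the standard `URL` constructor
standing in for whatever URL-parsing library we would actually pick):

```js
// Sketch: a single pass over the token stream. The only state needed for
// the "Paired with" rule is whether we just consumed "Paired"/"with".
function findReferences(tokens) {
  const references = [];
  let pairedWithPending = false;
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (token === "Paired" && tokens[i + 1] === "with") {
      // The next username token becomes a paired-with reference rather
      // than a plain username reference.
      pairedWithPending = true;
      i++; // also consume "with"
      continue;
    }
    const username = token.match(/^@([A-Za-z0-9-]+)$/); // hypothetical pattern
    if (username != null) {
      references.push({
        type: pairedWithPending ? "PAIRED_WITH" : "USERNAME",
        username: username[1],
      });
      pairedWithPending = false;
      continue;
    }
    pairedWithPending = false;
    // Defer URL handling to a real parser instead of a hand-rolled regex.
    try {
      const url = new URL(token);
      if (url.protocol === "http:" || url.protocol === "https:") {
        references.push({type: "URL", url: url.href});
      }
    } catch (e) {
      // Not a URL; this token is not a reference.
    }
  }
  return references;
}

// findReferences(["Paired", "with", "@decentralion"])
//   => [{type: "PAIRED_WITH", username: "decentralion"}]
```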
Note that hand-writing a tokenizer is quite easy, and could plausibly
require not much more than `(s) => s.split(" ")`.
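
For instance (a sketch, assuming that whitespace-delimited tokens are
granular enough for our purposes):

```js
// Split on runs of whitespace and drop any empty tokens.
const tokenize = (s) => s.split(/\s+/).filter((t) => t.length > 0);

// tokenize("Paired with @decentralion")
//   => ["Paired", "with", "@decentralion"]
```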
This change would be highly self-contained; only the `parseReferences`
module would need to change. It also probably falls pretty low on the
priority list: though I do believe that it would greatly improve the
robustness of that module, that module’s robustness simply isn’t
critical to us right now. In the long term, once we have the
resources to expend on polishing all aspects of the system, this is
definitely something that I would like to see. (Or, in other words:
contributions welcome!)

cc decentralion