sourcecred / sourcecred

a social algorithm for computing cred
https://sourcecred.io

Consider using a hand-written tokenizer for GitHub references #481

Open wchargin opened 6 years ago

wchargin commented 6 years ago

Our current GitHub reference detection uses regular expressions to extract known reference patterns from a block of text. It is possible (if not easy) to reason about the regular expression used to match a particular kind of reference. But it is much harder to reason about how these expressions interact with one another. For instance, in the current system, the string

https://github.com/sourcecred/sourcecred/issues/1#@foo

parses to two references (!):

[
  {
    "refType": "BASIC",
    "ref": "https://github.com/sourcecred/sourcecred/issues/1"
  },
  {
    "ref": "@foo",
    "refType": "BASIC"
  }
]

This is because a URL is allowed to end in a fragment specifier #, which is not a word character [A-Za-z0-9_], and a username reference is required to be preceded by a non-word character.
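To make the interaction concrete, here is a minimal sketch of the failure mode, with simplified stand-in patterns rather than the actual parseReferences expressions:

const urlPattern =
  /(?:\W|^)(https:\/\/github\.com\/[\w-]+\/[\w-]+\/issues\/\d+)(?=\W|$)/g;
const usernamePattern = /(?:\W|^)(@[\w-]+)(?=\W|$)/g;

const input = "https://github.com/sourcecred/sourcecred/issues/1#@foo";

// Each expression is applied independently, and each one happily matches:
[...input.matchAll(urlPattern)].map((m) => m[1]);
// => ["https://github.com/sourcecred/sourcecred/issues/1"]
[...input.matchAll(usernamePattern)].map((m) => m[1]);
// => ["@foo"]   (the "#" before "@foo" is a non-word character)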

Because the regular expressions used to match the different reference types all execute independently, it is hard to reason about scenarios like this one. The reader must consider the cross product of “the end of one regular expression” and “the start of another”, for instance. Of course, attempting to combine all the different reference types into one giant regular expression would likely be a bad idea, especially considering that paired-with references work by computing a difference of regular expression matches.

However, suppose that we reconsider the premise of using a regular expression to find references globally in a string. Instead, we may convert the input to a stream of tokens, and then make a single pass over that token stream, identifying all kinds of references at once. We can use similar, but simpler, regular expressions to check whether a particular token is a particular kind of reference. Then, these expressions wouldn’t need non-capturing groups (?:\W|^) and lookaheads (?=\W|$) for detecting word boundaries (and the subtleties that this entails).
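As a rough illustration of that shape (the names and per-token patterns here are hypothetical, not the real parseReferences internals):

function tokenize(s) {
  // Split on whitespace; a real tokenizer might also peel off trailing punctuation.
  return s.split(/\s+/).filter((t) => t.length > 0);
}

function classifyToken(token) {
  // Anchored patterns: no word-boundary groups or lookaheads needed.
  if (/^@[\w-]+$/.test(token)) {
    return {refType: "BASIC", ref: token}; // username reference
  }
  if (/^#\d+$/.test(token)) {
    return {refType: "BASIC", ref: token}; // numeric issue/PR reference
  }
  if (/^https:\/\/github\.com\//.test(token)) {
    return {refType: "BASIC", ref: token}; // URL reference
  }
  return null; // not a reference
}

function findReferences(text) {
  return tokenize(text).map(classifyToken).filter((r) => r != null);
}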

We would also gain the ability to reason more clearly about interactions by modifying state as we traverse the token stream: for instance, the rule “if you see the token Paired followed by the token with, then the next token is considered to be a paired-with reference instead of a username reference” is easy to express, and eliminates the need for the funky multiset-difference in the current implementation. For parsing URL references, we would gain the ability to defer to an actual URL-parsing library to handle all the special cases that we’re likely to forget.
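A sketch of that single pass, again with hypothetical names and with the WHATWG URL parser standing in for “an actual URL-parsing library”:

function parseReferencesFromTokens(tokens) {
  const username = /^@[\w-]+$/;
  const refs = [];
  let sawPaired = false;  // previous token was "Paired"
  let pairedWith = false; // previous two tokens were "Paired with"
  for (const token of tokens) {
    if (username.test(token)) {
      refs.push({refType: pairedWith ? "PAIRED_WITH" : "BASIC", ref: token});
    } else if (/^https:\/\//.test(token)) {
      try {
        // Defer to a real URL parser; here, the fragment ("#@foo") is
        // discarded rather than misread as a separate reference.
        const url = new URL(token);
        refs.push({refType: "BASIC", ref: url.origin + url.pathname});
      } catch (_) {
        // Not a well-formed URL; not a reference.
      }
    }
    pairedWith = sawPaired && /^with$/i.test(token);
    sawPaired = /^paired$/i.test(token);
  }
  return refs;
}

No multiset difference is needed: the “Paired with” state simply changes how the very next username token is labeled.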

In summary, I suspect that the resulting system would be a simple state machine that is easy to understand.

Note that hand-writing a tokenizer is quite easy, and could plausibly require not much more than (s) => s.split(" ").

This change would be highly self-contained; only the parseReferences module would need to change. It also probably falls pretty low on the priority list: though I do believe that it would greatly improve the robustness of that module, that module’s robustness simply isn’t critical to us right now. In the long term, where we have the resources to expend on polishing all aspects of the system, this is definitely something that I would like to see. (Or, in other words: contributions welcome!)

cc decentralion


teamdandelion commented 6 years ago

Agreed on all counts.

Having a "contributions welcome" label seems useful, so I'm going to create it and apply it here.