rw251 / getset

An application for creating, validating, reusing and extending sets of clinical codes.
https://getset.herokuapp.com

Regex searching #11

Open richard-uk1 opened 4 years ago

richard-uk1 commented 4 years ago

It would be handy if I could perform regular expression searching in addition to literal text search. This would also solve #1 and potentially #7.

rw251 commented 4 years ago

I want to avoid the need for regex.

I agree that for some users it would be helpful. However, the whole point of term sets is that the end result is more easily understood than a code set by itself. A short list of terms is demonstrably quicker and easier to digest, and makes it easier to determine the author's intent. If, however, we allowed unlimited regex, then you could in theory create term sets that were unintelligible to people without regex experience, which is what I want to avoid.

The compromise is that the wildcard * is allowed. If you can provide examples of any regex you would like to see, then we can discuss further.
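
As an illustration of why the * compromise stays safe: a term can be compiled to a regex with every character except * escaped, so no other regex syntax leaks through. A minimal sketch in TypeScript, not necessarily how getset implements it:

// Minimal sketch (not necessarily getset's implementation): compile a term
// with * wildcards to a regex, escaping everything else so that no other
// regex syntax leaks through.
function wildcardToRegExp(term: string): RegExp {
    const source = term
        .split("*")
        .map(part => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
        .join("[\\s\\S]*"); // each * matches zero or more characters
    return new RegExp(source, "i");
}

// wildcardToRegExp("asthma*review").test("asthma annual review")  // true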

richard-uk1 commented 4 years ago

I understand your reasoning. As an alternative, I've made a personal shopping list of features I would like. It's up to you whether you think they would be useful to others.

  1. Understanding the word boundary rules
    1. What constitutes a word boundary/whitespace (e.g. is '-' included?)
    2. Do the words have to occur in order (e.g. would the search string foo bar match the rubric bar foo)?
    3. Do the words have to occur concurrently (e.g. would the search string foo baz match the rubric foo bar baz)?
    4. Do words match in a "LIKE %test%" style (e.g. would the search string bar match the rubric foobarbaz)?
    5. How are characters treated when they are neither a letter/number nor a word boundary (examples are '#', '_', '+', etc.)? I think I would prefer it if they were treated exactly the same as letters.
    6. How are non-ASCII characters treated? I have seen these in the Salford data (e.g. squared ²). I would guess that applying Unicode normalization would be enough.
  2. The ability to have a literal quote which must match exactly (including whitespace). For example, the search string "foo bar" would match bazfoo bar but not foo bazbar.
  3. The ability to match zero or more characters using the * wildcard (except in literal quotes "").
  4. Possibly the option to match a single character (usually with ?). To be honest I don't see myself using this much.

I don't think I would need escaping (e.g. a literal "?") because I can always use a quoted literal.

I think my ideal algorithm would be:

  1. Split the search string into tokens. Tokens are separated by whitespace or delimited by double quotes, for example:
    1. foo bar -> foo, bar
    2. "foo bar" -> foo bar
    3. foo*bar -> foo<wildcard>bar
    4. (quoted special char) "foo*bar" -> foo*bar
    5. (" always creates a boundary) "foo bar"baz -> foo bar, baz
    6. (odd number of ") "foo bar" "baz -> foo bar, baz (treat the unclosed last literal as a literal to the end of the input string). An alternative would be to reject the search string as invalid.
  2. For a code to match, the rubric (including other searchable text if it exists) must contain all tokens (expanding wildcards) in any order. It doesn't matter whether the token is found at the end or middle of a word, or if it is the whole word. Case insensitive.

That would match the behaviour I would intuitively expect. But of course, other people might have different intuition. Also, it might be that you already specified all this when writing the papers you already published, so it's difficult to change them. Definitely think of my suggestions as opinion rather than factually correct.

EDIT I've edited this comment rather a lot, all on 2020-07-07.

richard-uk1 commented 4 years ago

Implementing the algorithm could either be done by converting to a regex, taking care to escape literal characters that have a meaning in regex, or with something like the following (using TypeScript to show the typed data structures):

// from https://github.com/orling/grapheme-splitter
import GraphemeSplitter from "grapheme-splitter";

const splitter = new GraphemeSplitter();

type CharType = "wildcard" | "literal";

interface Char {
    ty: CharType;
    ch: string | null; // a single grapheme cluster for literals, null for wildcards
}

interface Token {
    chars: Char[];
}

function lit(ch: string): Char {
    return { ty: "literal", ch };
}

function wild(): Char {
    return { ty: "wildcard", ch: null };
}

function parse(input: string): Token[] {
    let tokenBuf: Char[] = [];
    let inLiteral = false;
    const tokens: Token[] = [];
    // flush the current buffer into the token list, if it is non-empty.
    const flush = () => {
        if (tokenBuf.length > 0) {
            tokens.push({ chars: tokenBuf });
            tokenBuf = [];
        }
    };
    for (const ch of splitter.splitGraphemes(input)) {
        if (ch === "\"") {
            // a quote always creates a boundary, whether it opens or closes a literal.
            flush();
            inLiteral = !inLiteral;
        } else if (inLiteral) {
            // we're in a literal so we push `ch` whatever it is.
            tokenBuf.push(lit(ch));
        } else if (ch === "*") {
            tokenBuf.push(wild());
        } else if (isWhitespace(ch)) {
            // whitespace ends the current token, if there is one.
            flush();
        } else {
            // it's a normal character, add it to the current word.
            tokenBuf.push(lit(ch));
        }
    }
    // an unclosed trailing literal is kept as a literal to the end of the input.
    flush();
    return tokens;
}

function isWhitespace(input: string): boolean {
    // TODO decide what constitutes whitespace
    return /^\s$/.test(input);
}

function test(matcher: Token, input: string): boolean {
    // TODO write a deterministic finite automaton, either directly or using regex.
    throw new Error("unimplemented");
}
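
For completeness, here is a sketch of how test could go down the regex route mentioned above: each token is compiled to a case-insensitive regex (escaping literal graphemes, expanding wildcards), and a rubric matches when every token's regex matches. The helper names are illustrative, not from getset.

// Sketch of the regex route for `test` (helper names are illustrative):
// compile one token to a regex, escaping literals and expanding wildcards.
function tokenToRegExp(token: Token): RegExp {
    const source = token.chars
        .map(c =>
            c.ty === "wildcard"
                ? "[\\s\\S]*" // a wildcard matches zero or more characters
                : (c.ch ?? "").replace(/[.*+?^${}()|[\]\\]/g, "\\$&") // escape the literal
        )
        .join("");
    return new RegExp(source, "i"); // case insensitive, per step 2 above
}

// A rubric matches when it contains every token, in any order.
function matchesAllTokens(tokens: Token[], rubric: string): boolean {
    return tokens.every(token => tokenToRegExp(token).test(rubric));
}
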
rw251 commented 4 years ago

I've raised issue #17 to make sure that, whatever the search strategy, it is at least documented for users. I've responded to most of your wish list there, as it is a great starting point for the documentation.

The algorithm is almost exactly as you describe, so that is reassuring. The differences are:

(quoted special char) "foo*bar" -> foo*bar

I'm assuming there is never a need to search for a literal * character, so we treat * as a wildcard even when quoted. Quotes effectively allow you to "fix" the word order, and being able to still use wildcards within that is useful - at least more useful than being able to search for a *.

(" always creates a boundary) "foo bar"baz -> foo bar, baz
(odd number of ") "foo bar" "baz -> foo bar, baz (treat the unclosed last literal as a literal to the end of the input string). An alternative would be to reject the search string as invalid.

I'd probably just treat them as invalid and flag this to the user.
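
Sketched as a pre-parse check (my naming, illustrative only): an odd number of double quotes means an unclosed literal, so the search string is rejected.

// Illustrative pre-parse validation (my naming, not getset's code):
// an odd number of double quotes means an unclosed quoted literal.
function hasBalancedQuotes(input: string): boolean {
    return (input.match(/"/g) ?? []).length % 2 === 0;
}

// hasBalancedQuotes('"foo bar" "baz')  // false: flag to the user as invalid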

For a code to match, the rubric (including other searchable text if it exists) must contain all tokens (expanding wildcards) in any order. It doesn't matter whether the token is found at the end or middle of a word, or if it is the whole word. Case insensitive.

Yes - apart from the middle of a word. I did a lot of performance testing, and although character-level searching can be made pretty fast, there are some cases where it's unacceptably slow, whereas word-level searching is a lot faster in virtually all situations. Therefore the default search is whole-word based. Middle-of-word searching is still possible if, for example, your search term is *foo*; it's just not the default behaviour.
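
To illustrate the performance argument (my own sketch, not getset's code): with word-level matching, the rubric's words can be indexed once, so each plain term becomes a set lookup rather than a scan over every character position.

// My own sketch of word-level matching (not getset's code): index the
// rubric's words once, then each plain search term is a hash lookup.
function wordSet(rubric: string): Set<string> {
    return new Set(
        rubric.toLowerCase().split(/\s+/).filter(w => w.length > 0)
    );
}

function matchesWholeWords(terms: string[], rubric: string): boolean {
    const words = wordSet(rubric);
    return terms.every(term => words.has(term.toLowerCase()));
}

// matchesWholeWords(["asthma"], "Asthma annual review")  // true
// matchesWholeWords(["foo"], "foobar")  // false: mid-word hits need *foo*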