richard-uk1 opened 4 years ago
I want to avoid the need for regex.
I agree that for some users it would be helpful. However, the whole point of term sets is that the end result is more easily understood than a code set by itself. A short list of terms is demonstrably quicker and easier to digest, and makes it easier to determine the author's intent. If, however, we allowed unlimited regex then you could in theory create term sets that were unintelligible to people without regex experience, which is what I want to avoid.
The compromise is that the wildcard * is allowed. If you can provide examples of any regex you would like to see then we can discuss further.
I understand your reasoning. As an alternative, I've made a personal shopping list of features I would like. It's up to you whether you think they would be useful to others.
- foo bar should match the rubric bar foo
- foo baz should match the rubric foo bar baz
- bar should match the rubric foobarbaz
- ² and similar characters - I would guess that applying unicode normalization would be enough (quick check below)
- "foo bar" would match bazfoo bar but not foo bazbar
- * (i.e. .*) as a wildcard (except in literal quotes "")
- ? (i.e. .?). To be honest I don't see myself using this much.
- I don't think I would need escaping (e.g. a literal "?") because I can always use a quoted literal.
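On the ² point, here is a quick check of what unicode normalization would give (just a sketch of the idea using NFKC, not a suggestion for how it should be wired in):

// NFKC folds compatibility characters into their plain forms,
// so a superscript two compares equal to a plain "2" after normalization.
console.log("²".normalize("NFKC") === "2");      // true
console.log("ﬁle".normalize("NFKC") === "file"); // true (the "ﬁ" ligature expands)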
I think my ideal algorithm would be:
- foo bar -> foo, bar
- "foo bar" -> foo bar
- foo*bar -> foo<wildcard>bar
- "foo*bar" -> foo*bar
- (" always creates a boundary) "foo bar"baz -> foo bar, baz
- (odd number of ") "foo bar" "baz -> foo bar, baz (treat the unclosed last literal as a literal to the end of the input string). An alternative would be to reject the search string as invalid.

That would match the behaviour I would intuitively expect. But of course, other people might have different intuition. Also, it might be that you already specified all this when writing the papers you already published, so it's difficult to change them. Definitely think of my suggestions as opinion rather than factually correct.
EDIT I've edited this comment rather a lot, all on 2020-07-07.
Implementing the algorithm could either be done by converting to a regex, taking care to escape literal characters that have a meaning in regex, or with something like the following (using TypeScript to show the typed data structures):
// from https://github.com/orling/grapheme-splitter
import GraphemeSplitter from "grapheme-splitter";

const splitter = new GraphemeSplitter();

interface Token {
  chars: Char[];
}

interface Char {
  ty: CharType;
  ch: string | null; // a single grapheme cluster for a literal, null for a wildcard.
}

type CharType = "wildcard" | "literal";

function lit(ch: string): Char {
  return {
    ty: "literal",
    ch
  };
}

function wild(): Char {
  return {
    ty: "wildcard",
    ch: null
  };
}

function parse(input: string): Token[] {
  let tokenBuf: Char[] = [];
  let inLiteral: boolean = false;
  const tokens: Token[] = [];
  for (const ch of splitter.splitGraphemes(input)) {
    if (ch === "\"") {
      if (inLiteral) {
        // end of literal
        if (tokenBuf.length > 0) {
          tokens.push({ chars: tokenBuf });
          tokenBuf = [];
        }
        inLiteral = false;
      } else {
        // start of literal
        if (tokenBuf.length > 0) {
          tokens.push({ chars: tokenBuf });
          tokenBuf = [];
        }
        inLiteral = true;
      }
    } else if (inLiteral) {
      // we're in a literal so we push `ch` whatever it is.
      tokenBuf.push(lit(ch));
    } else if (ch === "*") {
      tokenBuf.push(wild());
    } else if (isWhitespace(ch)) {
      // start a new token if we haven't already.
      if (tokenBuf.length > 0) {
        tokens.push({ chars: tokenBuf });
        tokenBuf = [];
      }
    } else {
      // It's a normal character, add it to the current word.
      tokenBuf.push(lit(ch));
    }
  }
  // An unclosed trailing literal runs to the end of the input string.
  if (tokenBuf.length > 0) {
    tokens.push({ chars: tokenBuf });
  }
  return tokens;
}

function isWhitespace(input: string): boolean {
  // TODO decide what constitutes whitespace
  return /^\s$/.test(input);
}

function test(matcher: Token, input: string): boolean {
  // TODO write a deterministic finite automaton, either directly or using regex.
  throw new Error("unimplemented");
}
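One way test could go, following the "convert to a regex" route above - this is only a sketch to show the shape (untested, and not necessarily how the real search should work): escape literal graphemes, turn wildcards into .*, and check that every token appears somewhere in the rubric.

// Build a case-insensitive regex for one token, escaping regex metacharacters
// in literal graphemes and expanding wildcards to ".*".
function tokenToRegex(token: Token): RegExp {
  const body = token.chars
    .map(c => (c.ty === "wildcard" ? ".*" : c.ch!.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")))
    .join("");
  return new RegExp(body, "iu");
}

// A rubric matches if it contains every token, in any order.
function matches(tokens: Token[], rubric: string): boolean {
  return tokens.every(token => tokenToRegex(token).test(rubric));
}

// Usage: '"foo bar" baz' parses to two tokens, so the rubric must contain
// "foo bar" somewhere and "baz" somewhere, in any order.
matches(parse('"foo bar" baz'), "baz and foo bar"); // true
matches(parse("foo*bar"), "food at the barn");      // true (the wildcard spans " at the ")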
I've raised issue #17 for making sure that, whatever the search strategy is, it is at least documented for users - I've responded to most of your wish list there as it is a great starting point for the documentation.
The algorithm is almost exactly as you describe, so that is reassuring. The differences are:
(quoted special char) "foo*bar" -> foo*bar
I'm assuming there is never a need to search for a * character literal, therefore we treat * as a wildcard even when quoted. Quotes effectively allow you to "fix" the word order, and being able to still use wildcards within that is useful - at least it's more useful than being able to search for a *.
(" always creates a boundary) "foo bar"baz -> foo bar, baz
(odd number of ") "foo bar" "baz -> foo bar, baz (treat the unclosed last literal as a literal to the end of the input string). An alternative would be to reject the search string as invalid.
I'd probably just treat them as invalid and flag to the user.
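Flagging it would just be a check on the raw search string before parsing - something along these lines (a sketch, not the actual validation code):

// Reject search strings containing an odd number of " characters.
function hasUnclosedQuote(input: string): boolean {
  return (input.match(/"/g) ?? []).length % 2 !== 0;
}

hasUnclosedQuote('"foo bar" "baz'); // true -> report the search string as invalid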
For a code to match, the rubric (including other searchable text if it exists) must contain all tokens (expanding wildcards) in any order. It doesn't matter whether the token is found at the end or middle of a word, or if it is the whole word. Case insensitive.
Yes - apart from the middle of a word. I did a lot of performance testing, and although you can get it pretty good with character searching, there are some cases where it's unacceptably slow, and it's a lot faster in virtually all situations if the searching is at the word level. Therefore the default search is whole-word based. Middle-of-word searching is still possible if e.g. your search term is *foo*, it's just not the default behaviour.
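To illustrate the whole-word default versus the *foo* form (a rough regex analogy, not the actual search code):

// Default: a token only matches as a whole word in the rubric.
/\bfoo\b/i.test("foo bar");   // true
/\bfoo\b/i.test("foobarbaz"); // false

// Wrapping the term in wildcards (*foo*) opts back in to mid-word matching.
/.*foo.*/i.test("foobarbaz"); // true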
It would be handy if I could perform regular expression searching in addition to literal text search. This would also solve #1 and potentially #7.