In search, a symbol for "any non-numeric character"

synchrony / smsn

Semantic Synchrony. An experiment in cognitive and sensory augmentation.

In search, a symbol for "any non-numeric character" #53

Open JeffreyBenjaminBrown opened 7 years ago

JeffreyBenjaminBrown commented 7 years ago

If I'm searching for "sum", I'll probably search for *sum*. Otherwise I'll miss "(sum" or "sum." or other punctuation-adjoined instances of the word. However, by using * I expand the search to include "sumo" and "assume" and other things I'm not looking for.

I googled for a while and still don't know whether it's possible.

joshsh commented 7 years ago

Probably not possible using Lucene syntax. In regular expressions, that would be a character set like [^A-Za-z], but I don't think Lucene supports regex. We could add support at the filter level if the use cases are compelling enough to justify breaking from Lucene.
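For illustration only, here is a minimal Python sketch of the regex idea, outside of Lucene. A word counts as a match when it is bounded on each side by a non-letter character or by the start or end of the text. The helper name `contains_word` is hypothetical and is not part of SmSn or Lucene. The letter class follows the [^A-Za-z] set mentioned above, so digits count as boundaries here.

```python
import re


def contains_word(text: str, word: str) -> bool:
    """Return True if `word` occurs in `text` with no ASCII letter directly before or after it."""
    # The lookarounds check the neighboring characters without consuming them,
    # so a match at the very start or end of the text also succeeds.
    pattern = r"(?<![A-Za-z])" + re.escape(word) + r"(?![A-Za-z])"
    return re.search(pattern, text) is not None
```

Under these assumptions, `contains_word("(sum of parts", "sum")` and `contains_word("the sum.", "sum")` hold, while "sumo" and "assume" are rejected.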

JeffreyBenjaminBrown commented 7 years ago

"support at the filter level"? Does that mean rewriting Lucene?

joshsh commented 7 years ago

No, it means defining a new, SmSn-specific query syntax, and mapping expressions in that syntax to Lucene syntax (then filtering on the results). Since Lucene syntax is so well-known, the pros (more expressive queries) would have to be pretty significant to outweigh the cons (a syntax everyone has to learn from scratch, and an implementation we have to develop, test, and maintain).
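A rough sketch of what "map to Lucene syntax, then filter on the results" could look like, written in Python for brevity. Everything here is an assumption made for illustration: the function names are hypothetical, and `make_fake_backend` is a stand-in that only imitates a wildcard search over a list of strings. It does not call Lucene or SmSn.

```python
import re
from typing import Callable, Iterable, List


def widen_to_wildcard(word: str) -> str:
    """Map a whole-word request to a broad wildcard query string such as *me*."""
    return f"*{word}*"


def filter_whole_word(candidates: Iterable[str], word: str) -> List[str]:
    """Keep only candidates where `word` is not adjoined by an ASCII letter."""
    pattern = re.compile(
        r"(?<![A-Za-z])" + re.escape(word) + r"(?![A-Za-z])", re.IGNORECASE
    )
    return [c for c in candidates if pattern.search(c)]


def whole_word_search(word: str, backend: Callable[[str], List[str]]) -> List[str]:
    """Run the broad query on the backend, then narrow the results in a second pass."""
    broad_results = backend(widen_to_wildcard(word))
    return filter_whole_word(broad_results, word)


def make_fake_backend(notes: List[str]) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for the real index: treats *x* as a substring search."""
    def backend(query: str) -> List[str]:
        needle = query.strip("*").lower()
        return [n for n in notes if needle in n.lower()]
    return backend
```

With notes such as "about me.", "some time" and "(me) and you", the broad query returns all three, and the filter step keeps only the first and last. The cost is that the backend still has to return every over-broad hit before the filter can discard it.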

JeffreyBenjaminBrown commented 7 years ago

I'm not (yet) suggesting any radical changes to Lucene, just a symbol for non-alphanumeric characters.

But while we're on the subject ...

I have nodes that look like this: practical & alone. {easy money}. The trailing period indicates that "practical & alone" is a complete thought. Then the bracketed expression provides an example of the content of the category. It would be valuable to me if I could search in only the first sentence, or search on the whole note but then rank according to the length of the first sentence rather than the whole note.
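To make the request concrete, here is a small Python sketch under one loud assumption: the "first sentence" of a note is everything before its first period. The function names are hypothetical and nothing here reflects how SmSn actually ranks results. It restricts matching to the first sentence and ranks hits by the length of that sentence, shortest first.

```python
from typing import List


def first_sentence(note: str) -> str:
    """Return the text before the first period (assumed convention), trimmed."""
    head, _, _ = note.partition(".")
    return head.strip()


def rank_by_first_sentence(notes: List[str], term: str) -> List[str]:
    """Keep notes whose first sentence contains `term`; shortest first sentence first."""
    term = term.lower()
    hits = [n for n in notes if term in first_sentence(n).lower()]
    # sorted() is stable, so notes with equal-length first sentences keep their input order.
    return sorted(hits, key=lambda n: len(first_sentence(n)))
```

For the note "practical & alone. {easy money}." this treats "practical & alone" as the searchable part, so a search for "money" would not match it.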

JeffreyBenjaminBrown commented 7 years ago

You suggested you want more use cases. Consider (I did this recently) searching for notes about yourself, by searching for (probably among other words) the word "me". If you want to find it when it's adjacent to punctuation, you'll have to surround it with * symbols. But to do that is to find every word with "me" in it, which is a humongous number of words.