pipedown / noise

Nested Object Inverted Search Engine
https://noisesearch.org/
Apache License 2.0

Wildcard search didn't work as expected #30

Open tleyden opened 7 years ago

tleyden commented 7 years ago

Disclaimer: I didn't read the documentation :-)

I searched for:

find
    {name: ~= "geo*"}
return
    .

and got results:

{
  "_id": "14153",
  "cast": [],
  "episodes": [
    {
      "airdate": "2016-03-05",
      "airtime": "07:00",
      "name": "Funky Feathers",
      "number": 1,
      "runtime": 120,
      "season": 1,
      "summary": "<p>Brainteasers, wacky animal facts, hip-hopping birds, animated adventures, and a cup of Joe with your favorite vet.</p>",
      "url": "http://www.tvmaze.com/episodes/650336/nat-geo-wild-kids-1x01-funky-feathers"
    },
    {
      "airdate": "2016-03-12",
      "airtime": "07:00",
      "name": "Jungle Jamboree",
      "number": 2,
      "runtime": 120,
      "season": 1,
      "summary": "<p>Explore bizarre creatures in Wonderfully Weird; get inspired by the wildlife rescue team on Bandit Patrol; special guests in Dr. Pol Coffee Breaks; cute and cuddly animal buddies in Unlikely Animal Friends.</p>",
      "url": "http://www.tvmaze.com/episodes/650337/nat-geo-wild-kids-1x02-jungle-jamboree"
    },

I was expecting only results whose name starts with "Geo", like "George".

vmx commented 7 years ago

First of all, wildcard search is not supported yet. Now to the details:

What you show there are the episodes, but the match was on the title of the show. If we return only the name, the query becomes:

find
    {name: ~= "geo*"}
return
    .name

Which returns

"Nat Geo Wild Kids"
"Geo Bee"

What happened with geo* is that it got stemmed to geo, and hence it matches the titles shown above, which both contain geo as a whole word.
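To illustrate the behaviour described above, here is a minimal Python sketch of whole-token matching. It is not Noise's actual analyzer: the `tokenize` and `matches` helpers are hypothetical, the normalizer only lowercases and splits on non-alphanumeric ASCII characters, and no Snowball stemming is applied. "George" is included only because it was the example from the original report.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not an ASCII letter or digit.

    Toy stand-in for a real analyzer: no stemming happens here.
    """
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def matches(query, field):
    """Return True if every query token appears as a whole token in the field."""
    field_tokens = set(tokenize(field))
    return all(tok in field_tokens for tok in tokenize(query))


# The "*" is discarded during tokenization, so the query is just the term "geo".
names = ["Nat Geo Wild Kids", "Geo Bee", "George"]
hits = [n for n in names if matches("geo*", n)]
print(hits)
```

Under these assumptions the trailing `*` carries no meaning, so the query only finds titles that contain the standalone word "geo" and never behaves like a prefix search.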

OSHistory commented 6 years ago

First off: great project. I am thinking about using it as a backend for my mainly text-based research, so I am also interested in this issue.

Are wildcards or regex on the roadmap? Perhaps you could also briefly elaborate on the following: Which stemmer is used (and for which language)? What is the best way to proceed when trying to glob or regex?

thx

vmx commented 6 years ago

@OSHistory Wildcards are on the roadmap, but sadly there's a huge lack of time, hence I don't know when this will happen.

The stemmer currently used is just a Rust wrapper around Snowball. We don't do any language-specific things yet, so you get whatever Snowball does.

Adding wildcard/regex is non-trivial. Perhaps @Damienkatz could give a brief overview of what he had in mind in that regard.

OSHistory commented 6 years ago

Thanks for the reply. I can imagine that regex support is a huge task to implement. I think I would be happy if the Snowball stemmer supported something other than English. And indeed, in stem.rs one can simply change the language.

It compiles fine. However, since I have no Rust experience, I am a little bit lost on how to include it in my local npm installation to test it on my sample data, which is in German. I would be grateful for hints on how to do it or where to start.

Perhaps an option to specify a language for the stemmer on index creation might substantially increase flexibility for non-English use cases? Something along the lines of:

let index = noise.open("myindex", true, { "lang": "german" });

Most use cases should operate on a single language.
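As a rough sketch of how such an option might be resolved, the Python below picks a stemmer language for an index. Everything in it is an assumption made for illustration: `resolve_language`, `SUPPORTED_LANGUAGES`, and `DEFAULT_LANGUAGE` are hypothetical names, the language set is an arbitrary subset, and none of this reflects Noise's actual API. It also assumes the language would be recorded at index creation, since terms stemmed with one language would generally not match queries stemmed with another.

```python
# Hypothetical subset of Snowball language names, for illustration only.
SUPPORTED_LANGUAGES = {"english", "german", "french"}
DEFAULT_LANGUAGE = "english"


def resolve_language(stored, options=None):
    """Pick the stemmer language for an index.

    `stored` is the language recorded when the index was created
    (None for a new index); `options` mirrors the proposed {"lang": ...}.
    """
    requested = (options or {}).get("lang")
    if requested is not None:
        requested = requested.lower()
        if requested not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported stemmer language: {requested}")
    if stored is None:
        # New index: take the requested language or fall back to the default.
        return requested or DEFAULT_LANGUAGE
    if requested is not None and requested != stored:
        # Existing terms were stemmed with `stored`; refuse to mix languages.
        raise ValueError(f"index was created with {stored!r}, not {requested!r}")
    return stored


print(resolve_language(None, {"lang": "german"}))
print(resolve_language("german"))
```

Defaulting to English when no option is given would keep existing indexes working unchanged, which seems like the least surprising choice for a single-language setup.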

vmx commented 6 years ago

@OSHistory Could you please open another issue for supporting other languages as an option? This way it won't get lost that easily.

OSHistory commented 6 years ago

@vmx Sure, I was thinking the same thing while writing...