olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

indexing and search on nested properties? #100

Closed dtchang closed 6 years ago

dtchang commented 10 years ago

Can one index and search on nested properties, e.g., { "employees": [{ "name": "...", "address": { "city": "...", "street": "..." } }] }

If so, what is the path expression to use when defining a field? I tried something like this.field('employees.address.city') and this.field('employees[].address.city'), but neither worked.

dtchang commented 10 years ago

It has been two weeks. Can someone knowledgeable about lunr respond? Thanks.

olivernn commented 10 years ago

It is currently not possible and it isn't something that will be solved by lunr.

A solution would be for your code to map whatever structure your documents/data are in into a flat object for lunr to index.

For example, say you have your employee object that looks like this:

{
    "id": 123,
    "name": "bob",
    "address": {
        "street": "some street",
        "city": "some city"
    }
}

You could set up your index like so:

idx = lunr(function () {
    this.field('name')
    this.field('address_street')
    this.field('address_city')
})

And set up a mapping function:

var employeeMapper = function (employee) {
    return {
        "id": employee.id,
        "name": employee.name,
        "address_street": employee.address.street,
        "address_city": employee.address.city
    }
}

Now let's say you have a collection of employees that you need to index; you can make use of your mapping function:

employees
    .map(employeeMapper)
    .forEach(function (doc) { idx.add(doc) })

The reason for this not being in lunr is that it is easy to solve outside of lunr and isn't really relevant to searching/indexing the documents.
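
Searching then works against the flattened field names; for example (a quick sketch, reusing the values from the example above):

idx.search('some city') // matches via the flattened address_city field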

matthiasg commented 6 years ago

Very sad to see this isn't inside lunr.js itself, since it increases the amount of RAM and CPU required and increases pressure on the garbage collector by having copies around (less so when text fields are large, but still very noticeable in my scenarios).

kkirby commented 6 years ago

@matthiasg Late reply, since I'm looking into this as well, but something I'd like to point out is that if Lunr were to do it for you, it would use more CPU and RAM than if you did it for Lunr.

matthiasg commented 6 years ago

@kkirby No worries about the delayed response. Sorry, but as for your comment, I don't follow yet.

When my code has to do it, I have to build a new shallow-copy object of all the fields to index:

original:

{
  id: "myid",
  data: {
    name: "myname",
    myotherinfo: "myotherinfo"
  }
}

new document:

{
  id: "myid",
  "data.name": "myname",
  "data.myotherinfo": "myotherinfo"
}

Now, admittedly, this data is just a shallow copy, but it is a new document which requires garbage collection.

The way I would see it implemented in an indexing engine (we currently do this ourselves, with some limitations): at indexing time we resolve the fields into an index (shown simplified below), e.g.

data_name_index: {
  "myname": ["myid"]
}

data_myotherinfo_index: {
  "myotherinfo: ["myid"]
}

It is similar in operation to how Elasticsearch flattens documents: there is minimal CPU overhead (if any) at indexing time and no GC overhead. The right index is again selected at query time, of course.
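
A minimal sketch of that idea (hypothetical code with made-up names like resolvePath and indexDocument, not lunr's implementation): the dotted path is resolved directly while indexing, so no flattened copy of the document is ever created.

// Hypothetical sketch, not lunr's implementation: resolve a dotted path
// directly while indexing, so no flattened copy of the document is created.
function resolvePath (doc, path) {
  return path.split('.').reduce(function (value, key) {
    return value == null ? undefined : value[key]
  }, doc)
}

var indexes = {}

function indexDocument (doc, fields) {
  fields.forEach(function (field) {
    var value = resolvePath(doc, field)
    if (value == null) return

    var index = indexes[field] || (indexes[field] = {})
    var postings = index[value] || (index[value] = [])
    postings.push(doc.id)
  })
}

indexDocument(
  { id: "myid", data: { name: "myname", myotherinfo: "myotherinfo" } },
  ["data.name", "data.myotherinfo"]
)
// indexes["data.name"] is now { "myname": ["myid"] }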

I might need to look at your code to see whether I am missing something though.

Thanks for your feedback.

kkirby commented 6 years ago

@matthiasg That's a good point. I suppose it all depends on how Lunr would end up storing the data. My thought process was that it would end up taking Lunr longer to detect and parse out the proper indexes based on your configuration vs having a flat tree of data. Though, I see what you mean about creating copies of objects and how that would trigger a GC, taking up CPU time.

olivernn commented 6 years ago

I think it's a fair point that there is an overhead in creating the documents in the format required for Lunr. How much of an overhead is difficult to know without some benchmarks/traces, etc. My guess is it would be insignificant compared to the amount of work in actually building the index, though.

@matthiasg If Lunr did support this, what would you propose the interface look like? The simplest approach would probably be to pass some kind of extractor function when defining the field, perhaps something like this:

this.field('name', {
  extractor: function (doc) {
    return doc.data.name
  }
})

An alternative might be something like JSONPath, though I think that would add a bunch of complexity that probably isn't worth it.

This would only require a small amount of extra state at build time; the index itself wouldn't need to change, I don't think. The changes would be isolated to lunr.Builder.
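
As a rough sketch (hypothetical, not lunr's actual Builder code; extractFieldValue is a made-up helper), the field-value lookup could fall back to plain property access when no extractor is configured:

// Hypothetical helper, not lunr's actual Builder code: use the field's
// extractor when one is configured, otherwise read the property directly.
function extractFieldValue (doc, fieldName, attributes) {
  if (attributes && typeof attributes.extractor === 'function') {
    return attributes.extractor(doc)
  }

  return doc[fieldName]
}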

matthiasg commented 6 years ago

An extractor function would allow for some very customized behavior, of course.

I would have simply made it default to follow the nested value, e.g. this.field('data.name'), treating data.name as the property name/path, meaning the lookup follows it as in:

const nestedValueFromObject = (object, nestedPropertyName) =>
  nestedPropertyName.split('.').reduce((value, propertyName) => value[propertyName], object)

The indexed field name would simply be data.name, including the dot.

EDIT: Of course the extractor function should also get the name of the field as a parameter. That way a generic function would be more easily implemented.
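
For illustration, a hypothetical sketch of that variant; note this is not the signature lunr ended up with, which (as discussed below) passes only the document:

// Hypothetical variant, not lunr's actual API: if the extractor also received
// the field name, one generic function could serve every nested field.
var genericExtractor = function (doc, fieldName) {
  return fieldName.split('.').reduce(function (value, propertyName) {
    return value == null ? undefined : value[propertyName]
  }, doc)
}

this.field('data.name', { extractor: genericExtractor })
this.field('data.myotherinfo', { extractor: genericExtractor })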

matthiasg commented 6 years ago

@olivernn I agree about the overhead, but we have developed a specific software platform where entire large data trees can be taken offline, with a minimum of thousands of objects and potentially tens of thousands locally.

The GC pressure for us is already pretty high in load/import scenarios, so we try to always use algorithms that don't add more load when it isn't necessary and when the alternatives are close in terms of implementation overhead.

olivernn commented 6 years ago

I would have simply made it default to follow the nested value

That's basically JSONPath then, right? You would also need to handle arrays, etc. I think this approach pushes a bunch of complexity into Lunr that I don't really want to maintain.

EDIT: Of course the extractor function should also get the name of the field as a parameter. That way a generic function would be more easily implemented.

You could still make the extractor function generic without passing it the field name:

var dataExtractor = function (fieldName) {
  return function (doc) {
    return doc.data[fieldName]
  }
}

this.field('name', { extractor: dataExtractor('name') })
this.field('myotherinfo', { extractor: dataExtractor('myotherinfo') })

olivernn commented 6 years ago

@matthiasg Unless it wasn't clear, I'm open to a pull request that adds this feature 😉

I'm not 100% sure on the name extractor yet, but I think passing an optional function that extracts the field from a document is a reasonable implementation.

olivernn commented 6 years ago

I went ahead and implemented the extractor function on fields. It is available in 2.3.0.
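
For anyone landing here later, a minimal usage sketch of the extractor option in 2.3.0, reusing the nested document shape from earlier in this thread:

var idx = lunr(function () {
  this.ref('id')
  this.field('name', {
    // extract the nested value instead of reading doc.name directly
    extractor: function (doc) { return doc.data.name }
  })

  this.add({ id: "myid", data: { name: "myname", myotherinfo: "myotherinfo" } })
})

idx.search('myname') // returns a result whose ref is "myid"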

matthiasg commented 6 years ago

@olivernn great !

Though I would still have passed the field name, since I now have to create a lot of functions (at least one for each unique name), which again will increase resource usage :) And this time it is even more difficult to understand the actual GC impact, since the code will be compiled and I am not sure how well browsers will collect the compiled code once it is no longer needed (I could not find good info on this yet).

I might still add this later in a PR when I get around to it.

olivernn commented 6 years ago

@matthiasg How many fields are you indexing?

increase resource usage

Of course a function instance will consume some resources; I think it's probably an insignificant amount though. That is, unless I'm missing something about your implementation?

As for the impact on the GC, the extractor functions will need to exist as long as the builder exists, so they should all be discarded when the builder goes out of scope.

Without data collected from profiling we're just speculating though, data wins here ;)

matthiasg commented 6 years ago

Data always wins :)

Since we have hundreds of forms (mostly user generated), we have an unspecified number of forms/fields.

But once more: I have no trouble believing this is a non-issue resource-wise, but when we are faced internally with two ways to implement something, one with short-lived objects or closures and one without, we try to choose the latter.

Again, thanks for this constructive discussion and even for implementing it (I did not expect that)!

tuanluu-agilityio commented 6 years ago

The extractor works fine:

this.field('country', {
  extractor: function (doc = {}) {
    return doc.address.country.name
  }
})

this.field('city', {
  extractor: function (doc = {}) {
    return doc.address.city_name
  }
})

this.field('commission', {
  extractor: function (doc = {}) {
    return doc.commission && doc.commission.value
  }
})

Thanks @olivernn

tibabit commented 4 years ago

What about position and index metadata here? How is that expected to be computed and added into the token metadata?