mozilla / elasticutils

[deprecated] A friendly chainable ElasticSearch interface for python
http://elasticutils.rtfd.org
BSD 3-Clause "New" or "Revised" License

Add efficient nested document and array update mechanism that is append friendly. #213

Closed jmizgajski closed 10 years ago

jmizgajski commented 10 years ago

Consider the following scenario:

We have a:

[post] -1-------N- [comment]

relationship in the Django ORM. We would like to search (or MLT) posts based on the comments they include, and therefore we need the post document to include them as an array of nested documents or strings.

Obviously we need the post index to update whenever a new comment is added.

How it can be achieved right now:

  1. We add a document extraction that serializes all comments into an array attached to a post.
  2. We add an indexing action on the post MappingType whenever a comment is saved via the signals API.
  3. We iterate through all comments and recreate the array.
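
A minimal sketch of the three steps above, to make the cost visible. Everything here is hypothetical and for illustration only: `Post`, `Comment`, `extract_post_document`, and `on_comment_saved` are invented names, and a plain dict stands in for the Elasticsearch index. It is not the ElasticUtils or Django API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Comment:
    id: int
    text: str


@dataclass
class Post:
    id: int
    title: str
    comments: List[Comment] = field(default_factory=list)


def extract_post_document(post: Post) -> Dict:
    """Serialize a post together with ALL of its comments (steps 1 and 3)."""
    return {
        "id": post.id,
        "title": post.title,
        # Every call walks the complete comment list again.
        "comments": [{"id": c.id, "text": c.text} for c in post.comments],
    }


def on_comment_saved(post: Post, comment: Comment, index: Dict[int, Dict]) -> None:
    """Hypothetical stand-in for a post-save signal handler (step 2)."""
    post.comments.append(comment)
    # The whole post document is rebuilt and re-sent on every comment save.
    index[post.id] = extract_post_document(post)
```

With N comments on a post, each new comment triggers a read of all N comments and a rewrite of the full document, which is the behaviour the issue is about.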

Now imagine that the number of comments is very large and comments are saved very frequently. This overloads both our DB and Elasticsearch: one is handling expensive queries all the time and the other is reindexing all the time.

How this should work:

  1. We define an appendable field.
  2. We define single element extraction.
  3. We define a signal that extracts the comment using step 2 and appends it to a provided field.
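
One way the proposed flow could look, as a rough sketch rather than an existing ElasticUtils feature. It only builds the request body for an Elasticsearch scripted update; `extract_comment` and `build_append_update` are hypothetical names, the script uses the Painless syntax of recent Elasticsearch versions (older versions used different scripting languages), and it assumes the target field already exists as an array in the stored document.

```python
from typing import Any, Dict


def extract_comment(comment_id: int, text: str) -> Dict[str, Any]:
    """Single element extraction (step 2): serialize one comment only."""
    return {"id": comment_id, "text": text}


def build_append_update(field_name: str, element: Dict[str, Any]) -> Dict[str, Any]:
    """Build an update body that appends `element` to an array field (step 3).

    Assumes `field_name` is already present as an array in the document.
    """
    if not field_name.isidentifier():
        # Keep arbitrary text out of the script source.
        raise ValueError("field name must be a simple identifier")
    return {
        "script": {
            "lang": "painless",
            "source": "ctx._source.%s.add(params.element)" % field_name,
            # The new element travels as a parameter, not inside the script text.
            "params": {"element": element},
        }
    }
```

A signal handler would send this body to the update endpoint of the post document. This avoids reading every comment from the database, but Elasticsearch still reindexes the whole document internally when it applies the update.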

P.S. I'm willing to help out to enable this since we need it very badly.

willkg commented 10 years ago

The index scaffolding in ElasticUtils is really rough and designed to meet common use cases. This use case is probably not that common.

I think you should fix your issue for your situation. I'd be interested in seeing how you solve it, but I'm on the fence about whether it's a feature that needs to be in ElasticUtils.

Does anyone else think this needs to be in ElasticUtils? Is there a way we could tweak the existing API to make fixing this easier for someone, but not otherwise fix it in ElasticUtils?

jmizgajski commented 10 years ago

I think that an ability to define custom update actions would be useful in many cases. It would be all that is needed to implement this in a minimalistic way, since we already have all other parts.

I've read a bit about appending in ES and it seems that it would only solve the issue with the DB, since ES reindexes the whole document anyway. Another solution would be to use parent-child documents (can it be done in ElasticUtils?), but I'm not sure what the limitations of this solution are going to be.

Btw, love your work. We are currently switching to your solution from Haystack, hope it works out :)

tomgruner commented 10 years ago

parent-child seems like the elasticsearch solution - this would be much more efficient for elasticsearch.

Elasticsearch has to reindex the whole document whenever any field value changes, since documents are stored in immutable segments: the old version is marked as deleted and a new one is written. So you could end up making Elasticsearch do a lot of extra work and I/O if you frequently modify an array.

The less i/o intensive solution would be adding or updating child documents as needed.

You can read a bit about this here:

https://www.found.no/foundation/keeping-elasticsearch-in-sync/#the-problems-of-too-frequent-updates-and-non-batch-updates

Also this is interesting reading http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html#store-throttling

That said - I don't know what options are in elasticutils for parent/child
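
Independent of ElasticUtils, the parent/child approach can be sketched roughly as follows. This only builds request pieces as plain Python data and assumes the Elasticsearch 1.x-era parent/child API (a `_parent` entry in the child mapping and a `parent` request parameter); later Elasticsearch versions replaced this with a `join` field, so check the documentation for your version. The index name, type names, and helper functions are hypothetical.

```python
from typing import Any, Dict, Tuple

# Mapping fragment: each "comment" document declares a "post" parent.
COMMENT_MAPPING = {"comment": {"_parent": {"type": "post"}}}


def build_child_index_request(
    index: str, post_id: int, comment_id: int, text: str
) -> Tuple[str, str, Dict[str, str], Dict[str, Any]]:
    """Return (method, path, query params, body) for indexing one comment."""
    path = "/%s/comment/%d" % (index, comment_id)
    # The parent id routes the child to the same shard as its post.
    params = {"parent": str(post_id)}
    return "PUT", path, params, {"text": text}


def build_has_child_query(text: str) -> Dict[str, Any]:
    """Search body for posts that have at least one matching comment."""
    return {
        "query": {
            "has_child": {
                "type": "comment",
                "query": {"match": {"text": text}},
            }
        }
    }
```

With this layout, saving a comment indexes one small child document and leaves the post document untouched.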

jmizgajski commented 10 years ago

Great answer! Thank you! That's exactly what I did in my Issue for Scalable and testable django support