mozilla / elasticutils

[deprecated] A friendly chainable ElasticSearch interface for python
http://elasticutils.rtfd.org
BSD 3-Clause "New" or "Revised" License
243 stars 76 forks

Scalable and testable support for Django #235

Open jmizgajski opened 10 years ago

jmizgajski commented 10 years ago

Hi there,

Our project is quite a stretch for elasticutils, since documents are very frequently updated and inserted and everything has to be testable, including things done via elasticutils in Elasticsearch. That's why we took some time to extend elasticutils with infrastructure that meets these requirements, which we would like to give back to the community.

The purpose of this issue is to discuss our approach and suggest areas that have to be improved before we do a legitimate pull request with this (major) chunk of functionality.

Here you can find all our code with quite thorough coverage.

Once we reach an agreement I'll remove all dependencies (mainly utils, and our fork of hot_redis) and do a proper PR.

Below you will find a quick overview of our approach, with references that explain some of the architectural decisions.

BatchIndexable

This class is intended to replace the MappingType/Indexable pair when integrating with Django.

DjangoElasticsearchCase

This is a very convenient TransactionTestCase subclass that prefixes all indexes defined by BatchIndexables, makes sure they exist during tests, and cleans up after itself.

When you need to make sure all documents have been indexed before moving forward in your test, it provides a refresh_all() method that ensures everything you did with BatchIndexables is reflected in Elasticsearch.
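The idea can be sketched framework-free. PrefixedIndexTestCase, REGISTERED_INDEXES, and the prefix below are hypothetical stand-ins for the real DjangoElasticsearchCase machinery, not its actual API:

```python
import unittest

# Hypothetical stand-in for the registry of BatchIndexable index names;
# in the real code these would come from the registered mapping types.
REGISTERED_INDEXES = ["users", "items"]


class PrefixedIndexTestCase(unittest.TestCase):
    """Sketch of the DjangoElasticsearchCase idea: every registered index
    gets a test-only prefix so tests never touch production indexes."""

    prefix = "test_"

    def setUp(self):
        # Map each registered index to its prefixed test counterpart.
        # A real implementation would also create these indexes here.
        self.test_indexes = {name: self.prefix + name
                             for name in REGISTERED_INDEXES}

    def tearDown(self):
        # A real implementation would delete the prefixed indexes here.
        self.test_indexes.clear()

    def refresh_all(self):
        # A real implementation would hit the Elasticsearch refresh API for
        # every prefixed index so pending writes become visible to queries.
        return sorted(self.test_indexes.values())

    def test_indexes_are_prefixed(self):
        self.assertEqual(self.test_indexes["users"], "test_users")
```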

es_setup

A collection of helper methods used by DjangoElasticsearchCase and by a management command you can run when deploying your application to production:

from django.conf import settings
from django.core.management.base import BaseCommand, CommandError


class Command(BaseCommand):

    def handle(self, *args, **options):
        if not getattr(settings, 'ES_URLS', None):
            raise CommandError('Elasticsearch url not found')

        # TODO: make deleting registered indices optional

        # delete all indices defined for BatchIndexables
        delete_registered_indices()

        # set up all indices defined for BatchIndexables
        set_up_registered_indices_and_types()

        # index all documents returned by get_indexable() for each
        # BatchIndexable
        index_all_documents(blocking=True)

        # wait for all indices to finish indexing
        refresh_registered_indices()

indexing_queues

These consist of a regular queue used for handling inserts and an id set used for handling frequent updates. They are based on hot_redis data types but can easily be abstracted to use any persistence/cache layer, as long as it supports queue and set data types.

The architecture is based on this post.

In short:

Inserting is handled by queues that keep pickled documents (the return value of BatchIndexable.extract_document(obj_id, obj=None)). Since the full model is usually available in the post_save signal, extracting the document in the receiver saves an extra DB request.
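A rough in-memory sketch of this insert path, with collections.deque standing in for the hot_redis queue and extract_document reduced to a toy (all names here are illustrative, not the real API):

```python
import pickle
from collections import deque

# In-memory stand-in for the hot_redis queue used in the real code.
insert_queue = deque()


def extract_document(obj_id, obj=None):
    # Toy extractor; in elasticutils this would be the mapping type's
    # extract_document(obj_id, obj).
    return {"id": obj_id, "title": obj["title"]}


def on_post_save(obj_id, obj):
    # post_save receiver: the model instance is already in memory, so
    # extracting here saves an extra DB fetch on the consumer side.
    insert_queue.append(pickle.dumps(extract_document(obj_id, obj)))


def flush_inserts():
    # Consumer side: drain the queue and hand the whole batch to the
    # bulk API in one go.
    docs = []
    while insert_queue:
        docs.append(pickle.loads(insert_queue.popleft()))
    return docs  # a real implementation would bulk-index these
```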

Updating is handled by sets that keep the ids of models that have been updated. This is a robust way of handling denormalized data (which is what you usually end up with in Elasticsearch), such as counters, arrays, or nested documents.

This is best explained with an example. Let's say we have a post mapping type which has a comments counter (used for boosting) and a lot of text. Normally you would receive a post_save signal when a comment is created and call extract_document() on the post, which would mean analyzing all the text just to increment the comments counter by one. Now imagine each post gets hundreds of comments per second. With our approach, the id that is already available in the post_save of the comment is added to a set (marked as dirty). The set is processed periodically (for example, every 30 seconds), and all dirty documents are extracted and indexed using the batch API. We go from possibly hundreds of thousands of indexing requests to a few Elasticsearch batch requests, and from hundreds of thousands of additional DB requests to just a few.
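The update path from this example can be sketched the same way; a plain Python set stands in for the hot_redis id set, and extract is whatever produces the full document (names are illustrative):

```python
# In-memory stand-in for the hot_redis id set used for updates.
dirty_post_ids = set()


def on_comment_saved(post_id):
    # post_save receiver for Comment: only mark the parent post dirty;
    # no document extraction and no extra DB request happen here.
    dirty_post_ids.add(post_id)


def flush_dirty_posts(extract):
    # Periodic task (e.g. every 30 seconds): each dirty post is extracted
    # exactly once, no matter how many comments arrived in the meantime,
    # and the batch would then go to Elasticsearch's bulk API.
    ids = sorted(dirty_post_ids)
    dirty_post_ids.clear()
    return [extract(post_id) for post_id in ids]
```

A thousand comment saves spread over three posts collapse into three extract calls per flush, which is where the savings in the paragraph above come from.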

jmizgajski commented 10 years ago

some reply would be nice;)

willkg commented 10 years ago

In skimming this, this is a huge project with a lot of bits and it seems like it rearchitects portions of ElasticUtils, too. I don't have time to work through this right now. I don't know offhand when I'll be able to.

jmizgajski commented 10 years ago

It does not break elasticutils in any way (unless we assume BatchIndexable would replace Indexable); it builds upon it and reuses most of what has already been done in terms of Django integration.

Maybe some other collaborators would be willing to work through it? Delaying it will only make the merge much harder as elasticutils and my code progress.

jmizgajski commented 10 years ago

Hi there, I've just noticed some asynchronicity issues in the current version (they came up in production), so you can refrain from reviewing it until I fix them.

Best of luck

jmizgajski commented 10 years ago

An updated version (working in production, fully distributed, with heartbeat support for flushing document queues into elasticsearch) can be found here

To define heartbeat tasks you have to:

  1. Define individual tasks for each mapping (since you may want to flush them at different frequencies), e.g. for user and item mappings. These tasks support distributed processing, but unless your influx is higher than a few thousand documents per second, I would advise sticking with synchronous tasks for keeping Elasticsearch in sync (it lowers Celery overhead).
@app.task(base=Task)
def index_users(lock_timeout, async=False):
    index_documents_for_mapping_type(UserMapping, lock_timeout, async=async)

@app.task(base=Task)
def index_items(lock_timeout, async=False):
    index_documents_for_mapping_type(ItemMapping, lock_timeout, async=async)
  2. Configure your Celery beat to fire these tasks periodically:

CELERYBEAT_SCHEDULE = {
    'index_items': {
        'task': 'recommendation.tasks.index_items',
        'schedule': datetime.timedelta(seconds=2),
        'args': (8,)
    },
    'index_users': {
        'task': 'recommendation.tasks.index_users',
        'schedule': datetime.timedelta(seconds=1),
        'args': (4,)
    },
}

Hope you guys like it; in our tests it's super fast :)

jmizgajski commented 10 years ago

Any chance of some comments? This has been hanging here for quite some time.

willkg commented 10 years ago

It has been sitting around for a while. I haven't had time to spend on this. Maybe in July, but I can't make any promises.

Right now my priorities are to get 0.10 out asap because that's totally fux0ring everyone. I have no idea whether the issues you're fixing here affect other people or not. It'd be nice to find out if anyone else is affected and if so, by what specific aspects. Knowing that might adjust my priorities.

jmizgajski commented 10 years ago

Ok, thanks for an honest reply. We have already advanced it to allow indexing really massive tables, so once you feel you are ready to talk about it just let me know - I will update the code in this issue (some changes will follow soon, so no point in wasting my time on it now).

In terms of interest I have a consulting enterprise project on the side (beside my startup) and they seem to be pretty interested in this functionality as well, so if you decide that you don't want to go with my solution I will probably have to fork away;). If you have other channels to reach "customers" I would appreciate any feedback, elasticutils did us a great favour and I'd like to repay my debt by adding some features;)
