inveniosoftware / invenio-collections

Invenio module for organizing metadata into collections.
https://invenio-collections.readthedocs.io
GNU General Public License v2.0

receivers: bad performance of _find_matching_collections_internally #72

Open jacquerie opened 8 years ago

jacquerie commented 8 years ago

On INSPIRE we have roughly 20 collections defined as queries: https://github.com/inspirehep/inspire-next/blob/5b7207c8ee090658f23b818168fcef31d846e139/inspirehep/config.py#L150-L235. Using invenio-collections has led to a 3x slowdown in the migration of our nightly (http://inspire-nightly.cern.ch/), which means we don't get a full set of migration errors each night.

Here's a migration profile of a task adding 2100 records: https://cernbox.cern.ch/index.php/s/D0rPrCM68FaqGPh. The culprit is clearly get_record_collections, partly because of _build_cache, partly because of _find_matching_collections_internally. The issue with _build_cache has been fixed by #69, so only the other one remains.

Now, if we look inside _find_matching_collections_internally, no single method appears to be particularly slow. Creating a query and checking whether a record matches it are both fairly expensive operations, and doing that 20 times per record is what makes it slow.

Truth be told, running a query just to copy a value from collections.primary to _collections seems a little silly to me, so I'd like to inject my own behaviour here and fall back to _find_matching_collections_internally only for the actual queries.
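To make the idea concrete, here is a minimal sketch of such an injected matcher. It is not the module's API: make_hybrid_matcher and the fallback callable are hypothetical names, and the record layout (a list of dicts under collections, each with an optional primary key) is an assumption.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]
Matcher = Callable[[Record], List[str]]


def make_hybrid_matcher(copied: Iterable[str], fallback: Matcher) -> Matcher:
    """Build a matcher that copies known names and queries only for the rest.

    `copied` lists the collections that are a plain copy of collections.primary;
    `fallback` stands in for the existing query-based matching (hypothetical).
    """
    copied_names = set(copied)

    def matcher(record: Record) -> List[str]:
        # Assumed layout: record["collections"] is a list of {"primary": name}.
        primary = [
            entry["primary"]
            for entry in record.get("collections", [])
            if "primary" in entry
        ]
        # Cheap path: a membership test, no query object is built.
        matched = [name for name in primary if name in copied_names]
        # Expensive path: only the collections that need a real query.
        for name in fallback(record):
            if name not in matched:
                matched.append(name)
        return matched

    return matcher
```

With this shape the fallback still runs once per record, but it only has to evaluate the collections that are defined by a genuine query.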

NB: _find_matching_collections_externally has better performance, but it puts a lot of pressure on ES, which makes it lose record inserts...

Proposal

jirikuncar commented 8 years ago

:+1: for COLLECTIONS_MATCHER - consider moving it to the extension state as a matcher property.
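A rough sketch of what a matcher property on the extension state could look like. The class name CollectionsState, the default_matcher stand-in, and the "package.module:callable" import-string format are assumptions for illustration, not the module's actual code.

```python
from functools import cached_property
from importlib import import_module


def default_matcher(record):
    """Stand-in for the built-in query-based matching (hypothetical)."""
    return []


class CollectionsState:
    """Hypothetical extension state that resolves the matcher lazily."""

    def __init__(self, config):
        # `config` is any mapping, for example a Flask app.config.
        self.config = config

    @cached_property
    def matcher(self):
        value = self.config.get("COLLECTIONS_MATCHER", default_matcher)
        if isinstance(value, str):
            # Assumed format: "package.module:callable".
            module_name, _, attr = value.partition(":")
            value = getattr(import_module(module_name), attr)
        return value
```

Resolving the value once and caching it keeps the import cost out of the per-record path, which is where the slowdown was observed.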

jacquerie commented 7 years ago

Alternate Proposal

Add an extra parameter, copy, to the query configuration. If it is set to True, it short-circuits this line, skipping the expensive query building: https://github.com/inveniosoftware/invenio-collections/blob/4c5fd23a4c3fd24dec92f6215179f2098a684ee5/invenio_collections/receivers.py#L69
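A minimal sketch of how the copy flag could short-circuit matching. The configuration shape (dicts with name, query and copy keys), the record layout, and the query_matches callable are all assumptions made for this example; they do not mirror receivers.py.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def find_matching_collections(
    record: Record,
    collections: List[Dict[str, Any]],
    query_matches: Callable[[str, Record], bool],
) -> List[str]:
    """Return matching collection names, skipping queries when copy is set."""
    # Assumed layout: record["collections"] is a list of {"primary": name}.
    primary = {
        entry["primary"]
        for entry in record.get("collections", [])
        if "primary" in entry
    }
    matched = []
    for collection in collections:
        if collection.get("copy"):
            # Short circuit: compare names directly, no query is built.
            if collection["name"] in primary:
                matched.append(collection["name"])
        elif query_matches(collection["query"], record):
            # Only genuine queries reach the expensive path.
            matched.append(collection["name"])
    return matched
```

If most of the roughly 20 collections are plain copies, only the remaining few would pay for query building on each record.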