Open jacquerie opened 8 years ago
:+1: for `COLLECTIONS_MATCHER`
- consider moving it to the extension state as a `matcher` property.
Add an extra parameter `copy` to the query configuration. If set to `True`, it short-circuits this line, skipping the expensive query building: https://github.com/inveniosoftware/invenio-collections/blob/4c5fd23a4c3fd24dec92f6215179f2098a684ee5/invenio_collections/receivers.py#L69
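
To make the `copy` idea concrete, here is a rough sketch of what such a short circuit could look like. It is not the actual `invenio-collections` code: `build_query`, `collection_matches`, the shape of the per-collection config dict, and the assumption that `collections` is a list of dicts with a `primary` key are all hypothetical stand-ins.

```python
def build_query(query_string):
    """Hypothetical stand-in for the expensive query-building step."""
    field, _, value = query_string.partition(":")
    return lambda record: value in record.get(field, [])


def collection_matches(name, config, record):
    """Decide whether `record` belongs to the collection called `name`."""
    if config.get("copy"):
        # Short circuit: no query is built, the collection name is simply
        # looked up among the record's primary collections.
        primaries = [c.get("primary") for c in record.get("collections", [])]
        return name in primaries
    # Fallback: build the query and evaluate it against the record.
    return build_query(config["query"])(record)


record = {"collections": [{"primary": "HEP"}], "keywords": ["thesis"]}
print(collection_matches("HEP", {"copy": True}, record))
print(collection_matches("Theses", {"query": "keywords:thesis"}, record))
```

With a flag like this, only the collections that are real queries would pay the query-building cost.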
On INSPIRE we have roughly 20 collections defined as queries: https://github.com/inspirehep/inspire-next/blob/5b7207c8ee090658f23b818168fcef31d846e139/inspirehep/config.py#L150-L235. Using `invenio-collections` has led to a 3x slowdown in the migration of our nightly (http://inspire-nightly.cern.ch/), which means we don't get a full set of migration errors each night.

Here's a migration profile of a task adding 2100 records: https://cernbox.cern.ch/index.php/s/D0rPrCM68FaqGPh. The culprit is clearly `get_record_collections`, partly because of `_build_cache`, partly because of `_find_matching_collections_internally`. The issue with `_build_cache` has been fixed by #69, so only the other one remains.

Now, if we go inside
`_find_matching_collections_internally`, no method appears to be particularly slow: it's just that creating a query and checking whether a record matches are quite expensive operations, and doing it 20 times per record is what makes it slow.

Truth be told, running a query to copy a value from `collections.primary` to `_collections` seems a little silly to me, so I'd like to inject my own behaviour here and fall back to `_find_matching_collections_internally` only for the actual queries.

NB: `_find_matching_collections_externally` has better performance, but it puts a lot of pressure on ES, which makes it lose record inserts...

Proposal

A `COLLECTIONS_MATCHER` configuration variable to override https://github.com/inveniosoftware/invenio-collections/blob/a8aec24353e41363ad52004e70efcf66091a5d4f/invenio_collections/receivers.py#L102-L107 and provide your own matcher.
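
As an illustration of the proposal, the sketch below shows how a pluggable matcher might be wired in. Everything here is an assumption made for the example: the `(collections, record)` signature, the plain dict standing in for the application config, the simplified `get_record_collections`, and the `inspire_matcher` name. Queries are modelled as plain Python predicates rather than real search queries.

```python
def default_matcher(collections, record):
    """Stand-in for query-based matching: evaluate every collection query."""
    return [name for name, predicate in collections.items() if predicate(record)]


def inspire_matcher(collections, record):
    """Hypothetical custom matcher: copy primaries, query only the rest."""
    # Cheap path: copy the names straight from `collections.primary`.
    copied = [c["primary"] for c in record.get("collections", []) if "primary" in c]
    # Expensive path: only the collections that are actual queries.
    queried = default_matcher(collections, record)
    return copied + queried


def get_record_collections(record, collections, config):
    """Simplified receiver: look up the matcher in config, else use the default."""
    matcher = config.get("COLLECTIONS_MATCHER", default_matcher)
    return matcher(collections, record)


# Only real queries need to be defined as collections now.
query_collections = {"Cited": lambda r: r.get("citation_count", 0) > 0}
config = {"COLLECTIONS_MATCHER": inspire_matcher}
record = {"collections": [{"primary": "HEP"}], "citation_count": 3}

print(get_record_collections(record, query_collections, config))
```

The point of the design is that the default behaviour stays untouched when `COLLECTIONS_MATCHER` is not set, while an application like INSPIRE can skip query evaluation for values it already has on the record.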