irods / irods_capability_indexing

BSD 3-Clause "New" or "Revised" License
1 stars 11 forks source link

rapid metadata indexing can incur ~O(N^2) behavior #67

Open d-w-moore opened 3 years ago

d-w-moore commented 3 years ago

This is relevant to the new (Metalnx) metadata schema, as all metadata AVU are now kept within a list element of the index record describing the data object (or collection) they annotate.

Because of this, any scenario (but in particular automated ingest with importation/translation of many e.g. header key-value elements then being attached as AVU's by the ingest hook) in which N metadata AVU's are attached in rapid-fire fashion will spawn N delayed re-indexing tasks. If the object starts out with no AVU's (best case) we will have an average of N/2 AVU's already present per task, so the total number of iterations over AVU's tends toward N*(N/2), which is order N-squared.

d-w-moore commented 3 years ago

One way around this, with the upcoming release (4.2.10.1) of the indexing plugin, will be to use atomic metadata operations where feasible (python-irodsclient will now allow this). But ultimately, we may want to follow the approach of limiting the number of delayed tasks submitted per object, particularly when they are seen at time of submission to be redundant with later tasks already entered into the delay queue.

trel commented 3 years ago

for clarity... "later tasks already entered"... i am reading as "future tasks already entered"

d-w-moore commented 3 years ago

Correct, that was imprecisely worded... we're dealing with tasks that are known to be pending with an execution time comfortably far in the future.