Indexes are emptied before data is populated, affecting live site

jrunestone commented 1 year ago

Which Umbraco version are you using? (Please write the exact version, example: 10.1.0)

11.4.1

Bug summary

When Umbraco calls upon an index populator to rebuild an index it first clears the index and then calls the populator, leaving the front facing site empty of items where it's querying this index. This becomes a big problem if you have an index populator that takes a bit of time fetching and preparing the items to insert, or if the populator fails.

Specifics

Our use case might not fit into the intended behavior of custom search indexes, or maybe it does, this is how we tend to solve things on other platforms:

We have a dependency of an external source of content, let's call it news articles.
We need to handle, search and display these news articles on our Umbraco site.
Importing and creating pages in the content tree is overly complicated and results in thousands of nodes.
We create a custom index and a populator that fetches these articles from an external endpoint and inserts them into the index.
We can then query the index very quickly.
This needs to be done regularly, during daytime and peak traffic.

The problem arises when we need to rebuild the index with new content. Umbraco first clears the index and then calls the populator's sole method to fetch and insert items into the index. If it takes a minute, our page displaying the content from the index is completely blank until the index is rebuilt. If the populator fails, the index is left empty.

The API for this particular part of the code consists of a few private methods unfortunately so it's quite hard to roll your own. I feel it's redundant to have a cache layer on top of this index, since that's basically what the index is already.

It would be prudent to include a "pre-fetch" method on the populator, or an option to handle the clearing of the index within the populator itself. That way we could spend a minute, 30s or whatever to fetch our items and then quickly rebuild the index. The built-in content index populator does pagination runs to index the items a few at a time but we're unable to do this. My temporary fix is to override CreateIndex and noop it, and then in my populator call a new method on the index class DeferredCreateIndex (or whatever) right before we index the fetched items which calls base.CreateIndex.

If I submitted a PR for this would it be considered? I'm sure it could be made non-breaking/backwards-compatible (opt-in).

Steps to reproduce

Create a custom index
Create an index populator for this index
Sleep for some time in the PopulateIndexes method before filling it with dummy content
Observe the index being empty while rebuilding

Expected result / actual result

No response

github-actions[bot] commented 1 year ago

Hi there @jrunestone!

Firstly, a big thank you for raising this issue. Every piece of feedback we receive helps us to make Umbraco better.

We really appreciate your patience while we wait for our team to have a look at this but we wanted to let you know that we see this and share with you the plan for what comes next.

We'll assess whether this issue relates to something that has already been fixed in a later version of the release that it has been raised for.
If it's a bug, is it related to a release that we are actively supporting or is it related to a release that's in the end-of-life or security-only phase?
We'll replicate the issue to ensure that the problem is as described.
We'll decide whether the behavior is an issue or if the behavior is intended.

We wish we could work with everyone directly and assess your issue immediately but we're in the fortunate position of having lots of contributions to work with and only a few humans who are able to do it. We are making progress though and in the meantime, we will keep you in the loop and let you know when we have any questions.

Thanks, from your friendly Umbraco GitHub bot :robot: :slightly_smiling_face:

kjac commented 1 year ago

Hi @jrunestone,

Thanks for reaching out 😄

Your use case makes perfect sense. And you're absolutely right; importing external content into the content tree for no other reason than "having them available" is definitively not the way to go.

There are a few ways to work around (or limit) the downtime you're currently experiencing:

Only add/update/delete the relevant items in the index instead of rebuilding it every time. This will in many cases be the more performant way to go, and likely also put less strain on the data source. Additionally it's probably the greener choice 💚
Fetch the data from the data source asynchronously, and have it ready in a local cache before rebuilding the index. Essentially this is the same as your "pre-fetch" suggestion, only you'd have to build it outside of the populator (for now).

That being said, I would be very interested in seeing a PR for this. However, it would probably be a good idea to discuss the approach, before you start pouring hours into building something. A "true" zero downtime solution would be to rebuild the index on the side and swap it somehow (like you would with Elastic). I doubt that's doable with Lucene, though.

I'll pass this to the team for internal discussion. We'll get back to you here 👍

jrunestone commented 1 year ago

Thanks for the suggestions!

1: Unfortunately we still need to fetch everything first to be able to tell which items need crudding, the actual rebuilding of the index is blazing fast.

2: Is a good workaround, however I already implemented the workaround I mentioned where I override CreateIndex and call it myself in the populator when it's ready :P Not future proof but less code right now to solve the issue anyway.

Looking forward to hearing what the team has to say. :+1:

bergmania commented 1 year ago

Hi @jrunestone

We discussed it internally and your "pre-fetch" method at least seems better than what we have today. Our biggest concern is memory usage, if you have to save all data in memory while building the index. So we think it will be good as an opt-in feature.

umbraco / Umbraco-CMS