msaari / relevanssi

Relevanssi, a WordPress plugin to improve the search
GNU General Public License v3.0
49 stars 21 forks source link

Request: move doc count to scheduled task #20

Closed MikeNGarrett closed 4 years ago

MikeNGarrett commented 4 years ago

Hello!

Before I say anything else, I want to thank you for publishing such an useful and well-documented plugin. I use Relevanssi free and premium for a number of clients. Thank you.

I've read your comments in a number of places stating that at a certain point Relevanssi should not handle very large data sets. I absolutely agree, but I am always interested in finding ways to increase that limit, if possible.

One of the biggest bottlenecks I've found is the database query: SELECT COUNT(DISTINCT(doc)). (code reference)

I understand this is important to calculate weights and relevance, but it seems like this can be moderately accurate and still achieve similar results. Is that true?

If so, could this count be deferred to a scheduled task that updates the option? This would increase performance pretty significantly for all uses, but especially for sites with large indexes.

msaari commented 4 years ago

That's a good point. I recently moved another very slow task to an AJAX action – whenever it was needed to be done, an AJAX action is triggered to update it and store the value in an option. A similar approach would probably work here.

You're right, an exact value is not necessary, an approximation will do fine. I'll see what is the best approach to optimize this, there's definitely some optimization possible here. I'll update this issue when I come up with something.

MikeNGarrett commented 4 years ago

Sounds good. I'd be happy to contribute an idea, if you'd accept a PR.

msaari commented 4 years ago

Yes, PRs are welcome.

msaari commented 4 years ago

Thanks. Couple of points:

1) If the database is so large or the site performs so poorly that doc_count cannot be counted, Relevanssi doesn't really need to support that. I don't mind if doc_count is slightly imprecise, but 0? That just makes the search weights meaningless, and I don't like that. In those cases using Relevanssi Light is always a better idea, and I'd rather point people to that, because that's an actual, functional search that performs much, much better and produces accurate results fast even with very large databases.

2) I generally don't like scheduled tasks and most of the time there's no reason to update the doc count every hour; there are many sites where the doc count needs an update once a week, or once a day, thus running the update action hourly just wastes resources. Also, adding the scheduled action makes it necessary to clear the hook on deactivation, which adds unnecessary code overhead (also it's confusing to have two sets of functions called relevanssi_uninstall() and _relevanssi_uninstall() that run on different occasions and do different things). So I'll go with my original idea of making the update an async action.

I agree it's stupid to count all the counts every time the indexing tab is loaded, so I'll add those other counts as options and have the async action update those as well.

MikeNGarrett commented 4 years ago

Understood.

  1. I intended for the 0 default to be in place only until the scheduled task or indexing ran which needs to happen before you have a count anyway.
  2. It would be simple to set a "needs to be counted" option that turns the scheduled task on or off. The important bit it allowing the server to handle these tasks over a longer period of time without bogging down the end user. If you'd rather use an async task to accomplish this, I think that's fine as long as this is accomplished by the server without hitting timeouts.

Another issue I was addressing in this PR was the scope of insert_edit which fires a new document count for anything that triggers an insert or update, even if the index isn't updated. For example, editing a menu calls this for each menu item in the menu.

Thank you very much for reviewing this so quickly. I really appreciate the time you've taken.

msaari commented 4 years ago

I was thinking about the relevanssi_disable_count filter hook – in a situation where that's necessary, using Relevanssi in the first place doesn't make sense, and Relevanssi Light is a much better tool.

Good point about relevanssi_insert_edit() – I modified it now so that relevanssi_async_update_doc_count() only fires if the post is published. (And not when the post is removed, because relevanssi_remove_doc() already runs it, so no need to run it twice.)