katabase / Application

Web app and API of the Katabase/MSS project.
https://katabase.huma-num.fr/
GNU General Public License v3.0

`/Search` route and `reconciliator()` take exponentially long with lots of candidates for reconciliation #1

Open paulhectork opened 2 years ago

paulhectork commented 2 years ago

issue description

what the title says: when many candidates have been selected and need to be reconciled, the reconciliation process takes exponentially long. this is a problem because `reconciliator()` is used in the search engine of the human-readable website (in the `/Search` route). the problem seems to be in the `double_loop()` function. for example:

although this problem occurs rarely (I only noticed it after working on the website for months), it is unsuitable for a client-exposed function: a client will virtually never wait 5+ minutes for a response. it also causes a server-side problem, because the application keeps running the search even after the client has left the page. this could strain the app if several pending requests pile up.
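for context, a minimal sketch of why a pairwise double loop blows up as candidates grow, and how a single-pass grouping keyed on a normalized form stays linear. the names and the exact-match criterion here are hypothetical, not the actual `double_loop()` code:

```python
# hypothetical sketch: pairwise reconciliation is O(n^2),
# grouping by a normalized key in a dict is O(n).

def double_loop_style(candidates):
    """Compare every pair of candidates: O(n^2) comparisons."""
    groups = []
    seen = set()
    for i, a in enumerate(candidates):
        if i in seen:
            continue
        group = [a]
        for j in range(i + 1, len(candidates)):
            # illustrative match criterion: case-insensitive equality
            if j not in seen and a.lower() == candidates[j].lower():
                group.append(candidates[j])
                seen.add(j)
        groups.append(group)
    return groups

def single_pass(candidates):
    """Group by a normalized key in one pass: O(n)."""
    buckets = {}
    for c in candidates:
        buckets.setdefault(c.lower(), []).append(c)
    return list(buckets.values())
```

both functions produce the same groups for exact-match normalization; if the real reconciliation does fuzzy string comparison, the quadratic loop cannot be replaced by a plain dict lookup, but a blocking/indexing step (only comparing candidates that share a key) gives a similar speedup.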

technical problem

the `/Search` route calls `reconciliator()` to group different occurrences of the same manuscript together. this function works in two steps:

solutions

i've thought of two possible solutions:

to be continued...

paulhectork commented 2 years ago

to send asynchronous responses from a flask server to a client: https://www.shanelynn.ie/asynchronous-updates-to-a-webpage-with-flask-and-socket-io/

matgille commented 2 years ago

When I (poorly) designed the function, if I recall correctly, the idea was to perform the reconciliation upstream (on a regular basis, for instance, or to reconcile again when more items were added to the database), and to give access only to the results of the reconciliation.

In my mind it made no sense to trigger the function each time a request was sent to the server.

Hope it will be of help!

Best of luck with this project,

Matthias

Edit: this would work if the authors are all pre-identified, of course (though I'm not sure that reconciling on strings is really more interesting than reconciling on identified entities). Another assumption was that the reconciliation had to be done on authors, because this type of data fitted the task best (it would exclude the right amount of entries).
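the upstream approach described in the comment above could be sketched as a batch job that runs the expensive reconciliation on a schedule (or whenever the database changes) and persists the groups, so that `/Search` only reads a cached result. file name and function names here are hypothetical:

```python
import json
from pathlib import Path

# hypothetical cache file for precomputed reconciliation groups
CACHE = Path("reconciliation_cache.json")

def rebuild_cache(reconcile, items):
    """Batch job: run the expensive reconciliation once and persist the groups.
    Re-run on a schedule, or whenever items are added to the database."""
    groups = reconcile(items)
    CACHE.write_text(json.dumps(groups))

def cached_groups():
    """What the /Search route would read instead of reconciling on demand."""
    return json.loads(CACHE.read_text())
```

the trade-off is staleness (new items are only grouped at the next batch run) against response time, which is why this only pays off if the upstream run is feasible at all.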

paulhectork commented 2 years ago

hi! thanks for letting me know about the context in which the reconciliator was created! from what I see in the app's code, the reconciliation function is basically the website's search engine. it's a fine solution 99.9% of the time, and I've only encountered this problem once.

I've already tried to perform the reconciliation on all authors upstream, but it took ~50 hours, so I've abandoned it for now and am not sure I'll be able to do it at all.

so far, on-demand reconciliation is a working solution, and redoing the whole search engine would be quite a hassle. I was just cleaning up some stuff on the website and opened this small issue to come back to later. the bug should, I hope, be very easy to fix.

best, Paul