katabase / Application

Web app and API of the Katabase/MSS project.
https://katabase.huma-num.fr/
GNU General Public License v3.0

`/Search` route and `reconciliator()` take exponentially long with lots of candidates for reconciliation #1

Open paulhectork opened 2 years ago

paulhectork commented 2 years ago

issue description

what the title says: when many candidates have been selected and need to be reconciled, the reconciliation process takes exponentially long. this is a problem because `reconciliator()` is used in the search engine of the human-readable website (in the `/Search` route). the problem seems to be in the `double_loop()` function. for example:

although this problem occurs rarely (I only noticed it after working on the website for months), it is unsuitable for a client-exposed function: a client will virtually never wait 5+ minutes for a response. it also causes a server-side problem, because the application keeps running the search even after the client has left the page. this could strain the app if several pending requests pile up.
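for context, a minimal sketch of why a pairwise double loop blows up as candidates grow, and how a single-pass grouping keyed on a normalized form stays linear. the names and the exact-match criterion here are hypothetical, not the actual `double_loop()` code:

```python
# hypothetical sketch: pairwise reconciliation is O(n^2),
# grouping by a normalized key in a dict is O(n).

def double_loop_style(candidates):
    """Compare every pair of candidates: O(n^2) comparisons."""
    groups = []
    seen = set()
    for i, a in enumerate(candidates):
        if i in seen:
            continue
        group = [a]
        for j in range(i + 1, len(candidates)):
            # illustrative match criterion: case-insensitive equality
            if j not in seen and a.lower() == candidates[j].lower():
                group.append(candidates[j])
                seen.add(j)
        groups.append(group)
    return groups

def single_pass(candidates):
    """Group by a normalized key in one pass: O(n)."""
    buckets = {}
    for c in candidates:
        buckets.setdefault(c.lower(), []).append(c)
    return list(buckets.values())
```

both functions produce the same groups for exact-match normalization; if the real reconciliation does fuzzy string comparison, the quadratic loop cannot be replaced by a plain dict lookup, but a blocking/indexing step (only comparing candidates that share a key) gives a similar speedup.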

technical problem

the `/Search` route calls `reconciliator()` to group different occurrences of the same manuscript together. this function works in two steps:

solutions

i've thought of two possible solutions:

to be continued...

paulhectork commented 2 years ago

to send asynchronous responses from a flask server to a client: https://www.shanelynn.ie/asynchronous-updates-to-a-webpage-with-flask-and-socket-io/

matgille commented 2 years ago

When I (poorly) designed the function, if I recall correctly, the idea was to perform the reconciliation upstream (on a regular basis, for instance, or to reconcile again when more items were added to the database), and to give access only to the results of the reconciliation.

In my mind it made no sense to trigger the function each time a request was sent to the server.

Hope it will be of help!

Best of luck with this project,

Matthias

Edit: this would work if the authors are all pre-identified, of course (though I'm not sure that reconciling on strings is really more interesting than reconciling on identified entities). Another assumption was that the reconciliation had to be done on authors, because this type of data fitted the task best (it would exclude the right amount of entries).
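the upstream approach described in the comment above could be sketched as a batch job that runs the expensive reconciliation on a schedule (or whenever the database changes) and persists the groups, so that `/Search` only reads a cached result. file name and function names here are hypothetical:

```python
import json
from pathlib import Path

# hypothetical cache file for precomputed reconciliation groups
CACHE = Path("reconciliation_cache.json")

def rebuild_cache(reconcile, items):
    """Batch job: run the expensive reconciliation once and persist the groups.
    Re-run on a schedule, or whenever items are added to the database."""
    groups = reconcile(items)
    CACHE.write_text(json.dumps(groups))

def cached_groups():
    """What the /Search route would read instead of reconciling on demand."""
    return json.loads(CACHE.read_text())
```

the trade-off is staleness (new items are only grouped at the next batch run) against response time, which is why this only pays off if the upstream run is feasible at all.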

paulhectork commented 2 years ago

hi! thanks for letting me know about the context in which the reconciliator was created! from what I see in the app's code, the reconciliation function is basically the website's search engine. it's a fine solution 99.9% of the time, and I've only encountered this problem once.

I've already tried to perform the reconciliation on all authors upstream, but it took ~50 hours, so I've abandoned it for now and am not sure I'll be able to do it at all.

so far, on-demand reconciliation is a working solution, and redoing the whole search engine would be quite a hassle. I was just cleaning up some stuff on the website and opened this small issue to come back to later. the bug should, I hope, be very easy to fix.

best, Paul