panosc-eu / panosc-search-scoring

PaNOSC Search Scoring V2.x #29

Open nitrosx opened 1 year ago

nitrosx commented 1 year ago

This PR implements incremental weight computation, which should scale better to a higher number of datasets. It still needs a lot of testing and an in-depth review. I created it so that other people can work on it, as I do not have time to work on it at the moment.

VKTB commented 1 year ago

Hi @nitrosx, posting my findings here as they will hopefully be useful to anyone that may be testing/reviewing/working on this PR.

From your last email, my understanding is that the /compute endpoint no longer has to be called for the weights to be computed. If I understood you correctly, this PR changes the logic so that when a new item is inserted or an existing one is updated, the database automatically updates all the components of the weights that are affected by the change. Also, when a query is sent, the database should compute (on the fly) the weights of the words that are extracted from the query and present in the items, and return the relevant ones.
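To make that understanding concrete, here is a minimal in-memory sketch of the idea. It is not the code in this PR: the class and method names are hypothetical, plain dictionaries stand in for the items, tf and idf collections, and the IDF formula is just one common variant.

```python
import math
from collections import Counter, defaultdict


class IncrementalIndex:
    """Toy stand-in for the items/tf/idf collections (hypothetical names)."""

    def __init__(self):
        self.term_counts = {}              # item id -> Counter of its terms
        self.doc_freq = defaultdict(int)   # term -> number of items containing it

    def upsert(self, item_id, text):
        """Insert or update one item, touching only the terms it affects."""
        old = self.term_counts.get(item_id)
        if old is not None:
            # Withdraw the previous version's contribution first.
            for term in old:
                self.doc_freq[term] -= 1
                if self.doc_freq[term] == 0:
                    del self.doc_freq[term]
        new = Counter(text.lower().split())
        self.term_counts[item_id] = new
        for term in new:
            self.doc_freq[term] += 1

    def score(self, query):
        """Compute TF-IDF on the fly, only for the terms in the query."""
        n_items = len(self.term_counts)
        scores = {}
        for term in set(query.lower().split()):
            df = self.doc_freq.get(term, 0)
            if df == 0:
                continue  # term not present in any item
            idf = math.log(1 + n_items / df)  # assumed IDF variant
            for item_id, counts in self.term_counts.items():
                if term in counts:
                    tf = counts[term] / sum(counts.values())
                    scores[item_id] = scores.get(item_id, 0.0) + tf * idf
        return scores
```

With this shape, an update only changes the document frequencies of the terms in the old and new versions of the item, so no separate full recomputation step is needed before querying.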

Listed below are the things I did and my findings:

nitrosx commented 1 year ago

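@VKTB thank you so much for testing the new version and for the details. Would you be able to do the following in your testing environment:

  • Make a GET on /items and see if you get all your items back

  • Make a GET on /terms/count and check how many terms have been extracted

  • Make a GET on /terms and check the output.

You could always connect to the database directly and see if there is any entry in the tf and idf collections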

Let me know

VKTB commented 1 year ago

@nitrosx Thank you for your reply.

  • Make a GET on /items and see if you get all your items back

Yes, I can see all the items that I posted to the search scoring component

  • Make a GET on /terms/count and check how many terms have been extracted

I get a 500 - Internal Server Error

  • Make a GET on /terms and check the output.

I get an empty list back ([]), presumably because no terms were created when I inserted the items or modified the item with id id:123.

You could always connect to the database directly and see if there is any entry in the tf and idf collections

As I said in my previous comment, I can only see the items collection in the database. There are no collections for weights, tf, idf, etc., and I am not sure why this is the case.
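For anyone who wants to repeat the three endpoint checks above, a small script along these lines may help. It is only a sketch: the function names are my own, the base URL has to be supplied by whoever runs it, and it assumes the endpoints return JSON.

```python
import json
import urllib.error
import urllib.request


def http_get(url):
    """Return (status code, parsed JSON body or None) for a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are reported rather than raised.
        return err.code, None
    try:
        return status, json.loads(raw.decode("utf-8"))
    except ValueError:
        return status, None  # body was not valid JSON


def run_checks(base_url, fetch=http_get):
    """GET the three endpoints discussed in this thread."""
    results = {}
    for path in ("/items", "/terms/count", "/terms"):
        status, body = fetch(base_url.rstrip("/") + path)
        results[path] = {"status": status, "body": body}
    return results


def summarize(results):
    """Turn the raw results into one readable line per endpoint."""
    lines = []
    for path, res in results.items():
        body = res["body"]
        if res["status"] != 200:
            note = "request failed"
        elif isinstance(body, list):
            note = f"{len(body)} entries"
        else:
            note = f"body: {body!r}"
        lines.append(f"GET {path} -> {res['status']} ({note})")
    return lines
```

It can be run with something like `print("\n".join(summarize(run_checks("http://localhost:8000"))))`, where the address is an assumption and should be replaced with wherever the scoring service is exposed.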

VKTB commented 1 year ago

Hi @nitrosx, I thought I would post my findings here as well in case they are useful to anyone that may be testing/reviewing/working on this PR.

I pulled the latest changes from the v2.x branch and modified the docker-compose.yml file to build from the Dockerfile to ensure that the Docker image uses the latest code changes. I then tried testing the changes but I am getting the following error when I post an item of group Documents to the /items endpoint: 400 Bad Request – An exception of type TypeError occurred. Arguments:\n('string indices must be integers',).
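As a general note on the error itself: Python raises this TypeError when a str is indexed with a non-integer key, which usually means that code expecting a dict (or a list of dicts) received a plain string. I do not know where this happens in the service, so the snippet below only reproduces the message with a hypothetical helper and payload shape.

```python
def extract_values(fields):
    """Hypothetical helper: expects a list of dicts with a 'value' key."""
    return [field["value"] for field in fields]


def try_extract(fields):
    """Run extract_values and report a TypeError as text instead of raising."""
    try:
        return extract_values(fields)
    except TypeError as err:
        return f"TypeError: {err}"


# Expected shape: a list of dicts.
ok = try_extract([{"name": "title", "value": "Neutron data"}])

# Passing a single dict instead: iterating over it yields its keys (strings),
# so field["value"] ends up indexing a str with a str.
bad = try_extract({"name": "title", "value": "Neutron data"})
```

Here `ok` is the list of extracted values, while `bad` is a message starting with `TypeError: string indices must be integers`. A field that arrives as a string where an object is expected would fail in the same way.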

I can see that the item gets added to the items collection in the database, but judging from the entry (see below) in the status collection, the computation seems to be stuck: the entry has not changed for the past 30 minutes.

[
  {
    _id: ObjectId("64a3ed5180fa4d2d2668250e"),
    inProgress: true,
    incrementalWeightsComputation: true,
    progressDescription: 'Computing weights TF',
    progressPercent: 0.2
  }
]
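To put a number on how long the status has been stuck, one option is to record the status document at intervals and compare the progress fields. The helper below is a sketch with names of my own choosing; how the snapshots are collected (for example by re-running the query above by hand) is left open.

```python
def progress_key(status):
    """Only the fields that describe progress; _id and the rest are ignored."""
    return (
        status.get("inProgress"),
        status.get("progressDescription"),
        status.get("progressPercent"),
    )


def unchanged_for(snapshots):
    """Seconds for which the latest status has stayed the same.

    snapshots: list of (timestamp_in_seconds, status_dict), oldest first.
    """
    if not snapshots:
        return 0
    last_time, last_status = snapshots[-1]
    key = progress_key(last_status)
    since = last_time
    # Walk backwards until the progress fields differ.
    for timestamp, status in reversed(snapshots):
        if progress_key(status) != key:
            break
        since = timestamp
    return last_time - since
```

Three identical snapshots taken at 0, 600 and 1800 seconds would give 1800, which corresponds to the 30 minutes observed here.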