PaNOSC Search Scoring V2.x

nitrosx commented 1 year ago

This PR implements incremental weight computation, which should scale better to a higher number of datasets. It still needs a lot of testing and an in-depth review. I created it so other people can work on it, as I do not have time to work on this at the moment.

VKTB commented 1 year ago

Hi @nitrosx, I am posting my findings here as they will hopefully be useful to anyone who may be testing, reviewing, or working on this PR.

From your last email, my understanding is that the /compute endpoint no longer has to be called for the weights to be computed. If I understood you correctly, this PR changes the logic so that when a new item is inserted or an existing one is updated, the database automatically updates all the components of the weights that are affected by the change. Also, when a query is sent, the database should compute (on the fly) the weights of the words that are extracted from the query and present in the items, and return the relevant items.
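For anyone reviewing this, the behaviour described above can be sketched roughly as follows. This is only an illustration of the general idea, not the code in this PR: the class name, the in-memory dictionaries, and the TF-IDF formula are all assumptions, and the real implementation stores its data in database collections.

```python
import math
from collections import Counter, defaultdict


class IncrementalIndex:
    """Toy in-memory index (hypothetical; not the PR's data model)."""

    def __init__(self):
        self.tf = {}                # item id -> Counter of term counts
        self.df = defaultdict(int)  # term -> number of items containing it

    def remove(self, item_id):
        # Undo the item's contribution to the document frequencies.
        for term in self.tf.pop(item_id, {}):
            self.df[term] -= 1
            if self.df[term] == 0:
                del self.df[term]

    def upsert(self, item_id, terms):
        # Insert or update: only the terms of this one item are touched.
        self.remove(item_id)
        counts = Counter(terms)
        self.tf[item_id] = counts
        for term in counts:
            self.df[term] += 1

    def score(self, query_terms):
        # Weights are computed at query time, only for the query terms.
        # The formula below is an assumed TF-IDF variant.
        n_items = len(self.tf)
        scores = {}
        for item_id, counts in self.tf.items():
            total = sum(counts.values())
            value = 0.0
            for term in set(query_terms):
                if term in counts:
                    idf = math.log(1 + n_items / self.df[term])
                    value += (counts[term] / total) * idf
            if value > 0:
                scores[item_id] = value
        return scores
```

With an index like this, an item containing the stemmed terms "propos" and "part" would be expected to come back with a non-zero score for a query that yields those same terms.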

Listed below are the things I did and my findings:

I deleted the whole database and started from scratch.
I updated the configuration file to include incrementalWeightsComputation, which I set to True.
I populated the search scoring component with 100 documents by sending a POST request containing the documents to the /items endpoint.
I know that one of the documents has the id pid:123 and a summary that starts with "This proposal is part of a ...", so I sent a POST request to the /score endpoint with the following JSON: {"query": "This proposal is part of a"}.

I was expecting to get back the item with the id pid:123 and the summary "This proposal is part of a ..." (along with a score). However, as shown below, I did not get any items back.

{
    "request": {
        "query": "This proposal is part of a",
        "itemIds": [],
        "group": "",
        "limit": -1
    },
    "query": {
        "query": "This proposal is part of a",
        "terms": [
            "propos",
            "part"
        ]
    },
    "scores": [],
    "dimension": 0,
    "computeInProgress": false,
    "started": "2023-02-22T16:12:22.575992",
    "ended": "2023-02-22T16:12:22.581431"
}
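
For anyone who wants to repeat the check, the score request above can be sketched with the Python standard library as follows. The base URL and the helper names are assumptions; only the /score path, the request body, and the "scores" key in the response come from the report above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def build_score_request(query, **extra):
    """Build (but do not send) the POST /score request."""
    payload = {"query": query, **extra}
    return urllib.request.Request(
        BASE_URL + "/score",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def has_scores(response_body):
    """Return True if the /score response contains at least one score."""
    return len(json.loads(response_body).get("scores", [])) > 0
```

The request can then be sent with urllib.request.urlopen(build_score_request("This proposal is part of a")) and the decoded body passed to has_scores; for the response shown above it would return False.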

I also tested it by specifying all the parameters in the JSON ({ "query": "This proposal is part of a", "group": "Documents", "limit": 1000, "itemIds": ["pid:123"] }), but I did not get any items back.
I also did not get any items back after modifying the values of some of the fields of the document with the id pid:123 and the summary "This proposal is part of a ...". It is worth mentioning that the PATCH /items/<id> endpoint does not behave as expected: it replaces the entire item rather than updating only the fields supplied in the request.
I kept checking the database while populating it with items and sending score requests, and I could only ever see the items collection in it, so there were no collections for weights etc.
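
To make the PATCH point concrete, here is a minimal sketch of the difference between replacing an item and merging a partial update into it. The function names and the item layout are hypothetical and are not taken from the code in this PR.

```python
def replace_item(stored, patch):
    """Behaviour described in the report: the stored item is overwritten."""
    return dict(patch)


def merge_item(stored, patch):
    """Expected PATCH behaviour: only the supplied fields change.

    Nested dictionaries are merged recursively; other values are replaced.
    """
    merged = dict(stored)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_item(merged[key], value)
        else:
            merged[key] = value
    return merged
```

With merge_item, a request that supplies only one field leaves every other field of the stored item untouched, whereas replace_item drops them.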

nitrosx commented 1 year ago

@VKTB thank you so much for testing the new version and for the details. Would you be able to do the following in your test environment:

Make a GET on /items and see if you get all your items back
Make a GET on /terms/count and check how many terms have been extracted
Make a GET on /terms and check the output.

You could also connect to the database directly and check whether there are any entries in the tf and idf collections.

Let me know.

VKTB commented 1 year ago

@nitrosx Thank you for your reply.

Make a GET on /items and see if you get all your items back

Yes, I can see all the items that I posted to the search scoring component

Make a GET on /terms/count and check how many terms have been extracted

I get a 500 - Internal Server Error

Make a GET on /terms and check the output.

I get an empty list back ([]), presumably because no terms were created when I inserted the items or modified the item with the id pid:123?

You could always connect to the database directly and see if there is any entry in the tf and idf collections

As I said in my previous comment, I can only see the items collection in the database; there are no collections for weights, tf, idf, etc. I am not sure why this is the case.

VKTB commented 1 year ago

Hi @nitrosx, I thought I would post my findings here as well in case they are useful to anyone that may be testing/reviewing/working on this PR.

I pulled the latest changes from the v2.x branch and modified the docker-compose.yml file to build from the Dockerfile to ensure that the Docker image uses the latest code changes. I then tried testing the changes but I am getting the following error when I post an item of group Documents to the /items endpoint: 400 Bad Request – An exception of type TypeError occurred. Arguments:\n('string indices must be integers',).
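
For context, Python raises this TypeError whenever a string is indexed with a non-integer key. One common cause, which may or may not be the cause here, is code that expects a dictionary but receives a string, for example a JSON body that was serialised twice. A minimal sketch, with hypothetical function names:

```python
import json


def read_group(item):
    # Works only if item is a dict; a str here raises TypeError.
    return item["group"]


def try_read(value):
    try:
        return read_group(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


item = {"group": "Documents"}
as_string = json.dumps(item)  # a str, not a dict
```

Here try_read(item) returns the group, while try_read(as_string) returns a message starting with "TypeError: string indices must be integers" (the exact wording varies slightly between Python versions).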

I can see that the item gets added to the items collection in the database, but judging from the entry in the status collection (see below), the computation seems to be stuck: the entry has not changed for the past 30 minutes.

[
  {
    _id: ObjectId("64a3ed5180fa4d2d2668250e"),
    inProgress: true,
    incrementalWeightsComputation: true,
    progressDescription: 'Computing weights TF',
    progressPercent: 0.2
  }
]

panosc-eu / panosc-search-scoring

PaNOSC Search Scoring V2.x #29