nitrosx opened 1 year ago
Hi @nitrosx, posting my findings here as they will hopefully be useful to anyone who may be testing/reviewing/working on this PR.
From your last email, my understanding is that the `/compute` endpoint no longer has to be used for the weights to be computed. If I understood you correctly, this PR changes the logic so that when a new item is inserted or an existing one is updated, the database automatically updates all the components of the weights that are influenced by the change. Also, when a query is sent, the database computes (on the fly) the weights of the words extracted from the query and present in the items, and returns the relevant ones.
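If that understanding is right, the bookkeeping can be sketched roughly as follows. This is a minimal illustration in Python and not the actual implementation; the class and method names are made up.

```python
import math
from collections import Counter, defaultdict

class IncrementalIndex:
    """Toy sketch of incremental weight bookkeeping: on every insert or
    update we adjust per-item term counts (tf) and document frequencies
    (df), so no separate /compute pass is needed; idf is derived on the
    fly at query time."""

    def __init__(self):
        self.tf = {}                # item id -> Counter of term counts
        self.df = defaultdict(int)  # term -> number of items containing it

    def upsert(self, item_id, terms):
        # Remove the old contribution of this item first (update case).
        for term in set(self.tf.get(item_id, ())):
            self.df[term] -= 1
        self.tf[item_id] = Counter(terms)
        for term in set(terms):
            self.df[term] += 1

    def score(self, query_terms):
        # Compute tf-idf on the fly, only for the terms in the query.
        n = len(self.tf)
        scores = {}
        for item_id, counts in self.tf.items():
            s = 0.0
            for term in query_terms:
                if counts[term]:
                    s += counts[term] * math.log(n / self.df[term])
            if s > 0:
                scores[item_id] = s
        return scores

index = IncrementalIndex()
index.upsert("pid:123", ["propos", "part", "propos"])
index.upsert("pid:456", ["unrelated", "part"])
print(sorted(index.score(["propos", "part"])))  # ['pid:123']
```

Note that "part" contributes nothing here because it appears in every item (idf = log(2/2) = 0), which is why only `pid:123` scores above zero.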
Listed below are the things I did and my findings:
I set the `incrementalWeightsComputation` configuration option to `True`. I then sent a `POST` request to the `/items` endpoint which included the documents. One of them has id `pid:123` and a summary which starts like this: "This proposal is part of a ...". I then sent a `POST` request to the `/score` endpoint with the following JSON: `{"query": "This proposal is part of a"}`. The query matches the item with id `pid:123` and summary "This proposal is part of a ..."; however, as shown below, I did not get any items back.
```json
{
  "request": {
    "query": "This proposal is part of a",
    "itemIds": [],
    "group": "",
    "limit": -1
  },
  "query": {
    "query": "This proposal is part of a",
    "terms": [
      "propos",
      "part"
    ]
  },
  "scores": [],
  "dimension": 0,
  "computeInProgress": false,
  "started": "2023-02-22T16:12:22.575992",
  "ended": "2023-02-22T16:12:22.581431"
}
```
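For reference, the `terms` in a response like the one above come from tokenising the query, dropping stop words, and stemming. A crude sketch of that pipeline follows; the stop-word and suffix lists are my own toy simplification, not the component's actual stemmer.

```python
# Rough illustration of how "This proposal is part of a" could reduce to
# the terms ["propos", "part"]: stop words are dropped and a stemmer trims
# suffixes. Toy rule set for illustration only.
STOP_WORDS = {"this", "is", "of", "a", "the", "and"}
SUFFIXES = ("ation", "al", "ing", "s")  # hypothetical suffix list

def extract_terms(query: str) -> list[str]:
    terms = []
    for token in query.lower().split():
        if token in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

print(extract_terms("This proposal is part of a"))  # ['propos', 'part']
```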
I also tried restricting the request to the item itself by sending the following JSON:

```json
{ "query": "This proposal is part of a", "group": "Documents", "limit": 1000, "itemIds": ["pid:123"] }
```

but I did not get any items back, even though there is an item with id `pid:123` and a summary "This proposal is part of a ...".
It's worth mentioning that the `PATCH /items/<id>` endpoint is not doing what it is expected to do, because it updates the entire item rather than updating only the values of the fields supplied in the request. Also, I can only see the `items` collection in the database, so there are no collections for weights etc.

@VKTB thank you so much for testing the new version and the details. Would you be able to do the following on your testing environment:

- Make a GET on /items and see if you get all your items back
- Make a GET on /terms/count and check how many terms have been extracted
- Make a GET on /terms and check the output.
You could always connect to the database directly and see if there is any entry in the `tf` and `idf` collections.

Let me know.
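On the `PATCH /items/<id>` behaviour reported above: the difference is between a full replace and a partial update. In MongoDB terms a PATCH handler would typically apply `$set` with only the supplied fields rather than replacing the whole document. The two behaviours look like this in plain Python (a sketch, not the component's code):

```python
def replace_item(store: dict, item_id: str, body: dict) -> None:
    # What the endpoint currently appears to do: the whole item is
    # replaced, so any field missing from the request body is lost.
    store[item_id] = dict(body)

def patch_item(store: dict, item_id: str, body: dict) -> None:
    # What PATCH is expected to do: only the supplied fields change,
    # analogous to a MongoDB update with {"$set": body}.
    store[item_id].update(body)

store = {"pid:123": {"group": "Documents",
                     "summary": "This proposal is part of a ..."}}
patch_item(store, "pid:123", {"group": "Proposals"})
print("summary" in store["pid:123"])  # True: summary survives the patch
```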
@nitrosx Thank you for your reply.
> Make a GET on /items and see if you get all your items back

Yes, I can see all the items that I posted to the search scoring component.

> Make a GET on /terms/count and check how many terms have been extracted

I get a `500 - Internal Server Error`.

> Make a GET on /terms and check the output.

I get an empty list back (`[]`), presumably because no terms were created when I inserted the items or when I modified the item with id `id:123`?

> You could always connect to the database directly and see if there is any entry in the tf and idf collections

As I said in my previous comment, I can only see the `items` collection in the database, so there are no collections for weights, tf, idf etc. I am not sure why this is the case.
Hi @nitrosx, I thought I would post my findings here as well in case they are useful to anyone who may be testing/reviewing/working on this PR.
I pulled the latest changes from the `v2.x` branch and modified the `docker-compose.yml` file to build from the `Dockerfile`, to ensure that the Docker image uses the latest code changes. I then tried testing the changes, but I get the following error when I post an item of group `Documents` to the `/items` endpoint: `400 Bad Request – An exception of type TypeError occurred. Arguments:\n('string indices must be integers',)`.
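For context, that `TypeError` is what Python raises when a string is indexed with a string key. A likely cause (an assumption on my part, not confirmed from the code) is a handler that expects a parsed JSON object but is handed a raw string instead. A minimal reproduction:

```python
# Minimal reproduction of the error in the 400 response: indexing a
# string as if it were a dict. This happens, for example, when a handler
# receives an unparsed JSON string where it expected a parsed object.
item = '{"group": "Documents"}'  # a JSON *string*, not a parsed dict

try:
    item["group"]                # works on a dict, fails on a str
except TypeError as exc:
    print(exc)                   # e.g. "string indices must be integers"
```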
I can see that the item gets added to the `items` collection in the database, but judging from the entry (see below) in the `status` collection, the computation seems to be stuck: it hasn't changed for the past 30 minutes.
```js
[
  {
    _id: ObjectId("64a3ed5180fa4d2d2668250e"),
    inProgress: true,
    incrementalWeightsComputation: true,
    progressDescription: 'Computing weights TF',
    progressPercent: 0.2
  }
]
```
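Since the status document carries no timestamp, one way to detect a stalled run (purely a suggestion; the `updated` field below is hypothetical and would require the service to touch a timestamp whenever progress changes) is to record when the status last changed and compare:

```python
from datetime import datetime, timedelta

# Sketch of a staleness check for a status document like the one above.
# The "updated" timestamp is hypothetical: the actual document only has
# inProgress, progressDescription and progressPercent.
STALL_THRESHOLD = timedelta(minutes=30)

def looks_stuck(status: dict, now: datetime) -> bool:
    # A run is suspect if it is still marked in progress but its status
    # has not changed for longer than the threshold.
    return status["inProgress"] and now - status["updated"] > STALL_THRESHOLD

status = {
    "inProgress": True,
    "progressDescription": "Computing weights TF",
    "progressPercent": 0.2,
    "updated": datetime(2023, 7, 4, 10, 0, 0),
}
print(looks_stuck(status, datetime(2023, 7, 4, 10, 45, 0)))  # True
```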
This PR implements incremental weight computation, which should scale better to a higher number of datasets. It still needs a lot of testing and an in-depth review. I created it so that other people can work on it, as I do not have time to work on this at the moment.