ral-facilities / datagateway

DataGateway is a portal that supports discovery of and access to large science facilities data. It is developed as a plugin for SciGateway.
Apache License 2.0

Free text search results cannot be ordered by relevance #1152

Open patrick-austin opened 2 years ago

patrick-austin commented 2 years ago

Description: When performing a free text search, Lucene generates a score associated with each entity ID. For example: [ { id: 2, score: 0.9 }, {id: 3, score: 0.7}, { id: 1, score: 0.5 } ]. These ids are returned, in descending order by score, by: https://github.com/ral-facilities/datagateway/blob/be99284297aaded4f26efdc691c42ade8cb72db9/packages/datagateway-common/src/api/lucene.tsx#L63-L86 We then use these to perform the DB query to get the entities (e.g. /investigations?where={id: {in: [2, 3, 1]}}): https://github.com/ral-facilities/datagateway/blob/be99284297aaded4f26efdc691c42ade8cb72db9/packages/datagateway-search/src/table/investigationSearchTable.component.tsx#L75-L108 However, the order of the returned entities does not reflect this order, and cannot be obtained directly from the DB as it does not have any concept of the Lucene score.
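As a rough illustration of the mismatch, the sketch below reorders DB results to follow the Lucene ID order on the client. The `LuceneHit` and `Entity` shapes and the `orderByLuceneRank` helper are hypothetical and not taken from the datagateway code; only the `id`/`score` example values come from the description above.

```typescript
// Hypothetical shapes: only `id` and `score` appear in the issue text.
interface LuceneHit {
  id: number;
  score: number;
}

interface Entity {
  id: number;
}

// Reorder entities returned by the DB so they follow the order of the Lucene hits.
// Entities whose ID is not in the hit list are dropped.
function orderByLuceneRank<T extends Entity>(hits: LuceneHit[], entities: T[]): T[] {
  const rank: Record<number, number> = {};
  hits.forEach((hit, index) => {
    rank[hit.id] = index;
  });
  return entities
    .filter((entity) => entity.id in rank)
    .sort((a, b) => rank[a.id] - rank[b.id]);
}

// Lucene returns hits in descending order by score.
const hits: LuceneHit[] = [
  { id: 2, score: 0.9 },
  { id: 3, score: 0.7 },
  { id: 1, score: 0.5 },
];

// The DB returns the same entities in its own order (here, by ID).
const fromDb: Entity[] = [{ id: 1 }, { id: 2 }, { id: 3 }];

const ordered = orderByLuceneRank(hits, fromDb);
console.log(ordered.map((entity) => entity.id)); // [2, 3, 1]
```

This only gives the right answer if every entity for the ranked IDs has been fetched, which is where it conflicts with lazy loading.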

Other sorting is done serverside as part of the query, for example the default behaviour of sorting by ID. This has advantages, mainly lazy loading: you don't have to fetch all 300 results, only as many as need to be displayed.

Sorting by score would have to be done clientside using the list of IDs from Lucene. However, we currently lazily load or paginate the results, so only the first 50 by the sorting criteria are loaded in the table. There is no guarantee that these are the best results, so even sorting clientside by score on these would not necessarily give the best results first.
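A small worked example of this problem, using made-up IDs and scores and a page size of 3 standing in for 50 (all values here are assumptions for illustration only):

```typescript
interface LuceneHit {
  id: number;
  score: number;
}

// Six hypothetical hits, in descending score order as Lucene would return them.
const hits: LuceneHit[] = [
  { id: 6, score: 0.9 },
  { id: 4, score: 0.8 },
  { id: 5, score: 0.7 },
  { id: 1, score: 0.3 },
  { id: 3, score: 0.2 },
  { id: 2, score: 0.1 },
];

const pageSize = 3; // stands in for the 50 rows loaded in the table

// The DB sorts by ID and only the first page is loaded.
const loadedPage = [...hits].sort((a, b) => a.id - b.id).slice(0, pageSize);

// Sorting that loaded page by score clientside...
const sortedLoaded = [...loadedPage]
  .sort((a, b) => b.score - a.score)
  .map((hit) => hit.id);

// ...does not give the same rows as the true top results by score.
const trueTop = hits.slice(0, pageSize).map((hit) => hit.id);

console.log(sortedLoaded); // [1, 3, 2]
console.log(trueTop); // [6, 4, 5]
```

The loaded page holds the three lowest scoring hits, so no clientside sort of those rows can surface the best matches.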

Possible changes:

If sorting by score was implemented, how would this interact with the table filters/sort? For sorting by date say, we now wouldn't want only the best 50 ids by score, but all 300 for that DB query (which wouldn't be available in the 2nd approach). In both the latter two approaches, we would have to switch from manually sending the IDs in batches based on score to sending all of them whenever a table filter was applied. However it's worth noting that even now, sorting by date won't give the most recent results matching the search query, it will give the most recent of the 300 results that best matched the search query. So already this is perhaps not what a user would expect. While not currently implemented, Lucene and its derivatives support sorting by non-score fields (i.e. we could achieve the former behaviour in principle by relying on Lucene for both searching and sorting).

Filtering might also be problematic, as if I only query on the top 50 results by score, with a filter which removes 49 of those results, I'm going to need to send another query to get more results straight away. Currently, as we send all 300 ids as part of the query you get 50 results that definitely match the filter without need for subsequent queries. Having said this, if the user wanted a more accurate result, they could do this using the free text search itself provided the relevant fields are indexed in Lucene.
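To make the extra round trips concrete, here is a minimal sketch of filling one page by sending IDs in score-ordered batches with a table filter applied. The `fillPage` helper, the `FetchBatch` type, the in-memory `db` and the batch sizes are all hypothetical; a real version would await an HTTP request rather than call a synchronous function.

```typescript
interface Entity {
  id: number;
  title: string;
}

// Hypothetical stand-in for a DB query of the form `where id in [...]`
// combined with a table filter. Synchronous here only to keep the sketch short.
type FetchBatch = (ids: number[]) => Entity[];

// Send ranked IDs in batches until the page is full or the IDs run out.
function fillPage(
  rankedIds: number[],
  fetchBatch: FetchBatch,
  pageSize: number,
  batchSize: number
): { rows: Entity[]; queries: number } {
  const rows: Entity[] = [];
  let queries = 0;
  for (let start = 0; start < rankedIds.length && rows.length < pageSize; start += batchSize) {
    const batch = rankedIds.slice(start, start + batchSize);
    const rank: Record<number, number> = {};
    batch.forEach((id, index) => {
      rank[id] = index;
    });
    // Keep the score order within the batch, since the DB does not know it.
    const matched = fetchBatch(batch).sort((a, b) => rank[a.id] - rank[b.id]);
    rows.push(...matched);
    queries += 1;
  }
  return { rows: rows.slice(0, pageSize), queries };
}

// Made-up data: the filter keeps only titles containing "neutron".
const db: Entity[] = [
  { id: 1, title: "neutron A" },
  { id: 2, title: "neutron B" },
  { id: 3, title: "muon A" },
  { id: 5, title: "muon B" },
  { id: 7, title: "x-ray A" },
  { id: 8, title: "neutron C" },
];
const filteredFetch: FetchBatch = (ids) =>
  db.filter((entity) => ids.indexOf(entity.id) !== -1 && entity.title.indexOf("neutron") !== -1);

// IDs in descending score order; the two best hits are removed by the filter.
const rankedIds = [5, 3, 8, 1, 7, 2];
const result = fillPage(rankedIds, filteredFetch, 2, 2);
console.log(result.rows.map((entity) => entity.id), result.queries); // [8, 1] 2
```

With a batch of two, the first query returns nothing after filtering and a second query is needed straight away, whereas sending all six IDs at once would fill the page in a single query.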

Further discussion is welcome.

Acceptance criteria:

joelvdavies commented 2 years ago

On first inspection I thought the second option was more scalable long term, as we may not always have the 300-result limit and we wouldn’t want to harm the client performance too much by sorting everything clientside as in the first option. But it sounds like the filtering wouldn’t work correctly since all 300 are needed anyway. As a result, the third option is the only one that would retain the lazy loading, and sounds like a better combination of both? Ideally, I think it would be good if the search, filtering, and sorting all occurred on the backend for simplicity and performance, but it sounds as though this is not possible with Lucene?

patrick-austin commented 2 years ago

> Ideally, I think it would be good if the search, filtering, and sorting all occurred on the backend for simplicity and performance, but it sounds as though this is not possible with Lucene?

Thanks for your thoughts @joelvdavies. Sorting by score as part of a DB call (in the way we do in the table filters by sort={"title"%3A"asc"} or whatever) isn't possible as the DB has no concept of the score. As I mention in the rambling towards the end, you can do sorting and filtering by fields with Lucene itself in principle. For example, I could search for (extremely relevant neutron scattering data) +id:2400*, returning entities that "match" at least one term from the first phrase whilst also telling Lucene to "filter" and only allow results that have an ID beginning 2400. By default this would return results sorted by score, but Lucene can sort by another field such as date (however this latter sorting is not exposed by our current icat.server/icat.lucene implementation). In this sense Lucene can do sorting and filtering in the backend. However, as long as we need to do a subsequent DB query for the entities, you would still need to sort the entities clientside to match the list of IDs Lucene gave you (which are already ordered and filtered).

Finally, it's worth mentioning that (while our current implementation doesn't allow it) you can also get more information back from Lucene in addition to just the ID. In principle, if you indexed all the relevant data into Lucene, you could avoid doing the subsequent DB call entirely. That would be a substantial change, but it would be an option, and relying on Lucene for all sorting (not just the score sorting) would mean you don't have this current situation where I'm only sorting the top 300 results by score in the front end, rather than the top 300 most recent, alphabetically first etc.