sul-dlss / dlme

Digital Library of the Middle East web application, based on Spotlight
https://dlmenetwork.org/

Contributor item counts by type on contributors page differ from those in the facet view on the home page #1436

Closed jacobthill closed 2 years ago

jacobthill commented 2 years ago

Some of the item counts under type on the contributor page differ from those on the home page. For example, Drawings shows 1574 on the contributor page, but following the link yields only 1096 results.

[Screenshot: Screen Shot 2022-02-08 at 1 50 19 PM]

corylown commented 2 years ago

@jacobthill the statistics page is cached so there is bound to be some drift between the counts when the page was cached and the current state of the index. I don't think this is fixable unless we stop caching the page, and I would expect it was cached to speed up long page load times. To reduce the drift we could expire the cache at some reasonable interval.

@cbeer pointed out that he cleared the cache and was still seeing some difference in counts so I'll investigate further. Caching will inevitably cause differences in counts though.

jacobthill commented 2 years ago

@corylown I think caching might be the problem. If your investigation indicates that it is, I think keeping it in place and implementing a timed recount could be one option for solving this problem. Another option (not sure if this will work technically) is to refresh the counts every time records are added, since that is the only time counts should change. I believe the facet counts are also cached. Are these cached counts accessible via some global variable that could be shared by both places?

corylown commented 2 years ago

@jacobthill Chris and I chatted some about this and he thinks the cache is supposed to be cleared when new data is indexed. Maybe that's not working as expected. Caching and cache expiration are notoriously difficult to get right. I suspect I am going to find something else is not quite right, though. I'll keep at it. I don't think there's a reasonable way to synchronize the caching happening on the results page and the stats page.

corylown commented 2 years ago

Summary

This appears to be a data issue. In cases where cho_edm_type and/or cho_has_type contain multiple values only the first cho_edm_type and first cho_edm_type:cho_has_type pair are used to generate the values of the cho_type_facet field.

Real examples:

{
  "id": "harvard_scw-2164",
  "cho_edm_type.en_ssim": ["Image"],
  "cho_has_type.en_ssim": ["Paintings", "Drawings"],
  "cho_type_facet.en_ssim": ["Image", "Image:Paintings"]
}
{
  "id": "harvard_scw-8498",
  "cho_edm_type.en_ssim": ["Image", "Text"],
  "cho_has_type.en_ssim": ["Drawings", "Manuscripts"],
  "cho_type_facet.en_ssim": ["Image", "Image:Drawings"]
}
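
The pattern in these two records can be reproduced with a small Python sketch. The DLME indexing code is not shown in this thread, so the function name below is hypothetical and the logic is only inferred from the observed output: it keeps the first cho_edm_type value and the first cho_edm_type:cho_has_type pair and drops everything else.

```python
def first_only_type_facet(record):
    """Hypothetical reconstruction of the observed behavior.

    Builds facet values from the FIRST edm type and the FIRST has_type only;
    any additional values in either field are ignored.
    """
    edm_types = record.get("cho_edm_type.en_ssim", [])
    has_types = record.get("cho_has_type.en_ssim", [])
    if not edm_types:
        return []
    facet = [edm_types[0]]
    if has_types:
        facet.append(f"{edm_types[0]}:{has_types[0]}")
    return facet


# Field values copied from the harvard_scw-8498 example above.
record = {
    "id": "harvard_scw-8498",
    "cho_edm_type.en_ssim": ["Image", "Text"],
    "cho_has_type.en_ssim": ["Drawings", "Manuscripts"],
}

print(first_only_type_facet(record))  # ['Image', 'Image:Drawings']
```

"Text" and "Manuscripts" never reach the facet field, which matches what the index holds for this record.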

Analysis

The Item type summary displayed on the statistics page https://dev.dlmenetwork.org/library/statistics is generated via a pivot facet using the values of cho_edm_type and cho_has_type. For reasons that are not clear to me, the links from the item type values displayed on the statistics page use the cho_type_facet field for the click-to-search queries for each listed type. Because of the data issue summarized above, the two fields disagree for multi-valued records, so the counts on the statistics page and the results page are different.
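
To make the mismatch concrete, here is a minimal Python sketch that counts the two example records both ways. It is a simplified model, not the application's code: the pivot is approximated as one count per document for every (cho_edm_type, cho_has_type) combination the document carries, and the function names are made up for illustration.

```python
from collections import Counter
from itertools import product

EDM = "cho_edm_type.en_ssim"
HAS = "cho_has_type.en_ssim"
FACET = "cho_type_facet.en_ssim"

# The two example records from the summary above.
records = [
    {"id": "harvard_scw-2164", EDM: ["Image"], HAS: ["Paintings", "Drawings"],
     FACET: ["Image", "Image:Paintings"]},
    {"id": "harvard_scw-8498", EDM: ["Image", "Text"], HAS: ["Drawings", "Manuscripts"],
     FACET: ["Image", "Image:Drawings"]},
]


def pivot_counts(docs):
    """Approximate a pivot over (edm type, has_type): one count per document per pair."""
    counts = Counter()
    for doc in docs:
        for edm, has in set(product(doc[EDM], doc[HAS])):
            counts[f"{edm}:{has}"] += 1
    return counts


def facet_counts(docs):
    """Count documents per value of the precomputed cho_type_facet field."""
    counts = Counter()
    for doc in docs:
        for value in set(doc[FACET]):
            counts[value] += 1
    return counts


pivot = pivot_counts(records)
facet = facet_counts(records)
print(pivot["Image:Drawings"], facet["Image:Drawings"])  # 2 1
```

The statistics table would report two Drawings under Image, while the link built on cho_type_facet finds only one, which is the same kind of gap as 1574 versus 1096.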

To resolve this problem a few things should happen:

Queries that demonstrate the issue

Unless authorized and tunneled to the DLME Solr Dev instance these will not work

jacobthill commented 2 years ago

@corylown I figured out a safe fix for this. All of the offending records were in the Harvard SCW collection. They have been fixed and the data is being refreshed in dev. I think we still want to fix the inconsistent methods for building the counts on the stats page and the URLs that those counts link to. We will add additional tests to ensure this error doesn't happen again. What do you propose we should do with the URLs and counts? Should both counting methods be the same, or should we simply update the URLs from the stats page counts? I don't personally have a strong opinion, but I don't want users to click on a count URL that says '1457' and ever get to a different number of records.
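
One shape such a test could take is sketched below in Python. This is not the project's actual test suite; the function name is hypothetical, and the only rule it checks is that every cho_has_type value shows up in some "type:subtype" entry of cho_type_facet.

```python
def uncovered_has_types(record):
    """Return cho_has_type values that no 'edm:has' entry in the facet field covers."""
    facet_values = record.get("cho_type_facet.en_ssim", [])
    # Take the part after the first colon of each paired facet value.
    covered = {v.split(":", 1)[1] for v in facet_values if ":" in v}
    return [t for t in record.get("cho_has_type.en_ssim", []) if t not in covered]


# Record as it looked before the data fix (values from the example earlier in the thread).
before_fix = {
    "id": "harvard_scw-2164",
    "cho_edm_type.en_ssim": ["Image"],
    "cho_has_type.en_ssim": ["Paintings", "Drawings"],
    "cho_type_facet.en_ssim": ["Image", "Image:Paintings"],
}

print(uncovered_has_types(before_fix))  # ['Drawings']
```

A record passes when the returned list is empty, so a check like this could run over each collection after transformation and flag any record that would be undercounted in the facet.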

corylown commented 2 years ago

@jacobthill I verified that the data fix looks good on dev. I would recommend that we change the stats page to use the cho_type_facet field to build the table and counts. It would then be consistent with both the type facet on the search page and the existing links on the browse page. With this change the counts should always match (unless we find there are caching issues to resolve) because they will use the same field. If this sounds good, let me know and I will put in the pull request with the changes.

jacobthill commented 2 years ago

@corylown I just confirmed with Wayne and Marcin that this makes sense to us.