Closed jacobthill closed 2 years ago
@jacobthill the statistics page is cached so there is bound to be some drift between the counts when the page was cached and the current state of the index. I don't think this is fixable unless we stop caching the page, and I would expect it was cached to speed up long page load times. To reduce the drift we could expire the cache at some reasonable interval.
@cbeer pointed out that he cleared the cache and was still seeing some difference in counts so I'll investigate further. Caching will inevitably cause differences in counts though.
@corylown I think caching might be the problem. If your investigation indicates that it is, I think keeping it in place and implementing a timed recount could be one option for solving this problem. Another option (not sure if this will work technically) is to refresh the counts every time records are added. This is the only time counts should change. I believe the facet counts are also cached. Are these cached counts accessible via some global variable that can be the same in both places?
@jacobthill -- Chris and I chatted some about this and he thinks the cache is supposed to be cleared when new data is indexed. Maybe that's not working as expected. Caching and cache expiration are notoriously difficult to get right. I suspect I am going to find something else is not quite right, though. I'll keep at it. I don't think there's a reasonable way to synchronize the caching happening on the results page and the stats page.
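For reference, the two cache strategies discussed so far (expire at an interval, or clear when new data is indexed) can be sketched together. This is a minimal illustration in Python, not the application's actual caching code; the `StatsCache` class, its `clear` hook, and the injected `clock` are all hypothetical names.

```python
import time


class StatsCache:
    """Hypothetical cache for the statistics counts.

    The cached value is recomputed when it is older than ttl_seconds,
    or after clear() has been called (e.g. by an indexing job).
    """

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute      # function that runs the expensive count query
        self._ttl = ttl_seconds      # maximum age of a cached value, in seconds
        self._clock = clock          # injectable clock, to make expiry testable
        self._value = None
        self._stored_at = None       # None means "nothing cached"

    def get(self):
        now = self._clock()
        if self._stored_at is None or now - self._stored_at >= self._ttl:
            self._value = self._compute()
            self._stored_at = now
        return self._value

    def clear(self):
        """Drop the cached value; the next get() recomputes it."""
        self._stored_at = None
```

With a short TTL the drift is bounded by the interval; calling `clear()` after each indexing run removes the drift entirely, provided the hook actually fires.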
This appears to be a data issue. In cases where cho_edm_type and/or cho_has_type contain multiple values, only the first cho_edm_type and the first cho_edm_type:cho_has_type pair are used to generate the values of the cho_type_facet field.
Real examples:
{
"id":"harvard_scw-2164",
"cho_edm_type.en_ssim":["Image"],
"cho_has_type.en_ssim":["Paintings", "Drawings"],
"cho_type_facet.en_ssim":["Image", "Image:Paintings"]
}
{
"id":"harvard_scw-8498",
"cho_edm_type.en_ssim":["Image", "Text"],
"cho_has_type.en_ssim":["Drawings", "Manuscripts"],
"cho_type_facet.en_ssim":["Image", "Image:Drawings"]
}
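The behaviour seen in these two records can be reproduced with a small sketch. This is a reconstruction of the observed output in Python, not the actual indexing code; the function names `first_pair_type_facet` and `is_affected` are hypothetical.

```python
EDM = "cho_edm_type.en_ssim"
HAS = "cho_has_type.en_ssim"
FACET = "cho_type_facet.en_ssim"

# The two records quoted above.
records = [
    {"id": "harvard_scw-2164", EDM: ["Image"],
     HAS: ["Paintings", "Drawings"], FACET: ["Image", "Image:Paintings"]},
    {"id": "harvard_scw-8498", EDM: ["Image", "Text"],
     HAS: ["Drawings", "Manuscripts"], FACET: ["Image", "Image:Drawings"]},
]


def first_pair_type_facet(edm_types, has_types):
    """Mimic the observed behaviour: only the first value of each field is used."""
    facet = []
    if edm_types:
        facet.append(edm_types[0])
        if has_types:
            facet.append(f"{edm_types[0]}:{has_types[0]}")
    return facet


def is_affected(record):
    """A record is affected when either source field holds more than one value."""
    return len(record.get(EDM, [])) > 1 or len(record.get(HAS, [])) > 1


for r in records:
    # The stored facet matches what a first-values-only builder would produce.
    assert first_pair_type_facet(r[EDM], r[HAS]) == r[FACET]
```

A check like `is_affected` could be used to list the records whose facet values drop information.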
The Item type summary displayed on the statistics page https://dev.dlmenetwork.org/library/statistics is generated via pivot facet using the values of cho_edm_type and cho_has_type. For reasons that are not clear to me, the links from the item type values displayed on the statistics page use the cho_type_facet field for the click-to-search queries for each listed type. Because of the data issue summarized above, the counts on the statistics page and the results page are different.
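To make the mismatch concrete, here is a small in-memory sketch in Python using the two records quoted earlier. It assumes that a pivot on cho_edm_type and cho_has_type counts every document containing both values, and that the link query matches documents whose cho_type_facet contains the combined value; `pivot_count` and `facet_query_count` are hypothetical helpers, not Solr calls.

```python
EDM = "cho_edm_type.en_ssim"
HAS = "cho_has_type.en_ssim"
FACET = "cho_type_facet.en_ssim"

records = [
    {"id": "harvard_scw-2164", EDM: ["Image"],
     HAS: ["Paintings", "Drawings"], FACET: ["Image", "Image:Paintings"]},
    {"id": "harvard_scw-8498", EDM: ["Image", "Text"],
     HAS: ["Drawings", "Manuscripts"], FACET: ["Image", "Image:Drawings"]},
]


def pivot_count(docs, edm_type, has_type):
    """Statistics page: count documents holding both source values."""
    return sum(1 for d in docs if edm_type in d[EDM] and has_type in d[HAS])


def facet_query_count(docs, value):
    """Results page: count documents whose type facet contains the value."""
    return sum(1 for d in docs if value in d[FACET])


stats_page = pivot_count(records, "Image", "Drawings")
results_page = facet_query_count(records, "Image:Drawings")
print(stats_page, results_page)  # the two numbers differ for this data
```

Both records are Image and Drawings according to the source fields, but only one carries Image:Drawings in cho_type_facet, so the statistics count is higher than the count reached by following the link.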
To resolve this problem a few things should happen:
Fix the data so that the values of cho_edm_type and cho_has_type agree with the values assembled to form cho_type_facet.
Update the statistics page to use cho_type_facet to assemble the browse by item type view and the generated query links. This change would mask, but not fix, the underlying data issue.
Unless authorized and tunneled to the DLME Solr Dev instance these will not work.
@corylown I figured out a safe fix for this. All of the offending records were in the Harvard SCW collection. They have been fixed and the data is being refreshed in dev. I think we still want to fix the inconsistent methods for building the counts on the stats page and the URLs that those counts link to. We will add additional tests to ensure this error doesn't happen again. What do you propose we should do with the URLs and counts? Should both counting methods be the same or should we simply update the URLs from the stats page counts? I don't personally have a strong opinion but I don't want users to click on a count URL that says '1457' and ever get to a different number of records.
@jacobthill I verified that the data fix looks good on dev. I would recommend that we change the stats page to use the cho_type_facet field to build the table and counts. It would then be consistent with both the type facet on the search page and the existing links on the browse page. With this change the counts should always match (unless we find there are caching issues to resolve) because they will use the same field. If this sounds good let me know and I will put in the pull request with the changes.
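A rough sketch of why this keeps the numbers aligned, written in Python rather than the application's own code: when the table and the link query both read cho_type_facet, each displayed count equals the number of records the link returns. The helpers `type_table` and `search_by_type` are hypothetical, and the sample records use the facet values as originally reported.

```python
from collections import Counter

FACET = "cho_type_facet.en_ssim"

records = [
    {"id": "harvard_scw-2164", FACET: ["Image", "Image:Paintings"]},
    {"id": "harvard_scw-8498", FACET: ["Image", "Image:Drawings"]},
]


def type_table(docs):
    """Stats page table: one count per distinct cho_type_facet value."""
    return Counter(value for d in docs for value in set(d[FACET]))


def search_by_type(docs, value):
    """Link target: records whose cho_type_facet contains the value."""
    return [d for d in docs if value in d[FACET]]


table = type_table(records)
for value, count in table.items():
    # Same field on both sides, so the displayed count matches the result size.
    assert count == len(search_by_type(records, value))
```

The check in the loop is the kind of assertion an automated test could make against the real index, leaving caching as the only remaining source of drift.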
@corylown I just confirmed with Wayne and Marcin that this makes sense to us.
Some of the item counts under type on the contributor page differ from those on the home page. For example, Drawings shows 1574 on the contributor page, but following the link yields only 1096.