Closed jacobthill closed 2 years ago
@jacobthill I know the desire to sort by date came up in the context of the interviewee looking at a browse category, but I'd suggest the application update we might want here is to add Date as a search results sort field. I believe we use the same sort field configuration for all search results, whether displayed on a browse category results page or a normal search results page.
In other words, I think a user who wants to sort items by date on a browse category results page would be just as likely to want to sort items by date when doing a normal search.
Thanks @ggeisler, that makes sense to me. I'll update this ticket.
Some potential challenges with implementing this feature are the limitations of the metadata. Some records will not have normalized dates, though the vast majority do. Others will have wide ranges (e.g. Cambridge records are 6th to 19th century). We will need to choose either the earliest or latest year to sort on. Probably makes sense to choose the earliest year. We will also need to provide some context to users about some of the features and the data quality. We should probably have an about page that gives enough context about how the metadata gets into DLME, some limitations, potentially confusing results from some features, and what to do when encountering errors. Overall, it seems like it will still be useful and could lead to better metadata in the long run. We will need to figure out how to manage user expectations given the poor date data.
Date information in DLME in Solr is stored in two different fields:
- `cho_date_range_norm_isim`
- `cho_date_range_hijri_isim`
These are both multi-valued fields of type `solr.TrieIntField`. I think for the purposes of sorting we can pick one, since I would expect the sort order to be the same with either field. (If I am wrong then we may need to consider/configure both fields.)
Solr does provide the ability to supply a multi-valued field to the sort parameter, but the behavior is a bit quirky and there's an unfortunate bug. The DLME Solr schema does not have the `sortMissingLast=true` attribute applied to the `int` field type, so I will describe the behavior for the current configuration and then the behavior if we were to supply this attribute on the `int` type definition in the schema.
The default sort behavior on multi-valued int fields is equivalent to explicitly sorting using the 2 argument field() function: `sort=field(name,min) asc` and `sort=field(name,max) desc`. So, when sorting in ascending order the minimum value is used, and when sorting in descending order the maximum value is used. This is somewhat confusing behavior.
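To make the default behavior concrete, here is a small Python sketch that mimics it on in-memory data rather than querying Solr. The record ids, the year values, and the `sort_like_solr_default` helper are all made up for illustration; only the rule (minimum for ascending, maximum for descending) comes from the description above.

```python
# Hypothetical records: id -> years from a multi-valued date range field.
records = {
    "a": [1200, 1250],
    "b": [600, 1900],
    "c": [1500],
}

def sort_like_solr_default(docs, descending=False):
    """Mimic the default multi-valued sort: min value for asc, max for desc."""
    pick = max if descending else min
    return sorted(docs, key=lambda doc_id: pick(docs[doc_id]), reverse=descending)

ascending_ids = sort_like_solr_default(records)
descending_ids = sort_like_solr_default(records, descending=True)
print(ascending_ids)
print(descending_ids)
```

With this sample data, record `b` (a wide range) comes first in both directions, which is the kind of result that could confuse users.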
With the `TrieIntField` (when `sortMissingLast` is not set), when relying on the default behavior, the records with missing values sort FIRST when ASC and sort LAST when DESC. However, if the `sortMissingLast` attribute is set to `true` then the records with missing values always sort last.
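The effect of `sortMissingLast` on records without dates can be sketched the same way. This is an in-memory imitation under the assumptions stated above, not a Solr call; `sort_with_missing` and the sample records are hypothetical.

```python
def sort_with_missing(docs, descending=False, sort_missing_last=False):
    """Order ids by date, placing records without any year as described above."""
    present = [doc_id for doc_id in docs if docs[doc_id]]
    missing = [doc_id for doc_id in docs if not docs[doc_id]]
    pick = max if descending else min
    ordered = sorted(present, key=lambda doc_id: pick(docs[doc_id]), reverse=descending)
    # Without sortMissingLast, missing values come first only when ascending.
    if sort_missing_last or descending:
        return ordered + missing
    return missing + ordered

# Hypothetical records; "b" has no normalized date.
records_with_gap = {"a": [1200, 1250], "b": [], "c": [600, 1900]}

print(sort_with_missing(records_with_gap))
print(sort_with_missing(records_with_gap, descending=True))
print(sort_with_missing(records_with_gap, sort_missing_last=True))
```

Only the ascending sort changes when the attribute is set: the undated record moves from the front of the results to the end.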
You are supposed to be able to explicitly pass the field function to the sort parameter to control whether the min or max value of the field is used for sorting, but there is a bug in Solr where supplying the `field()` function to `sort` with the `TrieIntField` causes an error, as described here: https://issues.apache.org/jira/browse/SOLR-12457.
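For reference, this is roughly what the explicit form of the sort parameter would look like when built as request parameters. The sketch only constructs the query string and does not contact Solr; the helper name and the `q=*:*` query are assumptions for illustration, and per the bug linked above a request like this is expected to fail against a `TrieIntField`.

```python
from urllib.parse import urlencode

def explicit_sort_params(field_name, selector, direction):
    """Build Solr params that sort on the min or max value of a multi-valued field."""
    if selector not in ("min", "max"):
        raise ValueError("selector must be 'min' or 'max'")
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    # q=*:* is a placeholder query used here for illustration only.
    return {"q": "*:*", "sort": f"field({field_name},{selector}) {direction}"}

params = explicit_sort_params("cho_date_range_norm_isim", "min", "asc")
query_string = urlencode(params)
print(query_string)
```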
All that said, it may end up making more sense to create a single valued field specifically for sorting by date that contains either the min or max value from one of the date range fields.
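As a rough idea of what deriving such a field could look like in a transform step, here is a minimal Python sketch. The `add_sort_year` helper and the `date_sort_year` target field name are placeholders and not part of the DLME transform code; records without any years are left without the sort field.

```python
def add_sort_year(record, source_field, target_field, pick=min):
    """Return a copy of the record with a single-valued year taken from a range field."""
    years = record.get(source_field) or []
    enriched = dict(record)
    if years:
        # pick is min for the earliest year, max for the latest.
        enriched[target_field] = pick(years)
    return enriched

# Hypothetical records for illustration.
dated = {"id": "example-1", "cho_date_range_norm_isim": [1850, 1851, 1852]}
undated = {"id": "example-2"}

dated_out = add_sort_year(dated, "cho_date_range_norm_isim", "date_sort_year")
undated_out = add_sort_year(undated, "cho_date_range_norm_isim", "date_sort_year")
print(dated_out)
print(undated_out)
```

Leaving the field absent for undated records means their position in results depends on the missing-value behavior described above.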
`cho_date_range_norm_isim` field (or a date derived from it) for sorting or would the sort order be different with the `cho_date_range_hijri_isim` field? (Expect these are equivalent for sorting, but need to confirm.)
The following concerns (from above) should probably be handled as separate issues once the date sort feature is implemented:
We will also need to provide some context to users about some of the features and the data quality. We should probably have an about page that gives enough context about how the metadata gets into DLME, some limitations, potentially confusing results from some features, and what to do when encountering errors.
@corylown thank you for this very detailed analysis. It's really helpful.
`cho_date_range_norm_isim` but if that will require any level of complexity we can do it during transform.
`cho_date_range_norm_isim` or the lowest value (should also be the first value) from it.
Definitely agree that we need to think about the contextual information we provide to users on the about page. I also wonder if there are things that we can do in the UI on this page to indicate that dates are sorted based on the earliest year and that records missing years will be at the end. @ggeisler I opened a new ticket and added the `design_needed` tag: https://github.com/sul-dlss/dlme/issues/1470. In addition to that I will include this in our search tips page revisions.
@jacobthill I understand the ideas of displaying missing values at the end and using the min date value, but am unclear what the second part of this means:
I think we do want to use the min function for sorting; ascending order definitely makes more sense than descending order.
Are we only providing the user with a single sort option (`Sort by Date (old to new)`) or are we offering to sort dates in either direction (also providing `Sort by Date (new to old)`)?
@ggeisler I assumed we were choosing one sort direction: `Sort by Date (old to new)`. I am open to having both depending on the technical complexity and design considerations. @corylown please clarify if it will be easy to do both. I think we would have to pull two separate field values, the earliest date for `Sort by Date (old to new)` and the latest date for `Sort by Date (new to old)`. I also can't think of a use at the moment for the latter, though maybe people would want to sort some categories by more recent content.
@jacobthill I think it would be best to start from how we'd prefer date sort to work and the design we'd like and work back to technical implementation concerns and complexity. My assessment at the moment is that any solution is likely to involve some changes to the Solr schema, possibly an additional date field for sort (either defined in the transform step, or possibly, as a copy field directive in the Solr schema), and some configuration/design on the front end. I'd be happy to have a brief meeting about this to talk it through. The options available and trade-offs in complexity are enough to warrant some discussion.
To be clear, the design in #1470 will work either way (only sort by old to new, or both options). If we only want to offer old to new, that's just the only date sort option we offer to the user.
I don't think we can predict that no user will have a use case for wanting to see results sorted by new to old, even if it is the less common case, so that's why I asked the question. It's a little odd to only offer sorting a field in one direction. But if there are reasons not to offer both, that's fine with me since we'd at least be covering the primary use case.
Decision from planning: offer both `Sort by Date (old to new)` and `Sort by Date (new to old)`, unless it becomes hairy to implement.
@jacobthill here's a summary of my current understanding of how we'd like date sort to behave. I'd like confirmation from you that this matches your expectations before proceeding with implementation. If this all seems right, I'll spin out some additional tickets because there are multiple moving parts and the order they are completed is important.
- `Sort by Date (old to new)`. Ascending order sort uses the minimum date value derived from `cho_date_range_norm_isim` for sorting.
- `Sort by Date (new to old)`. Descending order uses the maximum date value derived from `cho_date_range_norm_isim` for sorting.
- `sortMissingLast=true` attribute on the field types used to store dates for sorting (fields specified in the next tasks). Apply this schema change to dev Solr instance. (Need to determine how to do this.)
- `TrieIntField`, with dynamic field suffix of `_isi` and contains the minimum value from `cho_date_range_norm_isim`. Suggested field name: `cho_date_norm_min_isi`
- `TrieIntField`, with dynamic field suffix of `_isi` and contains the maximum value from `cho_date_range_norm_isim`. Suggested field name: `cho_date_norm_max_isi`
- `sortMissingLast=true` attribute) it is best practice to reindex all content in Solr.
@corylown yes that sounds right to me.
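The agreed behavior amounts to a small mapping from each sort option label to a Solr sort value on one of the suggested single-valued fields. The Python sketch below shows that mapping in isolation; the `SORT_OPTIONS` table and `solr_sort_for` helper are illustrative only, and how the application actually configures its sort fields is not shown in this thread.

```python
# Labels and field names are taken from the summary above; the mapping itself is a sketch.
SORT_OPTIONS = {
    "Sort by Date (old to new)": "cho_date_norm_min_isi asc",
    "Sort by Date (new to old)": "cho_date_norm_max_isi desc",
}

def solr_sort_for(label):
    """Return the Solr sort parameter value for a sort option label."""
    try:
        return SORT_OPTIONS[label]
    except KeyError:
        raise ValueError(f"unknown sort option: {label}") from None

old_to_new = solr_sort_for("Sort by Date (old to new)")
new_to_old = solr_sort_for("Sort by Date (new to old)")
print(old_to_new)
print(new_to_old)
```

Because both target fields are single-valued, neither sort relies on the multi-valued `field()` behavior discussed earlier.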
Add date as a sort option for search results and the browse category page.