Closed mcritchlow closed 7 years ago
@mcritchlow I've updated the Solr schema: https://github.com/ucsdlib/dams5-cc-pilot/blob/master/solr/config/schema.xml But after restarting Solr, the schema I see in development is still the old file: http://localhost:8983/solr/#/hydra-development/files?file=schema.xml
Do you know if the above is the correct file path to change the schema.xml?
@mcritchlow Never mind. I figured it out.
@mcritchlow I’ve tried adding the following filters to the analyzer in the Solr schema.xml:
To pad numbers for lexical sorting,
<filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)" replacement="00000$1" replace="all"/>
To ignore case, <filter class="solr.LowerCaseFilterFactory"/>
To ignore punctuation, <filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
or
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="0" generateNumberParts="0"/>
Each time, I restarted Solr, checked the Solr admin to make sure schema.xml had been updated, and then reindexed the data. But it doesn’t seem to work: the Blacklight facet sort still shows the default lexical ordering, which is case-sensitive and includes punctuation.
Any suggestions?
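For readers following along, the intended effect of the three filters quoted above can be approximated outside Solr. The Python below is only a rough sketch, not Solr's implementation: `emulate_filters` is a hypothetical helper, and it assumes the whole facet value reaches the filters as a single token and that the filters run in the order listed.

```python
import re

def emulate_filters(value: str) -> str:
    """Apply rough Python equivalents of the three quoted filters, in order."""
    # PatternReplaceFilterFactory: prefix every run of digits with "00000"
    value = re.sub(r"(\d+)", r"00000\g<1>", value)
    # LowerCaseFilterFactory: fold to lower case
    value = value.lower()
    # PatternReplaceFilterFactory: drop anything that is not a letter or digit
    value = re.sub(r"[^A-Za-z0-9]", "", value)
    return value

for original in ["Mom's", "OCEAN", "2", "10"]:
    print(original, "->", emulate_filters(original))
```

Printing the transformed values next to the originals makes it easy to compare against what the Solr admin analysis screen reports for the same inputs.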
Have you been able to test directly in the Solr instance to confirm that it is Blacklight that is somehow overriding the configuration changes you've made to schema.xml?
If we know for sure that it's Blacklight, maybe a question in the Hydra dev Slack channel might get a quick answer? I'm not sure how active the Blacklight IRC channel is at this point. @lsitu or @VivianChu, any ideas?
@mcritchlow @hweng I think it's a good idea to test it directly in Solr to see whether it works first.
@mcritchlow It doesn't seem that Blacklight overrides the facet sorting, but I will check again.
@mcritchlow I've double-checked the Blacklight facets module and it doesn't override facet sorting. It uses the default facet index order, which I confirmed by comparing it with the result of the same query run from the Solr admin. But I have some ideas about it and am trying a new approach. Thanks!
@mcritchlow I got the updates to the Solr schema working for case-insensitive sorting, removing punctuation, and forcing numbers to sort numerically. Here is the result:
A question about overriding Solr's default lexical sorting: how many zeros do we want to left-pad a number with?
@arwenhutt Any suggestions for the above question?
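One way to think about the padding question: in a plain string comparison, left-padding only produces numeric order if every number is padded to the same total width, so the width needs to be at least as long as the longest run of digits expected in the data. The sketch below is an illustration in Python rather than Solr configuration; `constant_prefix`, `fixed_width`, and the default width of 6 are assumptions made for the example.

```python
import re

def constant_prefix(value: str) -> str:
    """Prefix each run of digits with a fixed string of five zeros."""
    return re.sub(r"(\d+)", r"00000\g<1>", value)

def fixed_width(value: str, width: int = 6) -> str:
    """Left-pad each run of digits with zeros up to a common width."""
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), value)

values = ["10", "2", "1999"]
print(sorted(values, key=constant_prefix))
print(sorted(values, key=fixed_width))
```

With these sample values the constant prefix leaves the order as 10, 1999, 2 (the same as the unpadded lexical order), while the fixed width gives 2, 10, 1999. If the numeric facet values are mostly four-digit years, a width of four or more would be enough under this approach.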
@hweng a quick question on the case sensitivity change, and potentially the numerical padding/sorting. are the displayed values to the end user changing, or just the schema/sorting configuration?
For example, if an original facet value was "OCEAN", is it now going to be displayed to an end user as "ocean"? Or will it still be shown as "OCEAN" but sorted as if it were "ocean"? I believe the latter is what is going to be desirable, as I think I recall @arwenhutt noting that the capitalized values are that way for a reason.
@mcritchlow Yes, the facet value that Solr is sorting on would display as "ocean". As for "Or will it still be shown as "OCEAN" but sorted as if it were "ocean"?": no, Solr just won't do that.
But since I've applied the filters only to the facet values and not to the records, when the user clicks the facet link through to a record, the record still preserves the original case in those fields.
@hweng Thanks for clarifying 👍 I think that distinction is very important information for everyone to know when considering whether this solution will work for us.
@arwenhutt @gamontoya - will that be acceptable?
@mcritchlow @arwenhutt @gamontoya From my research into the Solr sorting options, the workaround I applied is to keep the records' original fields and only apply the filters to the facets that Solr is sorting on. Here you can see that the records still preserve the original values "2", "10", and "Mom's", while the facets have had punctuation removed and padding added for sorting purposes:
@mcritchlow I'm not sure I captured this correctly. Are you saying that a topic in all caps, like OCEAN, would sort as "ocean" and would also display as "ocean" and not "OCEAN"?
@hweng On your topic sort example above, are you purposely asking numerical values to appear first?
@gamontoya - I think @hweng's example above does a good job illustrating the solution that she has come up with. Basically, the facet sort isn't ignoring case. It's creating lowercased facet values and leaving the show values the same: "moms" vs "Mom's".
I wanted to clarify this for everyone, since I'm not sure this is a desirable outcome.
@gamontoya Yes, in Solr sorting the numerical values appear before any letters a - z.
@hweng Now that part, I'm not sure I like. I prefer the numerical values after A-Z. @mcritchlow @arwenhutt Thoughts?
@arwenhutt @gamontoya In the DAMS4 data, the numerical values are mostly years. Please see the following screenshot. You may not see it from the browse page, but you can view it by typing the URL directly: http://library.ucsd.edu/dc/search/facet/subject_topic_sim?facet.sort=index
@arwenhutt @gamontoya If the facet sorting updates look good to you, I will create the pull request for them. If you have any other thoughts, would you please comment here? Thank you!
@hweng Can you sort alphabetic followed by numeric?
@gamontoya I thought Matt had already explained that Solr facet sorting doesn't have an option for that, so you cannot sort alphabetic values followed by numeric ones. It has two options: sort by index and sort by count. Sort by index is a lexical sort in which numerals come before letters. If you want a very customized facet sort that does not use Solr sorting and the Blacklight modules, that would be another project.
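To give a sense of what such a customized ordering would involve, here is a minimal sketch of an application-side sort that places alphabetic values before numeric ones. It is plain Python, not Solr or Blacklight code; `letters_first_key` is a hypothetical helper, and it assumes the facet values have already been fetched as a list of strings.

```python
def letters_first_key(value: str):
    """Sort key: values starting with a digit go last; ties ignore case."""
    starts_with_digit = value[:1].isdigit()
    return (starts_with_digit, value.casefold())

facet_values = ["1999", "ocean", "Apple", "2"]
print(sorted(facet_values, key=letters_first_key))
```

The display values are left untouched here because only the sort key is normalized, which is the behavior that the index-order sort in Solr does not provide on its own.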
@hweng No new project here. Go ahead and make the pull request and we'll see how things look/behave.
@gamontoya Thanks! A pull request has been submitted to https://github.com/ucsdlib/horton/pull/6
The group decided not to implement the above Solr filters for facet ordering for now. We will revisit this in the future.
Discussing this with Schol Comm today: is it possible to make the filter case-insensitive, and to apply case folding to the facets at the time the keyword list is generated? I'm not sure if this is what you were referring to above as rules for automatically merging keywords with different cases. It seems like the cases in which this would produce undesirable results (e.g., the acronym "OCEAN" gets collapsed with the keyword "ocean") would occur less frequently than the cases where the records are not being correctly aggregated due to unintentional differences in capitalization.
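As a rough illustration of the case-folding idea, the sketch below merges keyword variants that differ only in capitalization and keeps the most frequent spelling as the display label. It is plain Python and not tied to Solr or Blacklight; `merge_by_case` is a hypothetical helper, and choosing the most common variant as the label is an assumption made for the example.

```python
from collections import Counter, defaultdict

def merge_by_case(values):
    """Group values by case-folded form; return {display label: total count}."""
    variants = defaultdict(Counter)
    for value in values:
        variants[value.casefold()][value] += 1
    merged = {}
    for forms in variants.values():
        # Use the most frequent spelling as the label for the merged group.
        display, _ = forms.most_common(1)[0]
        merged[display] = sum(forms.values())
    return merged

keywords = ["ocean", "Ocean", "ocean", "OCEAN", "Kelp"]
print(merge_by_case(keywords))
```

Note that the acronym "OCEAN" is folded into the "ocean" group in this sample, which is exactly the trade-off described above.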
This issue was raised in the Review meeting from Sprint 21. Questions: