uclibs / ucrate

Scholar@UC: University of Cincinnati's self-submission institutional repository
https://scholar.uc.edu

Trailing whitespace causes multiple facet entries #341

Open hortongn opened 6 years ago

hortongn commented 6 years ago

Descriptive summary

When multiple works have the same content in a faceted field, but one of those works has a space character at the end of the content, the application treats it as a different facet.

Example:

When viewing Subject facets on the catalog index page, "Biology" is listed twice. We instead want the whitespace to be stripped off so that "Biology" is listed just once with a count of 2.

It's best to strip off the whitespace when the facet is displayed instead of trying to strip off the whitespace when the work is saved.
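As a rough illustration of the display-time behavior described above, here is a minimal Ruby sketch that merges facet entries whose labels differ only by surrounding whitespace. The `merge_facet_counts` helper and the sample data are hypothetical and are not part of the application or of Blacklight; the sketch only shows the intended result.

```ruby
# Hypothetical helper (not an existing application or Blacklight method):
# merge facet entries whose labels differ only by leading/trailing
# whitespace, summing their counts.
def merge_facet_counts(facet_counts)
  facet_counts.each_with_object(Hash.new(0)) do |(label, count), merged|
    merged[label.strip] += count
  end
end

# Sample data: two works tagged "Biology", one with a trailing space.
raw = [["Biology", 1], ["Biology ", 1], ["Chemistry", 3]]
merged = merge_facet_counts(raw)
# merged is {"Biology" => 2, "Chemistry" => 3}
```

Note that merging only the labels would not be enough on its own: a facet link for "Biology" would still need to match both stored values in the index.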


crowesn commented 6 years ago

https://lucene.apache.org/solr/guide/6_6/update-request-processors.html
https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/TrimFieldUpdateProcessorFactory.html

jamesvanmil commented 6 years ago

"It's best to strip off the whitespace when the facet is displayed"

Based on what I know about Solr, I don't think this is possible.

I think the best case is to prevent bad data from being persisted into Fedora, but we should also be able to re-open to_solr to prevent bad data from going into the index.
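A minimal sketch of what re-opening to_solr could look like, assuming the Solr document is a plain Hash of field names to Strings or Arrays of Strings. `FakeWork`, `StripSolrWhitespace`, and the field names are hypothetical stand-ins for illustration; the real work models and indexers in the application may be structured differently.

```ruby
# Hypothetical stand-in for the object that builds the Solr document.
# In the real application this would be the work model or its indexer.
class FakeWork
  def to_solr
    { "subject_sim" => ["Biology "], "title_tesim" => [" Cells "], "id" => "abc123" }
  end
end

# Strips leading/trailing whitespace from every String value, and from
# every String inside an Array value, in the Hash returned by to_solr.
# Other value types are passed through unchanged.
module StripSolrWhitespace
  def to_solr(*args)
    super.transform_values do |value|
      case value
      when String then value.strip
      when Array  then value.map { |v| v.is_a?(String) ? v.strip : v }
      else value
      end
    end
  end
end

# Re-open the class and prepend the module so its to_solr wraps the original.
class FakeWork
  prepend StripSolrWhitespace
end

doc = FakeWork.new.to_solr
# doc["subject_sim"] is now ["Biology"]
```

This only cleans what goes into the index. The stored value in Fedora would still carry the trailing space, and existing works would need to be reindexed before the duplicate facet entries disappear.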

crowesn commented 6 years ago

I've been experimenting with the Solr config and found that the str field types are not filterable, which I think means we can't fix this in Solr. We would probably need to reopen to_solr in Hyrax, which would be complex. Marking as blocked until we can get together and discuss.