mlibrary / heliotrope

Codebase for Fulcrum, a Samvera-based digital publishing platform built by the University of Michigan Library
https://fulcrum.org
Apache License 2.0

Multi-valued fields that should be singular. Maybe. #685

Closed sethaj closed 7 years ago

sethaj commented 7 years ago

In solr I'm seeing creator_full_name_tesim (monograph or file_set) coming through as a multi-value field like here:

[screenshot: full-name-multi-in-solr]

It looks like it's specifically not supposed to be multi.

https://github.com/mlibrary/heliotrope/blob/master/spec/models/stores_creator_name_separately_spec.rb#L24

How/where is this happening? Is it supposed to be multi or single?

sethaj commented 7 years ago

There seems to be a difference between .to_solr and what's actually in the solr index? to_solr has the single value, but a query returns a multi-value.

pry(main)> FileSet.find('jq085m87j').to_solr['creator_full_name_tesim']
=> "Stanislavsky, Konstantin"
pry(main)> ActiveFedora::SolrService.query("{!terms f=id}jq085m87j")[0]['creator_full_name_tesim']
=> ["Stanislavsky, Konstantin"]

conorom commented 7 years ago

Is this an input-output discrepancy? Everything that ends in 'm' is meant to be multi-valued. https://github.com/projecthydra/hydra-head/wiki/Solr-Schema

In general we (or CC?) use multi-valued fields for all metadata. You can still push whatever you want onto the document, though: most fields ending in 'm' show [] values in your average to_solr results, but not all. Maybe Solr packages the result according to that 'm' when returning documents?

val99erie commented 7 years ago

The short answer:

The fields are stored in fedora as single-value, but you can index them in solr any way you like. In this case, we chose 'stored_searchable', which makes them multi-value (in solr).

We chose 'stored_searchable' because that's what I always choose for fields that should be searchable. Since most fields are multi-value, I probably just didn't think about the fact that it would be more technically accurate to index this field as *_tesi instead of *_tesim.

You might consider it sloppy to index the field in solr as multi-value when we know that it's always single-value, but it should be harmless to index it that way. I don't think we need to change it unless it's causing a problem.

The long answer:

The properties for the first and last name are defined as multiple: false, so they'll be stored in fedora as single-value fields. Those fields are defined here:

https://github.com/mlibrary/heliotrope/blob/master/app/models/concerns/stores_creator_name_separately.rb#L7

And they are being set as single-value here: https://github.com/mlibrary/heliotrope/blob/master/app/models/concerns/stores_creator_name_separately.rb#L21

So that's why it's single-value when you call monograph.to_solr.

But when solr saves the value, it doesn't matter that you set it as a string instead of an array; Solr will respect the config of that dynamic field and store it as an array. 'stored_searchable' translates to *_tesim, which you can see in the rails console:

[19] pry(main)> Solrizer.solr_name('foo', :stored_searchable)
=> "foo_tesim"

And, in our solr config, we defined *_tesim: https://github.com/mlibrary/heliotrope/blob/master/solr/config/schema.xml#L222

where the suffix letters of *_tesim mean: te = English text, s = stored, i = indexed, m = multi-valued.

There's a little more info about the dynamic fields here: https://wiki.apache.org/solr/SchemaXml#Common_field_options
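
To make the suffix convention concrete, here is a minimal sketch in plain Ruby that decodes a field name's suffix into its type and flags. It is only an illustration: `describe_suffix`, `TYPE_CODES` and `FLAG_CODES` are hypothetical names, not part of Solrizer or Solr, and the type table is a partial subset that I'm assuming from the Hydra naming convention.

```ruby
# Illustrative decoder for the Hydra dynamic-field suffix convention.
# Not part of Solrizer or Solr; the tables are a partial, assumed subset.
TYPE_CODES = {
  'te' => :text_en,
  't'  => :text,
  's'  => :string,
  'i'  => :integer,
  'dt' => :date,
  'b'  => :boolean
}.freeze

FLAG_CODES = {
  's' => :stored,
  'i' => :indexed,
  'm' => :multivalued
}.freeze

# Returns a hash such as { type: ..., stored: ..., indexed: ..., multivalued: ... },
# or nil when the suffix does not follow the convention (for example "id").
def describe_suffix(field_name)
  suffix = field_name.split('_').last
  # Try two-letter type codes before one-letter ones so "te" wins over "t".
  codes = TYPE_CODES.keys.sort_by { |code| -code.length }
  type_code = codes.find do |code|
    suffix.start_with?(code) &&
      suffix[code.length..-1].chars.all? { |c| FLAG_CODES.key?(c) }
  end
  return nil if type_code.nil?

  flags = suffix[type_code.length..-1].chars
  result = { type: TYPE_CODES[type_code] }
  FLAG_CODES.each { |letter, name| result[name] = flags.include?(letter) }
  result
end
```

Calling `describe_suffix('creator_full_name_tesim')` reports the field as multi-valued, while the same name ending in `_tesi` does not, which is the whole difference being discussed here.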

sethaj commented 7 years ago

Great, ok thanks, that makes sense. The problem I was running into was that when I built, say, a presenter in a test, its fields sometimes didn't match what I saw in the app (singular vs. multi). So I think I just need to be more intentional/explicit in my tests when it comes to representing solr docs, so that my tests match what's actually happening in the app.
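
One way to be explicit about this in tests is to reshape a to_solr-style hash so it looks like what a query returns. The sketch below is plain Ruby and only an assumption about how one might do it: `as_returned_by_solr` is a hypothetical helper, and it guesses multi-valued fields purely from a suffix ending in 'm' rather than reading the real schema.

```ruby
# Hypothetical test helper: wraps values of fields whose suffix ends in "m"
# in an array, mimicking how the index returns multi-valued fields.
# It relies on the naming convention only and does not consult schema.xml.
def as_returned_by_solr(doc)
  doc.each_with_object({}) do |(field, value), result|
    multi = field.match?(/_[a-z]+m\z/)
    result[field] = multi ? Array(value) : value
  end
end

# A hash shaped like the to_solr output shown earlier in this thread.
from_to_solr = {
  'id' => 'jq085m87j',
  'creator_full_name_tesim' => 'Stanislavsky, Konstantin'
}

indexed_shape = as_returned_by_solr(from_to_solr)
```

Here `indexed_shape['creator_full_name_tesim']` is an array holding the single name, while `id` is left as a plain string, so a presenter built from it sees the same shape as in the app.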

val99erie commented 7 years ago

Another option would be to change the indexed fields from *_tesim to *_tesi, but then, aside from changing the code, you'd also have to re-index the data that already exists in your production app.