sul-dlss / searchworks_traject_indexer

indexing MARC, MODS, and more for SearchWorks
Other
6 stars 1 forks source link

Bad `pub_date_sort` for some really old objects from SDR #1319

Open cbeer opened 7 months ago

cbeer commented 7 months ago

E.g. https://searchworks.stanford.edu/view/qb122dq4313.json has a pub_date_sort of 6000 and ends up sorting wrong.

jcoyne commented 7 months ago

@cbeer you added a comment here indicating that pub_date_sort is deprecated. Do we still need it? https://github.com/sul-dlss/searchworks_traject_indexer/blob/cdad30e01b4e44880af5c68923464adad8a74e2e/lib/traject/config/sdr_config.rb#L170-L172

It looks like pub_year_isi is currently only populated for SDR items: http://sul-solr-prod-b.stanford.edu/solr/#/searchworks-prod/query?q=pub_year_isi:*&q.op=OR&indent=true&fl=id. Compare that with the number that have pub_date_sort http://sul-solr-prod-b.stanford.edu/solr/#/searchworks-prod/query?q=pub_date_sort:*&q.op=OR&indent=true&fl=id

Can we copy this code https://github.com/sul-dlss/searchworks_traject_indexer/blob/cdad30e01b4e44880af5c68923464adad8a74e2e/lib/traject/config/folio_config.rb#L831C11-L831C31 to a pub_year_isi and wait for that to index and then use pub_year_isi in Searchworks?

cbeer commented 7 months ago

Well, we're very much still using it: https://github.com/sul-dlss/SearchWorks/blob/main/app/controllers/catalog_controller.rb#L460-L461

My vague memory is pub_year_isi is all well and good for MODS records, but ambiguity in the MARC makes it harder to do there.

dnoneill commented 7 months ago

This is happening because what is being returned from https://github.com/sul-dlss/stanford-mods/blob/19af5eb4e1be528f3ee71a18c19a3a99fd435d8c/lib/stanford-mods/concerns/origin_info.rb#L47-L49 the stanford mods method is a Stanford::Mods::Imprint::DateRange. And apparently sorting that class returns a 6000. Super odd but okay. The quickest fix would be to update the pub_year_sort_str https://github.com/sul-dlss/searchworks_traject_indexer/blob/2720faaefe9d5c5d021e4cf405734233317c7474/lib/traject/config/sdr_config.rb#L172 to pub_year_int not sure if there are implications of this but it looks like it is the exact same method except it makes sure there is actually a date and if there is a daterange it will pull the data out of it. https://github.com/sul-dlss/stanford-mods/blob/19af5eb4e1be528f3ee71a18c19a3a99fd435d8c/lib/stanford-mods/concerns/origin_info.rb#L19-L38.

cbeer commented 7 months ago

I think the problem with pub_year_int is sorting things with approximate years to the "right" place, e.g. I think we want "19--" to sort after anything with a known year 1000 - 1999.

dnoneill commented 7 months ago

@cbeer are you saying we want pub_date_sort to return a string and pub_year_int is always going to return an int and it won't allow for something like 195x? I am not sure I understand what specifically what in the code for pub_year_int is going to cause a problem. From what I can tell it will it checks to see if it is a daterange. If it is it will get the start or stop date. From there it will it will get the date value of the date from earliest_preferred_date (which is what pub_year_sort_str) is getting. From there is checks that the date is a value, then it will check if the date is an interval if it is it will get the from.year value, otherwise it will get the year value. So I guess I am wondering what part of that will cause problems? Because at the very least we need to integrate some of that into the pub_year_sort_str function.

cbeer commented 7 months ago

Check out the pub_date_sort we use for MARC records, and in particular the way we handle some of the non-numeric data:

https://github.com/sul-dlss/searchworks_traject_indexer/blob/2720faaefe9d5c5d021e4cf405734233317c7474/lib/traject/config/folio_config.rb#L726-L735

I'm not sure how we handle that kind of requirement with a pub_year_int?

dnoneill commented 7 months ago

from what I can tell we aren't doing that at all with pub_year_sort_str it just looks at a bunch for mods fields (dateIssued, dateCreated, dateCaptured) and then returns the smallest date, but it is making an assumption that all those dates are a DateValue field.

https://github.com/sul-dlss/stanford-mods/blob/19af5eb4e1be528f3ee71a18c19a3a99fd435d8c/lib/stanford-mods/concerns/origin_info.rb#L47-L49

cbeer commented 7 months ago

Sure, but the MARC and MODS data are both indexing into the same field so we can give a consistent sort across all records. I'm not sure we can sort by pub_year_int for the MODS stuff and pub_date_sort for the MARC stuff and have things interfile and also continue to sort MARC records according to those rules for hyphens and colons.

dnoneill commented 7 months ago

Do you have time for a huddle on this latter? I get we can't use pub_year_int for this but I would like some context around which part we can't use so I an update the pub_date_sort method so it works correctly.

dnoneill commented 7 months ago

Okay I thought about it. The only other way I could think of was to allow ints but then fix what we are getting from the MARC records so that they are numbers. i.e. if we got 19xx we would convert to 1900 but marc is so variable that I think that would be way more messy and not provide us with a better result. I have updated the BCE conversion to -10000 and I change *-1 to abs just because I think its neater.