Open cbeer opened 7 months ago
@cbeer you added a comment here indicating that pub_date_sort
is deprecated. Do we still need it?
https://github.com/sul-dlss/searchworks_traject_indexer/blob/cdad30e01b4e44880af5c68923464adad8a74e2e/lib/traject/config/sdr_config.rb#L170-L172
It looks like pub_year_isi
is currently only populated for SDR items: http://sul-solr-prod-b.stanford.edu/solr/#/searchworks-prod/query?q=pub_year_isi:*&q.op=OR&indent=true&fl=id. Compare that with the number that have pub_date_sort
http://sul-solr-prod-b.stanford.edu/solr/#/searchworks-prod/query?q=pub_date_sort:*&q.op=OR&indent=true&fl=id
Can we copy this code https://github.com/sul-dlss/searchworks_traject_indexer/blob/cdad30e01b4e44880af5c68923464adad8a74e2e/lib/traject/config/folio_config.rb#L831C11-L831C31 to a pub_year_isi
and wait for that to index and then use pub_year_isi
in Searchworks?
Well, we're very much still using it: https://github.com/sul-dlss/SearchWorks/blob/main/app/controllers/catalog_controller.rb#L460-L461
My vague memory is pub_year_isi
is all well and good for MODS records, but ambiguity in the MARC makes it harder to do there.
This is happening because what is being returned from https://github.com/sul-dlss/stanford-mods/blob/19af5eb4e1be528f3ee71a18c19a3a99fd435d8c/lib/stanford-mods/concerns/origin_info.rb#L47-L49 the stanford mods method is a Stanford::Mods::Imprint::DateRange. And apparently sorting that class returns a 6000. Super odd but okay. The quickest fix would be to update the pub_year_sort_str
https://github.com/sul-dlss/searchworks_traject_indexer/blob/2720faaefe9d5c5d021e4cf405734233317c7474/lib/traject/config/sdr_config.rb#L172 to pub_year_int
not sure if there are implications of this but it looks like it is the exact same method except it makes sure there is actually a date and if there is a daterange it will pull the data out of it. https://github.com/sul-dlss/stanford-mods/blob/19af5eb4e1be528f3ee71a18c19a3a99fd435d8c/lib/stanford-mods/concerns/origin_info.rb#L19-L38.
I think the problem with pub_year_int
is sorting things with approximate years to the "right" place, e.g. I think we want "19--" to sort after anything with a known year 1000 - 1999.
@cbeer are you saying we want pub_date_sort to return a string and pub_year_int is always going to return an int and it won't allow for something like 195x? I am not sure I understand what specifically what in the code for pub_year_int is going to cause a problem. From what I can tell it will it checks to see if it is a daterange. If it is it will get the start or stop date. From there it will it will get the date value of the date from earliest_preferred_date (which is what pub_year_sort_str) is getting. From there is checks that the date is a value, then it will check if the date is an interval if it is it will get the from.year value, otherwise it will get the year value. So I guess I am wondering what part of that will cause problems? Because at the very least we need to integrate some of that into the pub_year_sort_str function.
Check out the pub_date_sort
we use for MARC records, and in particular the way we handle some of the non-numeric data:
I'm not sure how we handle that kind of requirement with a pub_year_int
?
from what I can tell we aren't doing that at all with pub_year_sort_str
it just looks at a bunch for mods fields (dateIssued, dateCreated, dateCaptured) and then returns the smallest date, but it is making an assumption that all those dates are a DateValue field.
Sure, but the MARC and MODS data are both indexing into the same field so we can give a consistent sort across all records. I'm not sure we can sort by pub_year_int
for the MODS stuff and pub_date_sort
for the MARC stuff and have things interfile and also continue to sort MARC records according to those rules for hyphens and colons.
Do you have time for a huddle on this latter? I get we can't use pub_year_int for this but I would like some context around which part we can't use so I an update the pub_date_sort method so it works correctly.
Okay I thought about it. The only other way I could think of was to allow ints but then fix what we are getting from the MARC records so that they are numbers. i.e. if we got 19xx we would convert to 1900 but marc is so variable that I think that would be way more messy and not provide us with a better result. I have updated the BCE conversion to -10000 and I change *-1 to abs just because I think its neater.
E.g. https://searchworks.stanford.edu/view/qb122dq4313.json has a pub_date_sort of
6000
and ends up sorting wrong.