Open peetucket opened 8 years ago
The Solr field is: pub_date (per https://github.com/sul-dlss/sul_exhibits_template/blob/master/app/controllers/catalog_controller.rb#L85)
The pub_date field is populated from stanford-mods method pub_date_facet (per https://github.com/sul-dlss/gdor-indexer/blob/master/lib/gdor/indexer/mods_fields.rb#L51)
stanford-mods defines pub_date_facet here: https://github.com/sul-dlss/stanford-mods/blob/master/lib/stanford-mods/searchworks.rb#L473-#L491
Which refers to pub_date in stanford-mods, which refers to pub_year in stanford-mods, which ... etc.
I think the algorithm goes like this:
Note that these two elements (dateCreated and dateIssued) are allowed to have four attributes, thus:
<dateCreated encoding="w3cdtf" keyDate="yes" point="start" qualifier="approximate">1950</dateCreated>
<dateCreated encoding="w3cdtf" point="end" qualifier="approximate">2007</dateCreated>
No attention is paid to qualifier or keyDate or point attributes.
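To make the four attributes concrete, here is a minimal Python sketch that reads them off the two example elements. It is only an illustration, not the stanford-mods code (which is Ruby). The un-namespaced `originInfo` wrapper and the `date_attributes` helper are assumptions made for the example; real MODS records carry a namespace.

```python
import xml.etree.ElementTree as ET

# The two example elements, wrapped in a bare <originInfo> for parsing
# (assumption: no MODS namespace, to keep the sketch short).
SAMPLE = """<originInfo>
  <dateCreated encoding="w3cdtf" keyDate="yes" point="start" qualifier="approximate">1950</dateCreated>
  <dateCreated encoding="w3cdtf" point="end" qualifier="approximate">2007</dateCreated>
</originInfo>"""

ATTRS = ("encoding", "keyDate", "point", "qualifier")


def date_attributes(xml_text, element="dateCreated"):
    """Return one dict per date element: its text plus the four attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for el in root.iter(element):
        row = {name: el.get(name) for name in ATTRS}  # None when absent
        row["value"] = el.text
        rows.append(row)
    return rows


rows = date_attributes(SAMPLE)
for row in rows:
    print(row)
```

An indexer that looks only at `value` sees two bare years and has no way to tell that they are the approximate start and end of a range.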
The entire universe of profiled MODS objects (a small, bizarre subset of our DOR objects) has the following attributes on the date fields of interest (from https://sul-solr.stanford.edu/solr/mods_profiler/select?rows=0):
mods/originInfo/dateCreated:
<lst name="originInfo_dateCreated_encoding_sim">
<int name="w3cdtf">29761</int>
<int>49605</int> # attribute not present
</lst>
<lst name="originInfo_dateCreated_point_sim">
<int name="end">2088</int>
<int name="start">2088</int>
<int>77268</int>
</lst>
<lst name="originInfo_dateCreated_keyDate_sim">
<int name="yes">30247</int>
<int name="no">82</int>
<int>49075</int>
</lst>
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">2034</int>
<int name="inferred">274</int>
<int name="questionable">185</int>
<int>76873</int>
</lst>
mods/originInfo/dateIssued:
<lst name="originInfo_dateIssued_encoding_sim">
<int name="marc">24796</int>
<int name="w3cdtf">53</int>
<int>54517</int>
</lst>
<lst name="originInfo_dateIssued_point_sim">
<int name="start">10179</int>
<int name="end">10167</int>
<int>69187</int>
</lst>
<lst name="originInfo_dateIssued_keyDate_sim">
<int name="yes">2843</int>
<int>76523</int>
</lst>
<lst name="originInfo_dateIssued_qualifier_sim">
<int name="questionable">6026</int>
<int>73340</int>
</lst>
The collections this includes (some of which were profiled a long time ago): https://sul-solr.stanford.edu/solr/mods_profiler/select?rows=0&facet.field=collection&facet.limit=-1&facet.sort=index
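For anyone repeating the profiling, a small Python sketch of how the facet URL above can be assembled. The base URL and the `rows`, `facet.field`, `facet.limit` and `facet.sort` parameters come from the link. The `facet_url` helper and the optional `fq` filter on `collection` are assumptions for illustration; I have not checked how the per-collection numbers were actually queried.

```python
from urllib.parse import urlencode

BASE = "https://sul-solr.stanford.edu/solr/mods_profiler/select"


def facet_url(facet_field="collection", collection=None):
    """Build a facet-only Solr query URL (rows=0, all facet values, index order)."""
    params = [
        ("rows", 0),
        ("facet.field", facet_field),
        ("facet.limit", -1),
        ("facet.sort", "index"),
    ]
    if collection is not None:
        # Assumption: narrowing to one collection with a filter query.
        params.append(("fq", f"collection:{collection}"))
    return f"{BASE}?{urlencode(params)}"


url = facet_url()
print(url)
print(facet_url("originInfo_dateCreated_qualifier_sim", "feigenbaum"))
```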
Here is the info on just the Feigenbaum collection dates:
no dateIssued elements
<lst name="originInfo_dateCreated_encoding_sim">
<int name="w3cdtf">16874</int>
<int>0</int>
</lst>
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">1680</int>
<int>15194</int>
</lst>
<lst name="originInfo_dateCreated_point_sim">
<int name="end">1673</int>
<int name="start">1673</int>
<int>15201</int>
</lst>
<lst name="originInfo_dateCreated_keyDate_sim">
<int name="yes">16874</int>
<int>0</int>
</lst>
I have now profiled all the DOR mods records for:
And the resulting data:
mods/originInfo/dateCreated:
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">2733</int>
<int name="inferred">462</int>
<int name="questionable">156</int>
<int>80794</int>
</lst>
<lst name="originInfo_dateCreated_encoding_sim">
<int name="w3cdtf">33112</int>
<int name="edtf">173</int>
<int>50860</int>
</lst>
<lst name="originInfo_dateCreated_point_sim">
<int name="end">3031</int>
<int name="start">3010</int>
<int>81114</int>
</lst>
<lst name="originInfo_dateCreated_keyDate_sim">
<int name="yes">33668</int>
<int name="no">403</int>
<int>50447</int>
</lst>
mods/originInfo/dateIssued:
<lst name="originInfo_dateIssued_qualifier_sim">
<int name="questionable">6035</int>
<int name="approximate">41</int>
<int>78069</int>
</lst>
<lst name="originInfo_dateIssued_encoding_sim">
<int name="marc">25749</int>
<int name="w3cdtf">196</int>
<int>58200</int>
</lst>
<lst name="originInfo_dateIssued_point_sim">
<int name="start">10256</int>
<int name="end">10244</int>
<int>73889</int>
</lst>
<lst name="originInfo_dateIssued_keyDate_sim">
<int name="yes">2989</int>
<int>81156</int>
</lst>
Which (profiled) collections have the qualifier attribute on these fields?
mods/originInfo/dateIssued:
<lst name="collection">
<int name="bnf_images">6023</int> # not in prod anywhere
<int name="labor">41</int> # SW prod
<int name="batchelor">5</int> # maps of africa and SW prod
<int name="gould">3</int>
<int name="mclaughlin">3</int>
<int name="norwich">1</int>
</lst>
mods/originInfo/dateCreated:
<lst name="collection">
<int name="feigenbaum">1680</int>
<int name="mclaughlin">440</int> # SW prod - CA as island
<int name="matter">273</int> # SW prod
<int name="menuez">231</int> # not in prod yet
<int name="norwich">207</int> # maps of africa
<int name="mss">175</int> # spotlight and SW prod
<int name="walters">117</int> # SW prod
<int name="shpc">83</int>
<int name="papyri">44</int>
<int name="gse-oa">21</int>
<int name="harrison">20</int>
<int name="fitch">15</int>
<int name="research-data">11</int>
<int name="mclaughlin_malta">7</int>
<int name="GSE_undergrad_honors_theses">3</int>
<int name="me310">3</int>
<int name="anthro">2</int>
<int name="ccrma">2</int>
<int name="film_arts">2</int>
<int name="folding">2</int>
<int name="mpeg">2</int>
<int name="multimedia">2</int>
<int name="vista">2</int>
<int name="baker">1</int>
<int name="dig-hum">1</int>
<int name="geospatial">1</int>
<int name="mccarthy">1</int>
<int name="pleist">1</int>
<int name="scrf">1</int>
<int name="spoke">1</int>
</lst>
Which collections might be affected:
BNF Images: 1/4 of dateIssued have 'questionable' qualifier attrib; no dateCreated fields exist (coll size: 24,666)
<lst name="originInfo_dateIssued_qualifier_sim">
<int name="questionable">6023</int>
<int>18643</int>
</lst>
Feigenbaum: 10% of dateCreated have qualifier attrib; no dateIssued fields exist (coll size: 16874)
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">1680</int>
<int>15194</int>
</lst>
McLaughlin: more than half of dateCreated (440 of 740) have qualifier attrib; all but 3 dateIssued are without (coll size: 740)
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="inferred">289</int>
<int name="questionable">92</int>
<int name="approximate">59</int>
<int>300</int>
</lst>
<lst name="originInfo_dateIssued_qualifier_sim">
<int name="questionable">3</int>
<int>737</int>
</lst>
Matter: most have dateCreated qualifier attrib; no dateIssued fields exist (coll size: 296)
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">268</int>
<int name="questionable">5</int>
<int>23</int>
</lst>
Menuez: just under 3% have dateCreated qualifier; no dateIssued fields exist (coll size: 8343)
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">194</int>
<int name="questionable">37</int>
<int>8112</int>
</lst>
<lst name="originInfo_dateIssued_qualifier_sim">
<int>8343</int>
</lst>
Norwich: 2/3 dateCreated have qual attrib; only 1 dateIssued field exists (coll size: 312)
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="inferred">155</int>
<int name="approximate">45</int>
<int name="questionable">7</int>
<int>105</int>
</lst>
<lst name="originInfo_dateIssued_qualifier_sim">
<int name="questionable">1</int>
<int>311</int>
</lst>
MSS: 3/4 dateCreated have qual attrib; no dateIssued fields exist (coll size: 228)
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">175</int>
<int>53</int>
</lst>
Walters: >1/3 dateCreated have qual attrib; 3 dateIssued fields exist (coll size: 299)
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">117</int>
<int>182</int>
</lst>
<lst name="originInfo_dateIssued_qualifier_sim">
<int>299</int>
</lst>
Papyri: all dateCreated have qual attrib; no dateIssued fields exist (coll size: 44)
<lst name="originInfo_dateCreated_qualifier_sim">
<int name="approximate">44</int>
<int>0</int>
</lst>
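The fractions quoted for each collection can be recomputed straight from the facet blocks. A minimal Python sketch, assuming (as in the "# attribute not present" note earlier in the thread) that an `<int>` with no `name` counts the values lacking the attribute. The `qualifier_share` helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def qualifier_share(lst_xml):
    """Fraction of date values in a facet <lst> that carry the attribute."""
    root = ET.fromstring(lst_xml)
    with_attr = 0
    without_attr = 0
    for el in root.findall("int"):
        count = int(el.text)
        if el.get("name") is None:
            without_attr += count  # unnamed bucket: attribute absent
        else:
            with_attr += count     # named bucket: one attribute value
    total = with_attr + without_attr
    return with_attr / total if total else 0.0


bnf = """<lst name="originInfo_dateIssued_qualifier_sim">
<int name="questionable">6023</int>
<int>18643</int>
</lst>"""

norwich = """<lst name="originInfo_dateCreated_qualifier_sim">
<int name="inferred">155</int>
<int name="approximate">45</int>
<int name="questionable">7</int>
<int>105</int>
</lst>"""

print(f"BNF Images dateIssued: {qualifier_share(bnf):.1%}")
print(f"Norwich dateCreated:   {qualifier_share(norwich):.1%}")
```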
The fixes are deployed and the collections are indexed. All that remains is cleanup: removing the pub_date field from Solr and from the UI code.
:+1:
Some records have ambiguous dates, leading to a date that doesn't make sense being attached to a record. For example:
https://purl.stanford.edu/hk334rq4790.mods
has
This gets picked up by the indexer as "1950", which is not correct.
Possible approach: ignore dates that carry a qualifier attribute of "approximate" or "questionable". In other words, when a date element has one of these qualifier values, just don't index the date at all.
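A rough Python sketch of that approach, for discussion only. It is not how stanford-mods or the indexer is implemented (those are Ruby), and the `indexable_dates` helper, the un-namespaced `originInfo` wrapper and the `plain_date` record are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Qualifier values the proposal would treat as "do not index".
SKIPPED_QUALIFIERS = {"approximate", "questionable"}


def indexable_dates(origin_info_xml, element="dateCreated"):
    """Return the text of date elements whose qualifier is not in the skip set."""
    root = ET.fromstring(origin_info_xml)
    return [
        el.text
        for el in root.iter(element)
        if el.get("qualifier") not in SKIPPED_QUALIFIERS
    ]


# An approximate start/end range: both elements are skipped.
approximate_range = """<originInfo>
  <dateCreated encoding="w3cdtf" keyDate="yes" point="start" qualifier="approximate">1950</dateCreated>
  <dateCreated encoding="w3cdtf" point="end" qualifier="approximate">2007</dateCreated>
</originInfo>"""

# Hypothetical record with an unqualified date: it is kept.
plain_date = """<originInfo>
  <dateCreated encoding="w3cdtf" keyDate="yes">1987</dateCreated>
</originInfo>"""

print(indexable_dates(approximate_range))
print(indexable_dates(plain_date))
```

Note that this sketch leaves "inferred" dates indexable, since the proposal only names "approximate" and "questionable".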
TODO:
Once approach is known:
make sure to coordinate with SW indexing and REVS indexing and argo and ??? for changes to stanford-mods. https://github.com/search?utf8=%E2%9C%93&q=gem+stanford-mods&type=Code&ref=searchresults