sul-dlss-deprecated / spotlight-dor-resources

[DEPRECATED] Harvest Stanford DOR resources into a Spotlight exhibit

When dates have a qualifier attribute, ignore the date field when indexing #53

Open peetucket opened 8 years ago

peetucket commented 8 years ago

Some records have ambiguous dates, leading to a date that doesn't make sense being attached to a record. For example:

https://purl.stanford.edu/hk334rq4790.mods

has

<dateCreated encoding="w3cdtf" keyDate="yes" point="start" qualifier="approximate">1950</dateCreated>
<dateCreated encoding="w3cdtf" point="end" qualifier="approximate">2007</dateCreated>

This gets picked up by the indexer as "1950", which is not correct.

Possible approach: ignore dates when there is a qualifier attribute of "approximate" or "questionable". In other words, when this attribute is found, just don't index the date at all.
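
A minimal sketch of that filtering idea, written in Python with only the standard library rather than in the Ruby of stanford-mods. The helper name `unqualified_dates` and the `SKIP_QUALIFIERS` set are hypothetical, and the sample record is assumed to use the standard MODS v3 namespace; this illustrates the proposal, not the eventual fix.

```python
import xml.etree.ElementTree as ET

# Assumption: records use the standard MODS v3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}
# Qualifier values proposed above as reasons to skip a date.
SKIP_QUALIFIERS = {"approximate", "questionable"}


def unqualified_dates(mods_xml):
    """Return the text of originInfo dates that carry no skipped qualifier."""
    root = ET.fromstring(mods_xml)
    dates = []
    for tag in ("dateCreated", "dateIssued"):
        for el in root.iterfind(f".//mods:originInfo/mods:{tag}", MODS_NS):
            if el.get("qualifier") in SKIP_QUALIFIERS:
                continue  # ambiguous date: do not index it
            if el.text and el.text.strip():
                dates.append(el.text.strip())
    return dates


sample = """<mods xmlns="http://www.loc.gov/mods/v3">
  <originInfo>
    <dateCreated encoding="w3cdtf" keyDate="yes" point="start" qualifier="approximate">1950</dateCreated>
    <dateCreated encoding="w3cdtf" point="end" qualifier="approximate">2007</dateCreated>
  </originInfo>
</mods>"""

print(unqualified_dates(sample))  # [] : both dates are approximate
```

With this rule the example record above would contribute no date at all instead of "1950".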

TODO:

Once approach is known:

make sure to coordinate with SW indexing and REVS indexing and argo and ??? for changes to stanford-mods. https://github.com/search?utf8=%E2%9C%93&q=gem+stanford-mods&type=Code&ref=searchresults

ndushay commented 8 years ago

The Solr field is: pub_date (per https://github.com/sul-dlss/sul_exhibits_template/blob/master/app/controllers/catalog_controller.rb#L85)

The pub_date field is populated from stanford-mods method pub_date_facet (per https://github.com/sul-dlss/gdor-indexer/blob/master/lib/gdor/indexer/mods_fields.rb#L51)

Stanford mods defines pub_date_facet here: https://github.com/sul-dlss/stanford-mods/blob/master/lib/stanford-mods/searchworks.rb#L473-#L491

Which refers to pub_date in stanford-mods, which refers to pub_year in stanford mods, which ... etc.

I think the algorithm goes like this:

  1. Look for dates in these two fields: mods/originInfo/dateCreated and mods/originInfo/dateIssued.

Note that these two elements are allowed to have four attributes, thus:

<dateCreated encoding="w3cdtf" keyDate="yes" point="start" qualifier="approximate">1950</dateCreated>
<dateCreated encoding="w3cdtf" point="end" qualifier="approximate">2007</dateCreated>
  2. Select only those dates that have encoding="marc". Only if there are no such dates do we consider other values in these 2 fields.
  3. For each value string we end up with:
     3a. If the 4-character year has ? chars, change them to 0.
     3b. Get rid of surrounding [ ] if present. Get rid of a trailing ? if present.
  4. Return the first (first in Nokogiri parsing of elements) of:
     - a four-digit year (including, e.g., 1865 if the string is of the pattern '1865-6 CE')
     - a year ending in a single u (which is changed to 0), e.g. 195u becomes 1950
     - a year ending in 2 u, e.g. 18uu becomes 18-- (yes, hyphens)
     and more: see https://github.com/sul-dlss/stanford-mods/blob/master/lib/stanford-mods/searchworks.rb#L435-#L446
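
The steps above can be sketched roughly as follows. This is a Python illustration of my reading of the algorithm, not the stanford-mods code: the function names `normalize_year` and `pick_year` are hypothetical, the regular expressions are my own guesses at the patterns, and "return the first" is interpreted as "the first value, in document order, that yields a year".

```python
import re


def normalize_year(value):
    """Apply steps 3 and 4 to one date string; return a year string or None."""
    s = value.strip()
    # 3a: in the first 4-character year token, change each ? to 0
    s = re.sub(r"\d[\d?]{3}", lambda m: m.group().replace("?", "0"), s, count=1)
    # 3b: drop surrounding [ ] and a trailing ?
    if s.startswith("[") and s.endswith("]"):
        s = s[1:-1]
    if s.endswith("?"):
        s = s[:-1]
    # 4: try the patterns in the order listed above
    m = re.search(r"\d{4}", s)
    if m:
        return m.group()  # e.g. '1865-6 CE' gives '1865'
    m = re.search(r"\d{3}u", s)
    if m:
        return m.group()[:3] + "0"  # 195u becomes 1950
    m = re.search(r"\d{2}uu", s)
    if m:
        return m.group()[:2] + "--"  # 18uu becomes 18--
    return None


def pick_year(dates):
    """dates: (encoding, text) pairs in document order, from both date fields."""
    marc = [text for enc, text in dates if enc == "marc"]
    candidates = marc or [text for _, text in dates]  # step 2: marc wins
    for text in candidates:
        year = normalize_year(text)
        if year:
            return year
    return None


# The record from the top of this issue: qualifier and point are never consulted.
print(pick_year([("w3cdtf", "1950"), ("w3cdtf", "2007")]))  # 1950
```

Because the function only ever sees the encoding and the text, the approximate 1950-2007 range collapses to "1950", which is the behavior reported in this issue.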

No attention is paid to qualifier or keyDate or point attributes.

The entire universe of profiled MODS objects (a small, bizarre subset of our DOR objects) has the following attributes on the date fields of interest (from https://sul-solr.stanford.edu/solr/mods_profiler/select?rows=0):

mods/originInfo/dateCreated:

<lst name="originInfo_dateCreated_encoding_sim">
  <int name="w3cdtf">29761</int>
  <int>49605</int>  # attribute not present
</lst>

<lst name="originInfo_dateCreated_point_sim">
  <int name="end">2088</int>
  <int name="start">2088</int>
  <int>77268</int>
</lst>

<lst name="originInfo_dateCreated_keyDate_sim">
  <int name="yes">30247</int>
  <int name="no">82</int>
  <int>49075</int>
</lst>

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="approximate">2034</int>
  <int name="inferred">274</int>
  <int name="questionable">185</int>
  <int>76873</int>
</lst>

mods/originInfo/dateIssued:

<lst name="originInfo_dateIssued_encoding_sim">
  <int name="marc">24796</int>
  <int name="w3cdtf">53</int>
  <int>54517</int>
</lst>

<lst name="originInfo_dateIssued_point_sim">
  <int name="start">10179</int>
  <int name="end">10167</int>
  <int>69187</int>
</lst>

<lst name="originInfo_dateIssued_keyDate_sim">
  <int name="yes">2843</int>
  <int>76523</int>
</lst>

<lst name="originInfo_dateIssued_qualifier_sim">
  <int name="questionable">6026</int>
  <int>73340</int>
</lst>

The collections this includes (some of which were profiled a long time ago): https://sul-solr.stanford.edu/solr/mods_profiler/select?rows=0&facet.field=collection&facet.limit=-1&facet.sort=index

ndushay commented 8 years ago

Here is the info on just the Feigenbaum collection dates:

no dateIssued elements

<lst name="originInfo_dateCreated_encoding_sim">
  <int name="w3cdtf">16874</int>
  <int>0</int>
</lst>

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="approximate">1680</int>
  <int>15194</int>
</lst>

<lst name="originInfo_dateCreated_point_sim">
  <int name="end">1673</int>
  <int name="start">1673</int>
  <int>15201</int>
</lst>

<lst name="originInfo_dateCreated_keyDate_sim">
  <int name="yes">16874</int>
  <int>0</int>
</lst>
ndushay commented 8 years ago

I have now profiled all the DOR mods records for:

And the resulting data:

mods/originInfo/dateCreated:

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="approximate">2733</int>
  <int name="inferred">462</int>
  <int name="questionable">156</int>
  <int>80794</int>
</lst>

<lst name="originInfo_dateCreated_encoding_sim">
 <int name="w3cdtf">33112</int>
  <int name="edtf">173</int>
  <int>50860</int>
</lst>

<lst name="originInfo_dateCreated_point_sim">
  <int name="end">3031</int>
  <int name="start">3010</int>
  <int>81114</int>
</lst>

<lst name="originInfo_dateCreated_keyDate_sim">
 <int name="yes">33668</int>
  <int name="no">403</int>
  <int>50447</int>
</lst>

mods/originInfo/dateIssued:

<lst name="originInfo_dateIssued_qualifier_sim">
  <int name="questionable">6035</int>
  <int name="approximate">41</int>
  <int>78069</int>
</lst>

<lst name="originInfo_dateIssued_encoding_sim">
  <int name="marc">25749</int>
  <int name="w3cdtf">196</int>
  <int>58200</int>
</lst>

<lst name="originInfo_dateIssued_point_sim">
  <int name="start">10256</int>
  <int name="end">10244</int>
  <int>73889</int>
</lst>

<lst name="originInfo_dateIssued_keyDate_sim">
  <int name="yes">2989</int>
  <int>81156</int>
</lst>
ndushay commented 8 years ago

Which (profiled) collections have the qualifier attribute on these fields?

https://sul-solr.stanford.edu/solr/mods_profiler/select?rows=0&fq=originInfo_dateIssued_qualifier_sim:*&facet.field=collection

<lst name="collection">
  <int name="bnf_images">6023</int>  # not in prod anywhere
  <int name="labor">41</int>  # SW prod
  <int name="batchelor">5</int>  # maps of africa and SW prod
  <int name="gould">3</int>
  <int name="mclaughlin">3</int>
  <int name="norwich">1</int>
</lst>

https://sul-solr.stanford.edu/solr/mods_profiler/select?rows=0&fq=originInfo_dateCreated_qualifier_sim:*&facet.field=collection

<lst name="collection">
  <int name="feigenbaum">1680</int>
  <int name="mclaughlin">440</int>  # SW prod - CA as island
  <int name="matter">273</int>  # SW prod
  <int name="menuez">231</int> # not in prod yet
  <int name="norwich">207</int>  # maps of africa
  <int name="mss">175</int> # spotlight and SW prod
  <int name="walters">117</int> # SW prod
  <int name="shpc">83</int>
  <int name="papyri">44</int>
  <int name="gse-oa">21</int>
  <int name="harrison">20</int>
  <int name="fitch">15</int>
  <int name="research-data">11</int>
  <int name="mclaughlin_malta">7</int>
  <int name="GSE_undergrad_honors_theses">3</int>
  <int name="me310">3</int>
  <int name="anthro">2</int>
  <int name="ccrma">2</int>
  <int name="film_arts">2</int>
  <int name="folding">2</int>
  <int name="mpeg">2</int>
  <int name="multimedia">2</int>
  <int name="vista">2</int>
  <int name="baker">1</int>
  <int name="dig-hum">1</int>
  <int name="geospatial">1</int>
  <int name="mccarthy">1</int>
  <int name="pleist">1</int>
  <int name="scrf">1</int>
  <int name="spoke">1</int>
</lst>
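
For anyone repeating these lookups for other fields, the two query URLs above follow one pattern. A small Python sketch that rebuilds them; the helper name `qualifier_by_collection_url` is hypothetical, and it assumes the profiler field names always have the form `originInfo_<element>_qualifier_sim`, as they do in the two queries shown.

```python
from urllib.parse import urlencode

BASE = "https://sul-solr.stanford.edu/solr/mods_profiler/select"


def qualifier_by_collection_url(date_element):
    """Build the facet query counting, per collection, records with a qualifier."""
    params = [
        ("rows", "0"),  # facet counts only, no documents
        ("fq", f"originInfo_{date_element}_qualifier_sim:*"),
        ("facet.field", "collection"),
    ]
    # Keep ':' and '*' unescaped so the URL matches the ones quoted above.
    return f"{BASE}?{urlencode(params, safe=':*')}"


print(qualifier_by_collection_url("dateCreated"))
```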
ndushay commented 8 years ago

Which collections might be affected:

BNF Images: 1/4 of dateIssued have 'questionable' qualifier attrib; no dateCreated fields exist (coll size: 24,666)

<lst name="originInfo_dateIssued_qualifier_sim">
  <int name="questionable">6023</int>
  <int>18643</int>
</lst>

Feigenbaum: 10% of dateCreated have qualifier attrib; no dateIssued fields exist (coll size: 16874)

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="approximate">1680</int>
  <int>15194</int>
</lst>

McLaughlin: >1/3 of dateCreated have qualifier attrib; dateIssued mostly without (3 of 740 have it) (coll size: 740)

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="inferred">289</int>
  <int name="questionable">92</int>
  <int name="approximate">59</int>
  <int>300</int>
</lst>

<lst name="originInfo_dateIssued_qualifier_sim">
  <int name="questionable">3</int>
  <int>737</int>
</lst>

Matter: most have dateCreated qualifier attrib; no dateIssued fields exist (coll size: 296)

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="approximate">268</int>
  <int name="questionable">5</int>
  <int>23</int>
</lst>

Menuez: maybe 2% have dateCreated qualifier; no dateIssued fields exist (coll size: 8343)

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="approximate">194</int>
  <int name="questionable">37</int>
  <int>8112</int>
</lst>

<lst name="originInfo_dateIssued_qualifier_sim">
  <int>8343</int>
</lst>

Norwich: 2/3 dateCreated have qual attrib; only 1 dateIssued field exists (coll size: 312)

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="inferred">155</int>
 <int name="approximate">45</int>
  <int name="questionable">7</int>
  <int>105</int>
</lst>

<lst name="originInfo_dateIssued_qualifier_sim">
 <int name="questionable">1</int>
 <int>311</int>
</lst>

MSS: 3/4 dateCreated have qual attrib; no dateIssued fields exist (coll size: 228)

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="approximate">175</int>
  <int>53</int>
</lst>

Walters: >1/3 dateCreated have qual attrib; 3 dateIssued fields exist (coll size: 299)

<lst name="originInfo_dateCreated_qualifier_sim">
 <int name="approximate">117</int>
 <int>182</int>
</lst>

<lst name="originInfo_dateIssued_qualifier_sim">
  <int>299</int>
</lst>

Papyri: all dateCreated have qual attrib; no dateIssued fields exist (coll size: 44)

<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="approximate">44</int>
  <int>0</int>
</lst>
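
The per-collection fractions above can be recomputed from the facet blocks themselves. A rough Python sketch, assuming (as annotated in the first profile) that the unnamed `<int>` is the count of records where the attribute is not present; the helper name `qualifier_share` is hypothetical.

```python
import xml.etree.ElementTree as ET


def qualifier_share(lst_xml):
    """Return (with_qualifier, total, fraction) for one Solr facet <lst> block."""
    lst = ET.fromstring(lst_xml)
    with_qualifier = 0
    total = 0
    for el in lst.findall("int"):
        count = int(el.text)
        total += count
        # Assumption: an <int> without a name counts records lacking the attribute.
        if el.get("name") is not None:
            with_qualifier += count
    fraction = with_qualifier / total if total else 0.0
    return with_qualifier, total, fraction


mclaughlin = """<lst name="originInfo_dateCreated_qualifier_sim">
  <int name="inferred">289</int>
  <int name="questionable">92</int>
  <int name="approximate">59</int>
  <int>300</int>
</lst>"""

with_q, total, frac = qualifier_share(mclaughlin)
print(f"{with_q}/{total} = {frac:.0%}")  # 440/740 = 59%
```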
ndushay commented 8 years ago

The fixes are deployed; the collections are indexed. All that remains is the cleanup: removal of the pub_date field from Solr and from the UI code.

peetucket commented 8 years ago

:+1: