sul-dlss / searchworks_traject_indexer

indexing MARC, MODS, and more for SearchWorks

Do something with non-printing control characters in (usually) eloader records #126

Open cbeer opened 6 years ago

cbeer commented 6 years ago

Records:

Or in 26066:

        -"callnum_facet_hsim" => ["Dewey Classification|300s - Social Sciences|300s - Social Sciences", "LC Classification|T - Technology|TX - Home Economics"],
        +"callnum_facet_hsim" => ["Dewey Classification|300s - Social Sciences|300s - Social Sciences", "LC Classification|\u0001|"],

cbeer commented 6 years ago

This seems to affect only a small number of eloader records, so it may be a good candidate for sending to data control (or someone else) to clean up on the Symphony side. Failing that, maybe we can just live with the unescaped control characters?

shelleydoljack commented 6 years ago

I pulled record 26066 out of the full dump file and looked at the hex values for the 999 field with the reported control character:

0440: 4E 4F 1F 75 39 2F 32 37 2F 31 39 37 36 1E 20 20   NO.u9/27/1976.  
0450: 1F 61 54 58 20 33 30 31 2E 31 20 2E 4D 33 36 38   .aTX 301.1 .M368
0460: 52 1F 77 4C 43 1F 63 31 1F 69 33 36 31 30 35 30   R.wLC.c1.i361050
0470: 30 33 36 35 34 36 37 35 1F 6C 53 54 41 43 4B 53   03654675.lSTACKS
0480: 1F 6D 45 44 55 43 41 54 49 4F 4E 1F 72 59 1F 73   .mEDUCATION.rY.s
0490: 59 1F 74 53 54 4B 53 2D 4D 4F 4E 4F 1F 75 39 2F   Y.tSTKS-MONO.u9/
04A0: 31 33 2F 32 30 31 31 1E 1D                        13/2011..

I also ran this MARC file in debug mode with our traject config, and the values for callnum_facet_hsim are:

callnum_facet_hsim        LC Classification|T - Technology|TX - Home Economics | Dewey Classification|300s - Social Sciences|300s - Social Sciences

So maybe something else is going on when the record is written to Solr?
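
One way to double-check the comparison above is to scan the raw bytes for control characters other than the MARC structural delimiters (0x1F subfield delimiter, 0x1E field terminator, 0x1D record terminator). This is only a minimal sketch in plain Ruby: `MARC_DELIMITERS` and `unexpected_control_bytes` are hypothetical names, not part of the indexer, and the sample bytes are just an excerpt of the hex dump starting at offset 0x044E.

```ruby
# MARC (ISO 2709) structural delimiters: record terminator, field
# terminator, subfield delimiter. These are expected in a raw record.
MARC_DELIMITERS = [0x1D, 0x1E, 0x1F].freeze

# Hypothetical helper (an assumption for illustration, not indexer code).
# Returns [byte, offset] pairs for every C0 control byte (0x00-0x1F) or
# DEL (0x7F) that is not one of the MARC delimiters.
def unexpected_control_bytes(raw)
  raw.bytes.each_with_index.select do |byte, _offset|
    (byte < 0x20 || byte == 0x7F) && !MARC_DELIMITERS.include?(byte)
  end
end

# Excerpt of the 999 field from the hex dump (indicators, then $a and $w).
field_999 = "  \x1FaTX 301.1 .M368R\x1FwLC".b

# The facet value reported in the indexed document.
indexed_value = "LC Classification|\u0001|"

p unexpected_control_bytes(field_999)      # => []
p unexpected_control_bytes(indexed_value)  # => [[1, 18]]
```

On this excerpt the raw field contains nothing but the expected delimiters, while the indexed value carries a 0x01 at offset 18. That is consistent with the control character being introduced somewhere after the MARC is read, but the sketch only checks the bytes shown here, not the whole record.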