Normalize punctuation on input

rism-digital / muscat

🗂️ A Rails application for the inventory of handwritten and printed music scores


Some characters need to be normalized (smart quotes vs. apostrophes), but some need to be allowed (u vs. ü). Currently, ’ and ' are read as different punctuation marks.

This causes misalignment in city names in Institutions:

La Seu d’Urgell https://rism.online/institutions/30079707
La Seu d'Urgell https://rism.online/institutions/30005481

and duplicates in Titles/Texts:

https://muscat.rism.info/admin/standard_titles?utf8=%E2%9C%93&q%5Btitle_equals%5D=Au+sein+des+alarmes+l%E2%80%99amour+a+des+charmes&commit=Filter&order=id_desc
Au sein des alarmes l’amour a des charmes
Au sein des alarmes l'amour a des charmes
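To make the mismatch concrete, the short Python sketch below compares the two spellings character by character. It is only an illustration (Muscat itself is a Rails application), and the helper name `differing_chars` is made up for this example. It also shows that standard Unicode normalization (NFKC) leaves the curly apostrophe untouched, so that step alone would not merge the two records.

```python
import unicodedata

curly = "La Seu d\u2019Urgell"   # typographic apostrophe, as in institution 30079707
straight = "La Seu d'Urgell"     # ASCII apostrophe, as in institution 30005481


def differing_chars(a, b):
    """Return (index, name in a, name in b) for each position where a and b differ."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


print(curly == straight)                  # the two strings are not equal
print(differing_chars(curly, straight))   # the apostrophe is the only difference
# NFKC has no mapping from U+2019 to U+0027, so the strings still differ afterwards.
print(unicodedata.normalize("NFKC", curly) == straight)
```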

This arises especially when copying from websites or during data imports. The problem has been solved on the search side (see https://github.com/rism-digital/muscat/issues/622) but not on the input side.

I can think of the following:

’ and '
" " and “ ”
- – — (hyphen, en dash, em dash)

For the dashes, only one is needed in the standardized fields (the plain hyphen, I think).
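The list above could be turned into a small translation table. The following is a minimal Python sketch, not a proposal for the actual Muscat implementation: the names `PUNCTUATION_MAP` and `normalize_punctuation` are invented here, and the table covers only the characters named in this issue. Letters such as ü are not in the table, so they pass through unchanged.

```python
# Only the characters listed in this issue; anything else is left as it is.
PUNCTUATION_MAP = str.maketrans({
    "\u2019": "'",   # right single quotation mark -> apostrophe
    "\u201c": '"',   # left double quotation mark  -> straight double quote
    "\u201d": '"',   # right double quotation mark -> straight double quote
    "\u2013": "-",   # en dash -> hyphen
    "\u2014": "-",   # em dash -> hyphen
})


def normalize_punctuation(text):
    """Replace the typographic punctuation above with its plain ASCII form."""
    return text.translate(PUNCTUATION_MAP)


print(normalize_punctuation("Au sein des alarmes l\u2019amour a des charmes"))
print(normalize_punctuation("M\u00fcnchen \u2013 \u201cOp. 1\u201d"))
```

A single `str.translate` pass is used rather than chained `replace` calls because every entry here is a one-character substitution.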

What about spaces? Sometimes they behave strangely (Excel doesn't always read the spaces as spaces), but I can't describe it further than that.
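One way to pin the space problem down would be to list every space-like character in a value that is not the ordinary space. The Python sketch below does that with the standard `unicodedata` module. It is a guess at the cause, not a diagnosis: the issue does not say which characters are involved, and the function name `find_odd_spaces` is hypothetical.

```python
import unicodedata


def find_odd_spaces(text):
    """List (index, code point, Unicode name) for space-like characters other than ' '.

    Reports characters in category Zs (space separators such as NO-BREAK SPACE)
    and Cf (invisible format characters such as LEFT-TO-RIGHT MARK).
    """
    hits = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if (category == "Zs" and ch != " ") or category == "Cf":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return hits


print(find_odd_spaces("La\u00a0Seu"))   # a no-break space hiding between the words
print(find_odd_spaces("La Seu"))        # ordinary spaces are not reported
```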

This is most important for the fields that are linked to authority files; it does not need to apply everywhere (for example, not in notes fields).
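Restricting the normalization to selected fields might look like the Python sketch below. Everything in it is an assumption made for illustration: the field names in `AUTHORITY_FIELDS`, the dict-shaped record, and the one-entry apostrophe table do not come from Muscat.

```python
# Hypothetical names of fields that link to authority files.
AUTHORITY_FIELDS = {"place", "standard_title"}

# Stand-in normalizer: only the curly apostrophe, to keep the example short.
_APOSTROPHES = str.maketrans({"\u2019": "'"})


def normalize_value(value):
    """Replace the typographic apostrophe with the ASCII one."""
    return value.translate(_APOSTROPHES)


def normalize_record(record, fields=AUTHORITY_FIELDS):
    """Return a copy of record with only the named string fields normalized."""
    return {
        key: normalize_value(value) if key in fields and isinstance(value, str) else value
        for key, value in record.items()
    }


record = {"place": "La Seu d\u2019Urgell", "note": "Copied as d\u2019Urgell from a website"}
print(normalize_record(record))   # "place" is normalized, "note" is left untouched
```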

    # Map of unwanted characters (or sequences) to their replacements.
    bad_chars = {
        '\t': ' ',
        # '': ' ',        # key character lost in the pasted snippet
        # '': '',         # Macintosh newline char? (key character lost)
        '\u00a0': ' ',    # U+00A0 NO-BREAK SPACE
        '\u200e': ' ',    # U+200E LEFT-TO-RIGHT MARK
        '‘': "'",         # U+2018 LEFT SINGLE QUOTATION MARK
        '’': "'",         # U+2019 RIGHT SINGLE QUOTATION MARK
        '´': "'",         # U+00B4 ACUTE ACCENT
        '′': "'",         # U+2032 PRIME
        '`': "'",         # U+0060 GRAVE ACCENT
        '\222': "'",      # U+0092 PRIVATE USE TWO
        '“': '"',         # U+201C LEFT DOUBLE QUOTATION MARK
        '”': '"',         # U+201D RIGHT DOUBLE QUOTATION MARK
        '<<': '«',
        '>>': '»',
        'l.l': 'l·l',
        'l•l': 'l·l',
        'l\225l': 'l·l',
        # '': '·',        # key character lost in the pasted snippet
        '–': '-',         # U+2013 EN DASH
        '—': '-',         # U+2014 EM DASH
        '‐': '-',         # U+2010 HYPHEN
    }


    def replace_bad_chars(line):
        """Return line with every key of bad_chars replaced by its value."""
        # An empty-string key must never be added to bad_chars:
        # str.replace('', x) inserts x between every character.
        for bad_char, replacement in bad_chars.items():
            line = line.replace(bad_char, replacement)
        return line


Normalize punctuation on input #1599