Open jenniferward opened 5 months ago
If it helps, here is the list of characters we systematically correct in our systems because we have found them in our records (still in Python 2; copy and paste may not have preserved some of them, but the comments may help):
bad_chars = {
'\t': u' ',
' ': u' ',
u'': '', # Macintosh newline char?
u'\u00a0': u' ', # Unicode 0xA0, NO-BREAK SPACE
u'\u200e': u' ', # Unicode 0x200E, LEFT-TO-RIGHT MARK
u'‘': u"'", # Unicode 0x2018, LEFT SINGLE QUOTATION MARK
u'’': u"'", # Unicode 0x2019, RIGHT SINGLE QUOTATION MARK
u'´': u"'", # Unicode 0xB4, ACUTE ACCENT
u'′': u"'", # Unicode 0x2032, PRIME
u'`': u"'", # Unicode 0x60, GRAVE ACCENT
u'\222': u"'", # Unicode 0x92: PRIVATE USE TWO
u'“': u'"',
u'”': u'"',
u'<<': u'«',
u'<<': u'«',
u'>>': u'»',
u'>>': u'»',
u'l.l': u'l·l',
u'l•l': u'l·l',
u'l\225l': u'l·l',
u'': u'·',
u'–': u'-', # Unicode 0x2013, EN DASH
u'—': u'-', # Unicode 0x2014, EM DASH
u'‐': u'-', # Unicode 0x2010, HYPHEN
}
def replace_bad_chars(line):
    for bad_char in bad_chars:
        if not bad_char:
            continue  # skip keys whose character was lost in copy and paste
        line = line.replace(bad_char, bad_chars[bad_char])
    return line
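For anyone who wants to try this on Python 3, here is a rough port of the list above. It is only a sketch: it includes just the characters that the comments or escapes identify unambiguously, writes them as escapes so copy and paste cannot alter them, and applies the multi-character replacements in a fixed order. The names `SINGLE_CHAR_FIXES` and `MULTI_CHAR_FIXES` are made up for this example, and the `l.l` rule is presumably specific to Catalan data.

```python
# Python 3 sketch of the mapping above; not the original implementation.
SINGLE_CHAR_FIXES = {
    "\t": " ",
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u200e": " ",  # LEFT-TO-RIGHT MARK (mapped to a space, as in the list above)
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u00b4": "'",  # ACUTE ACCENT
    "\u2032": "'",  # PRIME
    "`": "'",       # GRAVE ACCENT
    "\u0092": "'",  # PRIVATE USE TWO (u'\222' in the Python 2 list)
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u2010": "-",  # HYPHEN
}

# Multi-character patterns, applied in this order after the single characters.
MULTI_CHAR_FIXES = [
    ("<<", "\u00ab"),
    (">>", "\u00bb"),
    ("l.l", "l\u00b7l"),
    ("l\u2022l", "l\u00b7l"),
    ("l\u0095l", "l\u00b7l"),  # u'l\225l' in the Python 2 list
]

_TABLE = str.maketrans(SINGLE_CHAR_FIXES)


def replace_bad_chars(line: str) -> str:
    """Replace single characters in one pass, then the longer patterns."""
    line = line.translate(_TABLE)
    for bad, good in MULTI_CHAR_FIXES:
        line = line.replace(bad, good)
    return line
```

Using `str.translate` for the single characters avoids one `replace` call per entry, and keeping the longer patterns in a list makes their order explicit instead of depending on dictionary ordering.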
Some characters need to be normalized (smart quotes vs. apostrophes) but some need to be allowed (u vs ü). Currently, ’ and ' are read as different punctuation marks. This causes misalignment in city names in Institutions:
La Seu d’Urgell https://rism.online/institutions/30079707
La Seu d'Urgell https://rism.online/institutions/30005481
and duplicates in Titles/Texts:
https://muscat.rism.info/admin/standard_titles?utf8=%E2%9C%93&q%5Btitle_equals%5D=Au+sein+des+alarmes+l%E2%80%99amour+a+des+charmes&commit=Filter&order=id_desc
Au sein des alarmes l’amour a des charmes
Au sein des alarmes l'amour a des charmes
This arises especially when copying from websites or data imports. The problem has been solved on the search side (see https://github.com/rism-digital/muscat/issues/622 ) but not on the input side.
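To illustrate why the two spellings do not match, here is a small Python sketch. It shows that ’ and ' are different code points, that standard Unicode normalization (NFKC) does not merge them, and that an explicit mapping does while leaving letters such as ü alone. The function name `fold_apostrophes` is made up for this example and is not part of Muscat.

```python
import unicodedata

curly = "La Seu d\u2019Urgell"   # typed with RIGHT SINGLE QUOTATION MARK
plain = "La Seu d'Urgell"        # typed with the ASCII apostrophe

# The two strings differ in exactly one code point.
for ch in ("\u2019", "'"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Unicode normalization alone does not unify the two marks.
nfkc_still_differs = unicodedata.normalize("NFKC", curly) != plain

# An explicit mapping is needed for the quotes.
APOSTROPHES = str.maketrans({"\u2018": "'", "\u2019": "'"})


def fold_apostrophes(text: str) -> str:
    """Map curly single quotes to the ASCII apostrophe; leave all else as is."""
    return text.translate(APOSTROPHES)


print(nfkc_still_differs)
print(fold_apostrophes(curly) == plain)
print(fold_apostrophes("Z\u00fcrich"))  # ü is untouched
```

The point of the mapping being explicit is that only the listed punctuation is changed, so characters that must stay distinct (u vs ü) are never affected.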
I can think of the following:
’ and '
" " and “ ”
- – — (hyphen, en dash, em dash)
For the dashes, only one is needed (the plain hyphen, I think?) in the standardized fields.
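For the dashes, one possible approach (a sketch, not something Muscat does today as far as this thread says) is to fold every character in the Unicode "Dash punctuation" category (`Pd`) to the plain hyphen, instead of listing each dash by hand. The function name `fold_dashes` is hypothetical. The same trick is less suitable for quotes, because the quote categories also contain the guillemets « and ».

```python
import unicodedata


def fold_dashes(text: str) -> str:
    """Replace every character of Unicode category Pd with '-'."""
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch
        for ch in text
    )


print(fold_dashes("1750\u20131800"))        # en dash between the years
print(fold_dashes("a\u2014b\u2010c-d"))     # em dash, HYPHEN, ASCII hyphen
```

Note that `Pd` also covers dashes from other scripts, so this is broader than the three characters listed above; the mathematical MINUS SIGN (U+2212) is not in `Pd` and is left unchanged.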
What about spaces? They sometimes behave strangely (Excel doesn't always read the spaces as spaces), but I can't describe it further than that.
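One way to pin down the space problem is to list which characters in a value look like spaces but are not the ordinary space. The sketch below reports the position, code point, and Unicode name of every separator, control, or invisible formatting character it finds. The function name `find_odd_spaces` is made up for this example; which characters actually occur in the records is not known here, although NO-BREAK SPACE is a common suspect in spreadsheet exports.

```python
import unicodedata

# Categories: space/line/paragraph separators, controls, invisible format chars.
SUSPECT_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}


def find_odd_spaces(text: str):
    """Return (index, code point, name) for each space-like char except U+0020."""
    hits = []
    for index, ch in enumerate(text):
        if ch == " ":
            continue  # the ordinary space is fine
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            # Control characters have no Unicode name, hence the default.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


print(find_odd_spaces("La\u00a0Seu d'Urgell"))
print(find_odd_spaces("La Seu d'Urgell"))
```

Running something like this over a problematic export would turn "acts strangely" into a concrete list of code points that could then be added to the normalization.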
This is most important for the fields that are linked to authority files; it is not needed everywhere (for example, not in notes fields).
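As a sketch of that scoping, normalization could be applied only to a chosen set of fields and skip everything else. The record layout and the field names `place` and `note` below are purely hypothetical and do not reflect how Muscat stores its data.

```python
def normalize_linked_fields(record, linked_fields, normalize):
    """Return a copy of record with normalize() applied only to linked_fields."""
    return {
        name: normalize(value)
        if name in linked_fields and isinstance(value, str)
        else value
        for name, value in record.items()
    }


def fold_apostrophe(text):
    # Minimal stand-in for a fuller character normalization.
    return text.replace("\u2019", "'")


# Hypothetical record: one authority-linked field and one free-text note.
record = {"place": "La Seu d\u2019Urgell", "note": "Copied as \u2018found\u2019"}
cleaned = normalize_linked_fields(record, {"place"}, fold_apostrophe)
print(cleaned)
```

The notes field keeps its original typography, while the linked field is folded so that it matches the existing authority entry.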