rism-digital / muscat

🗂️ A Rails application for the inventory of handwritten and printed music scores
http://muscat-project.org

Normalize punctuation on input #1599

Open jenniferward opened 5 months ago

jenniferward commented 5 months ago

Some characters need to be normalized (smart quotes vs. apostrophes), but others need to remain distinct (u vs. ü). Currently, ’ and ' are read as different punctuation marks. This causes misalignment in city names in Institutions:

La Seu d’Urgell https://rism.online/institutions/30079707
La Seu d'Urgell https://rism.online/institutions/30005481

and duplicates in Titles/Texts:

https://muscat.rism.info/admin/standard_titles?utf8=%E2%9C%93&q%5Btitle_equals%5D=Au+sein+des+alarmes+l%E2%80%99amour+a+des+charmes&commit=Filter&order=id_desc
Au sein des alarmes l’amour a des charmes
Au sein des alarmes l'amour a des charmes
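To make the mismatch concrete, here is a minimal Python sketch (not Muscat code; `describe`, `QUOTE_MAP` and `normalize_quotes` are names made up for illustration). It shows that the two spellings differ by a single code point, that Unicode NFKC normalization alone does not merge them, and that an explicit mapping can fix the quotes while leaving ü untouched.

```python
import unicodedata

curly = "La Seu d\u2019Urgell"   # typed with RIGHT SINGLE QUOTATION MARK
plain = "La Seu d'Urgell"        # typed with APOSTROPHE

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

print(describe(curly[8]))  # U+2019 RIGHT SINGLE QUOTATION MARK
print(describe(plain[8]))  # U+0027 APOSTROPHE
print(curly == plain)      # False

# NFKC has no mapping from U+2019 to U+0027, so it does not help here.
print(unicodedata.normalize("NFKC", curly) == plain)  # False

# Hypothetical explicit mapping: only the curly single quotes are rewritten.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})

def normalize_quotes(text):
    """Replace curly single quotes with the plain apostrophe."""
    return text.translate(QUOTE_MAP)

print(normalize_quotes(curly) == plain)  # True
print(normalize_quotes("Z\u00fcrich"))   # Zürich (ü is preserved)
```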

This arises especially when copying from websites or during data imports. The problem has been solved for searching (see https://github.com/rism-digital/muscat/issues/622 ) but not on the input side.

I can think of the following:

For the dashes, only one is needed (the dash I think?) in the standardized fields.

What about spaces? Sometimes that acts strangely (Excel doesn't always read the spaces as spaces) but I can't describe it further than that.
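For dashes and spaces, one possible approach is sketched below in Python. It is an assumption, not something Muscat does: `normalize_dashes` treats every character in the Unicode category Pd (dash punctuation) as a dash, which is broader than just the en dash and em dash and may be too aggressive for some scripts, and `normalize_spaces` relies on the fact that `\s` in a Python 3 `str` pattern also matches NO-BREAK SPACE, a common source of "spaces that are not spaces" in spreadsheet exports.

```python
import re
import unicodedata

def normalize_dashes(text):
    """Replace every character of Unicode category Pd with a plain '-'."""
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch
        for ch in text
    )

def normalize_spaces(text):
    """Collapse runs of Unicode whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(normalize_dashes("1\u20133 \u2014 x\u2010y"))   # 1-3 - x-y
print(normalize_spaces("Bad\u00a0 Homburg "))         # Bad Homburg
```

Invisible formatting characters such as LEFT-TO-RIGHT MARK (U+200E) are not whitespace and are left alone by this sketch; they would need an explicit rule.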

This is most important for the fields that are linked to authority files, not everywhere (for example, not in notes fields).
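Restricting normalization to authority-linked fields could look roughly like the following Python sketch. The field names in `AUTHORITY_FIELDS` and the record layout are hypothetical and chosen only for illustration; they are not taken from Muscat.

```python
# Hypothetical field names; a real system would list its own linked fields.
AUTHORITY_FIELDS = {"place", "standard_title"}

PUNCT_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})

def normalize_record(record):
    """Return a copy of record, normalizing only authority-linked fields."""
    return {
        field: value.translate(PUNCT_MAP) if field in AUTHORITY_FIELDS else value
        for field, value in record.items()
    }

record = {
    "place": "La Seu d\u2019Urgell",
    "note": "Copied as found: \u2018sic\u2019",
}
cleaned = normalize_record(record)
print(cleaned["place"])                    # La Seu d'Urgell
print(cleaned["note"] == record["note"])   # True, notes are left as entered
```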

fjorba commented 4 months ago

If it helps, the list of the characters we systematically correct in our systems, because we have found them in our records, is this one (still in Python2; maybe copy and paste hasn't respected some of them, but the comment may help):

bad_chars = {
    '\t': u' ',
    '\n': u' ',
    u'\r': '', # Macintosh newline char?
    u'\xa0': u' ', # Unicode 0xA0, NO-BREAK SPACE
    u'\u200e': u' ', # Unicode 0x200E, LEFT-TO-RIGHT MARK
    u'‘': u"'", # Unicode 0x2018, LEFT SINGLE QUOTATION MARK
    u'’': u"'", # Unicode 0x2019, RIGHT SINGLE QUOTATION MARK                   
    u'´': u"'", # Unicode 0xB4, ACUTE ACCENT                                    
    u'′': u"'", # Unicode 0x2032, PRIME                                         
    u'`': u"'", # Unicode 0x60, GRAVE ACCENT                                    
    u'\222': u"'", # Unicode 0x92: PRIVATE USE TWO                              
    u'“': u'"',
    u'”': u'"',
    u'<<': u'«',
    u'&lt;&lt;': u'«',
    u'>>': u'»',
    u'&gt;&gt;': u'»',
    u'l.l': u'l·l',
    u'l•l': u'l·l',
    u'l\225l': u'l·l',
    u'&#61655;': u'·',
    u'–': u'-', # Unicode 0x2013, EN DASH                                       
    u'—': u'-', # Unicode 0x2014, EM DASH                                       
    u'‐': u'-', # Unicode 0x2010, HYPHEN                                        
}

def replace_bad_chars(line):
    for bad_char in bad_chars:
        line = line.replace(bad_char, bad_chars[bad_char])
    return line
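A possible Python 3 port of the same idea is sketched below, covering only a subset of the table above. It is an illustration under assumptions rather than the code actually in use: single characters go through `str.translate` in one pass, and multi-character sequences are kept in an ordered list so that the replacement order is explicit. The names `CHAR_MAP` and `SEQUENCES` are made up for this example.

```python
# Single-character replacements, applied in one pass with str.translate.
CHAR_MAP = str.maketrans({
    "\t": " ",
    "\n": " ",
    "\r": "",
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
    "\u2010": "-",   # HYPHEN
})

# Multi-character sequences, applied in list order after the single characters.
SEQUENCES = [
    ("<<", "\u00ab"),         # << becomes «
    (">>", "\u00bb"),         # >> becomes »
    ("l.l", "l\u00b7l"),      # Catalan l·l written with a full stop
]

def replace_bad_chars(line):
    """Normalize punctuation and spacing in one line of text."""
    line = line.translate(CHAR_MAP)
    for bad, good in SEQUENCES:
        line = line.replace(bad, good)
    return line

print(replace_bad_chars("Au sein des alarmes l\u2019amour a des charmes"))
# Au sein des alarmes l'amour a des charmes
print(replace_bad_chars("<<col.legi>>\u00a0x"))
# «col·legi» x
```

Note that the `l.l` rule rewrites every occurrence of that sequence, so it only makes sense for fields where Catalan text is expected.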