sportdb / sport.db

sport.db - open sports database (e.g. football.db, formula1.db etc.) command line tool and libraries
Creative Commons Zero v1.0 Universal

Potential faster gsub for unaccenting #9

Closed by ioquatix 5 years ago

ioquatix commented 5 years ago

UNACCENT = {
  'Ä'=>'A',  'ä'=>'a',
  'Á'=>'A',  'á'=>'a',
  'É'=>'E',  'é'=>'e',
  'Í'=>'I',  'í'=>'i',
             'ï'=>'i',
  'Ñ'=>'N',  'ñ'=>'n',
  'Ö'=>'O',  'ö'=>'o',
  'Ó'=>'O',  'ó'=>'o',
             'ß'=>'ss',
  'Ü'=>'U',  'ü'=>'u',
  'Ú'=>'U',  'ú'=>'u',
}

PATTERN = Regexp.union(UNACCENT.keys)
def unaccent_gsub(text, mapping)
  text.gsub(PATTERN, mapping)
end

text = "Apples and AÄÁaäá EÉeé IÍiíï NÑnñ OÖÓoöó Ssß UÜÚuüú"

puts unaccent_gsub(text, UNACCENT)

ioquatix commented 5 years ago

mapping is passed as an argument, but PATTERN is built once from UNACCENT, so a different mapping would still be matched against the UNACCENT keys. In theory, building the pattern should probably move into the function. Whether that matters depends on whether mapping is actually constant or not.

If the mapping is not constant, I'd suggest a class instance per mapping, to cache the PATTERN.
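
A minimal sketch of both options, assuming the mapping can differ between callers. The names unaccent_gsub_dynamic and Unaccenter are hypothetical and are not taken from sport.db:

```ruby
# Hypothetical names for illustration; not part of sport.db.

# Option 1: build the pattern from the mapping that is actually passed in.
# Pattern and mapping always agree, but Regexp.union runs on every call.
def unaccent_gsub_dynamic(text, mapping)
  text.gsub(Regexp.union(mapping.keys), mapping)
end

# Option 2: one instance per mapping; the pattern is built once and reused.
class Unaccenter
  def initialize(mapping)
    @mapping = mapping.dup.freeze            # guard against later mutation
    @pattern = Regexp.union(@mapping.keys)   # cached for all calls
  end

  def unaccent(text)
    # String#gsub with a Hash replaces each match with the hash value for it.
    text.gsub(@pattern, @mapping)
  end
end

german = { 'Ä'=>'A', 'ä'=>'a', 'Ö'=>'O', 'ö'=>'o', 'Ü'=>'U', 'ü'=>'u', 'ß'=>'ss' }
unaccenter = Unaccenter.new(german)

puts unaccent_gsub_dynamic("Köln", german)
puts unaccenter.unaccent("Ärger über Straße")
```

If there is only ever one mapping, the constant PATTERN from the original snippet does the same job as the instance variable here; the class only pays off once more than one mapping is in play.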

geraldb commented 5 years ago

Good point. I added your optimization in unaccent_gsub_3b and updated the benchmark and readme. Thanks. Cheers.