glossarist / iev-data


Move text normalization features to String class #87

Closed: skalee closed this issue 3 years ago

skalee commented 3 years ago

Text imported from spreadsheets is often polluted with special Unicode characters such as non-breaking spaces or zero-width whitespace characters. It also contains HTML entities that need to be decoded. Right now we handle this in several places, for example:

https://github.com/glossarist/iev-data/blob/d6be663241a9ed41e57ab8d08f8b93488a38cfce/lib/iev/termbase/term_builder.rb#L696

or:

https://github.com/glossarist/iev-data/blob/d6be663241a9ed41e57ab8d08f8b93488a38cfce/lib/iev/termbase/term_builder.rb#L49-L53

This is messy, untestable, and error-prone. Instead, let's refine the String class and add normalization methods to it, so the cleanup logic lives in one place and can be tested on its own.
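
A minimal sketch of what this could look like, assuming "refine" means Ruby refinements. The names `StringNormalization`, `sanitize`, and `TermCleaner` are hypothetical and not taken from the iev-data codebase, and the set of characters stripped here is only an illustrative guess at what the spreadsheets contain:

```ruby
require "cgi"

# Hypothetical module; the name and the character set are assumptions.
module StringNormalization
  NBSP = "\u00A0"
  # Zero-width space, non-joiner, joiner, word joiner, and BOM.
  ZERO_WIDTH = /[\u200B\u200C\u200D\u2060\uFEFF]/

  refine String do
    # Decode HTML entities, drop zero-width characters, turn
    # non-breaking spaces into regular spaces, and trim the ends.
    def sanitize
      CGI.unescapeHTML(self).gsub(ZERO_WIDTH, "").tr(NBSP, " ").strip
    end
  end
end

# Hypothetical caller. Refinements are lexically scoped, so String only
# gains `sanitize` in scopes that opt in with `using`.
module TermCleaner
  using StringNormalization

  def self.clean(raw)
    raw.sanitize
  end
end

TermCleaner.clean("\uFEFF voltage\u00A0&amp;\u200B current\u00A0")
# => "voltage & current"
```

Because the refinement is only active where `using` appears, the rest of the program sees an unmodified String class, and the normalization can be unit tested through a small module like the one above.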

skalee commented 3 years ago

Fixed in #92.