gucorpling / amalgum

English web corpus with 4M tokens and several annotation types

Need process to eliminate bad spaces #4

Open amir-zeldes opened 4 years ago

amir-zeldes commented 4 years ago

Some abbreviations are spelled with inconsistent whitespace, for example "e. g." written with a space after the first period. The tokenizer should have a way to remove such spaces based on a list of abbreviations kept in a file, possibly producing an annotation that records the original spelling (maybe sic+hi@rend="x-space"):

e.g.

Or adding an attribute with the original spelling (could do , though that is not really TEI)
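
A minimal sketch of what such a list-driven cleanup step could look like, in Python. Everything named here is an assumption for illustration: `ABBREVIATIONS`, `abbreviation_regex` and `normalize_abbreviations` are hypothetical and not part of the existing tokenizer, and the list is hard-coded where a real version would load it from a file. The sketch only returns the original spellings alongside the cleaned text; how they get serialized (the sic/hi markup or an attribute) is left open.

```python
import re

# Hypothetical abbreviation list; in practice this would be read from a file.
ABBREVIATIONS = ["e.g.", "i.e."]


def abbreviation_regex(abbr):
    """Turn 'e.g.' into a pattern that also matches 'e. g.'."""
    parts = [re.escape(p) for p in abbr.split(".")]
    out = parts[0]
    for part in parts[1:]:
        # Only a period that is followed by more letters may carry stray spaces.
        out += r"\." + (r"[ \t]*" + part if part else "")
    return out


def normalize_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Remove spaces inside listed abbreviations.

    Returns the cleaned text and a list of (normalized, original) pairs,
    one per occurrence that was actually changed.
    """
    alternatives = "|".join(abbreviation_regex(a) for a in abbreviations)
    pattern = re.compile(r"\b(?:" + alternatives + ")", re.IGNORECASE)
    records = []

    def fix(match):
        original = match.group(0)
        normalized = re.sub(r"[ \t]+", "", original)
        if normalized != original:
            # Keep the original spelling so it can be annotated later.
            records.append((normalized, original))
        return normalized

    return pattern.sub(fix, text), records


cleaned, changes = normalize_abbreviations(
    "Use a list, e. g. a text file, i.e. one per line."
)
print(cleaned)
print(changes)
```

The recorded pairs would be the input to whatever annotation is chosen, so the cleanup and the markup decision can stay separate.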