tsproisl / SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.
GNU General Public License v3.0

Phone number and sad emoji #12

Closed max-otto closed 4 years ago

max-otto commented 4 years ago

Hey,

I found a small bug: the regex self.space_emoticon = re.compile(r'([:;])[ ]+([()])') can match some German telephone number formats such as 'Tel: ( 0049)' and 'Tel: (+49)'. In my code I fixed this by adding a negative lookahead, but I can't really test whether this breaks something elsewhere. So mine right now is: self.space_emoticon = re.compile(r'([:;])[ ]+([()])(?! *[\+0])')
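To illustrate the difference, here is a minimal sketch that runs both patterns against the two phone number formats and two spaced-out emoticons. The names OLD_SPACE_EMOTICON and NEW_SPACE_EMOTICON and the two emoticon sample strings are assumptions made for this example; the sketch only checks whether each pattern matches and does not reproduce how SoMaJo uses the pattern internally.

```python
import re

# Hypothetical names for this sketch; the patterns are the two quoted above.
OLD_SPACE_EMOTICON = re.compile(r'([:;])[ ]+([()])')
NEW_SPACE_EMOTICON = re.compile(r'([:;])[ ]+([()])(?! *[\+0])')

# The two phone formats come from the report; the emoticon strings are made up.
samples = ["Tel: ( 0049)", "Tel: (+49)", "so sad : (", "nice ; )"]

for text in samples:
    old_hit = OLD_SPACE_EMOTICON.search(text) is not None
    new_hit = NEW_SPACE_EMOTICON.search(text) is not None
    # The lookahead rejects a bracket that is followed by optional
    # spaces and then "+" or "0", which is how the phone prefixes start.
    print(f"{text!r}: old pattern matches={old_hit}, new pattern matches={new_hit}")
```

With these inputs, the old pattern matches all four strings, while the pattern with the lookahead matches only the two emoticon strings. Note that the lookahead would also skip an emoticon that happens to be directly followed by a 0 or a +.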

Btw, thank you for this amazing tool. I use it very often. I really like that if there's something I don't understand, I can just climb down to the regex. Some more options would be nice, though; I guess one day I'll have to send a pull request ;)

tsproisl commented 4 years ago

Thank you! Pull requests are always very welcome :wink:!

I've modified the regular expression as you suggested, and so far I haven't observed any negative side effects.