huggingface / pytorch-openai-transformer-lm

🐥A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI
MIT License
1.51k stars 285 forks source link

Potentially incorrect regex in text_utils.py #48

Open schmmd opened 5 years ago

schmmd commented 5 years ago

Hi, we have some of your regex's in AllenNLP and Python has been warning us about them for a while.

https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/text_utils.py#L30

'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''<input>:1: DeprecationWarning: invalid escape sequence \?
<input>:1: DeprecationWarning: invalid escape sequence \?
<input>:1: DeprecationWarning: invalid escape sequence \?
In [38]: '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
<input>:1: DeprecationWarning: invalid escape sequence \?
<input>:1: DeprecationWarning: invalid escape sequence \?
<input>:1: DeprecationWarning: invalid escape sequence \?
<ipython-input-38-9a7773b0447c>:1: DeprecationWarning: invalid escape sequence \?

In fixing them I looked to your implementation and noticed you prefixed the expressions with r so they are raw strings (presumably to fix the same warnings). However, I think this actually changed one of your regexes to something other than was intended.

# Before
$ '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
'(-+|~+|!+|"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)'

#After
$ r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
'(-+|~+|!+|"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)'

$ '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''' == r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
False

The switch to raw strings changed '|\\+' (one or more backslashes) to |\\\\+ (two or more backslashes). I think you actually want the following regex.

r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''

$ r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''' == '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
True