stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Twitter preprocess script regex #107

Open thundo opened 6 years ago

thundo commented 6 years ago

I may be mistaken, but isn't this regex to detect smileys from the Twitter preprocess script bugged?

.gsub(/#{eyes}#{nose}[)d]+|[)d]+#{nose}#{eyes}/i, "<SMILE>")

Shouldn't it be?

.gsub(/#{eyes}#{nose}[)d]+|[(d]+#{nose}#{eyes}/i, "<SMILE>")

(note the direction of the round bracket for the "mouth" in the second alternative)
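
A quick check of the asymmetry (a minimal Python sketch; the eyes/nose classes are taken from the script's definitions, so treat the exact strings as an approximation):

import re

eyes, nose = r"[8:=;]", r"['`\-]?"
original  = re.compile(eyes + nose + r"[)d]+" + '|' + r"[)d]+" + nose + eyes, re.I)
corrected = re.compile(eyes + nose + r"[)d]+" + '|' + r"[(d]+" + nose + eyes, re.I)

print(bool(original.search('(-:')))   # False - the reversed smiley is missed
print(bool(original.search(')-:')))   # True  - a reversed sad face matches as a smile instead
print(bool(corrected.search('(-:')))  # True  - the corrected class catches the reversed smiley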

skondrashov commented 6 years ago

I opened a new issue for this, then realized there were already open issues. Copying my post (mostly redundant with yours):

.gsub(/#{eyes}#{nose}[)d]+|[)d]+#{nose}#{eyes}/i, "<SMILE>")

This line's second part matches on )-: instead of (-:, and since this comes before the <SADFACE> section, it means every inverted sadface is actually tokenized as a <SMILE>. I'm not sure whether the pretrained vectors were made with this mistake, and even if they were, it probably doesn't matter much, but mistakes in the tokenizer can propagate through the whole algorithm, so it's worrying to see them.

For this line specifically, I recommend:

.gsub(/#{eyes}#{nose}[)D]+|\(+#{nose}#{eyes}/, "<SMILE>")

Getting rid of the case insensitivity and adding a capital D to only the left-to-right smiley makes sense here. :-d and D-: are hardly smiles!
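
To illustrate the suggested change (a minimal Python sketch; again the eyes/nose classes are assumed from the script):

import re

eyes, nose = r"[8:=;]", r"['`\-]?"
suggested = re.compile(eyes + nose + r"[)D]+" + '|' + r"\(+" + nose + eyes)  # no /i flag

for s in [':-)', ':-D', '(-:', ':-d', 'D-:']:
    print(s, bool(suggested.search(s)))
# :-) :-D (-: match; :-d and D-: no longer do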

I feel silly posting issues about smiley tokenization, but for people trying to use the pretrained Twitter vectors, who have to use the tokenizer to get matching results, it's not really clear whether to fix the tokenizer or to accept that the pretrained vectors are (slightly) wrong and use the broken tokenizer to match them. As far as I understand, using the pretrained vectors makes the most sense with the linked tokenizer, because it was the one used during training (strong assumption, please correct me if I'm wrong). I can post a fully corrected regex with the list of issues I've found if this generates interest and this is the right place to open the issue (as well as a Python version, which I'm sure would be helpful to someone).

skondrashov commented 6 years ago

@manning it would be very helpful to know whether this tokenizer was the one used to train the Twitter vectors, and which one was used if not!

thundo commented 6 years ago

Assuming that the vectors were trained with the "wrong" regex, I don't know if there's interest in correcting a widely used dataset. It may give (even slightly) inconsistent results if not properly tagged/versioned. I would welcome such a move, but I can't speak for the paper authors ;)

@tkondrashov glad I'm not the only one. Btw I would argue that smiles take many forms :-d What other issues did you find?

skondrashov commented 6 years ago

There's a bunch of minor/preferential stuff in the smileys, like missing ] mouths, but also some potentially more important ones:

1) The hashtag phrase splitting loses information. Hashtags function more like parentheses than anything, encapsulating the whole phrase that is a hashtag. It seems to me that the whole function is just misguided - <HASHTAG> womancrushwednesday is a more meaningful tokenization than <HASHTAG> woman crush wednesday. The best solution is to add an <ENDHASHTAG> to capture the meaning most precisely, but you would have to train your own vectors if you added that token.

2) Punctuation doesn't get split off at all as far as I can tell. The last word of a punctuated sentence will have its meaning entirely overlooked because it looks like word? instead of word ?.

3) Issue #121 mentions a couple more; the allcaps splitting is particularly bad. this sentencE consisTS OF WEiRD caPITalIZAtion becomes: this <allcaps> sentenc e <allcaps> consis ts of we <allcaps> i rd <allcaps> ca pit <allcaps> al iza <allcaps> tion. I'm not sure the allcaps token really captures any meaning, so I'm probably just going to get rid of it altogether in my use rather than try to fix that portion.

4) The <ELONG> token regex doesn't work at all for me, but that seems like too big an oversight, so maybe it's just something about porting it to Python. I'd double-check that it's working in Ruby though. EDIT: Yeah, I had a mistake in my code unrelated to the regex. However, it still didn't seem to work right, and (\S*?)(\w)\2+\b seems to work better.

5) The <REPEAT> regex doesn't distinguish between ....!!!!!!!!! and ........... I think the first one should result in . <REPEAT> ! <REPEAT> rather than just . <REPEAT>, which can easily be done with ([!?.])\1+ instead of ([!?.]){2,} (see the sketch after this list).
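
A minimal sketch of points 4 and 5 in plain Python re (the patterns are the ones mentioned above):

import re

# <REPEAT>: the original pattern collapses a mixed run into a single token,
# the backreference version keeps one token per distinct mark
print(re.sub(r'([!?.]){2,}', r'\1 <repeat>', 'wahoo....!!!'))  # wahoo! <repeat>
print(re.sub(r'([!?.])\1+',  r'\1 <repeat>', 'wahoo....!!!'))  # wahoo. <repeat>! <repeat>

# <ELONG>: the suggested pattern trims word-final letter repetition
print(re.sub(r'(\S*?)(\w)\2+\b', r'\1\2 <elong>', 'slayyyyyy heyyy'))  # slay <elong> hey <elong>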

A couple of other thoughts:

1) This is in the pretrained vectors, which isn't directly related to what I'm talking about, but a cautionary tale: color="<hashtag>

2) Replacing all numbers with <NUMBER> obfuscates things like 911, which can contribute a lot to meaning. I don't know the reason for not keeping the number and still adding a number token, e.g. bush did 9/11 becomes bush did 9 <NUMBER> / 11 <NUMBER> (see the sketch below), but it seems like it would capture more information, and you could get rid of the number tokens that don't appear frequently as an optimization during training instead of just never looking at them to begin with.
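
For the number idea, a minimal sketch (plain Python re; the simple \d+ pattern is just for illustration, not the script's number regex):

import re

print(re.sub(r'\d+', lambda m: m.group() + ' <number>', 'bush did 9/11'))
# bush did 9 <number>/11 <number>  (the '/' would then be split off by the punctuation step)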

skondrashov commented 6 years ago

import re

X = [
        u'http://foo.com/blah_blah http://foo.com/blah_blah/ http://foo.com/blah_blah_(wikipedia) https://foo_bar.example.com/',
        u':\\ :-/ =-( =`( )\'8 ]^;',
        u':) :-] =`) (\'8 ;`)',
        u':D :-D =`b d\'8 ;`P',
        u':| 8|',
        u'<3<3 <3 <3',
        u'#swag #swa00-= #as ## #WOOP #Feeling_Blessed #helloWorld',
        u'holy crap!! i won!!!!@@!!!',
        u'holy *IUYT$)(crap!! @@#i@%#@ swag.lord **won!!!!@@!!! wahoo....!!!??!??? Im sick lol.',
        u'this sentencE consisTS OF slAyYyyy slayyyyyy WEiRDd caPITalIZAtionn',
    ]

def sub(pattern, output, string, whole_word=False):
    token = output
    if whole_word:
        pattern = r'(\s|^)' + pattern + r'(\s|$)'

    if callable(output):
        # wrap the callback so the replacement is padded with spaces
        token = lambda match: ' ' + output(match) + ' '
    else:
        token = ' ' + output + ' '

    return re.sub(pattern, token, string)

def hashtag(token):
    # split a camelCase/mixed-case hashtag into words; leave all-caps tags intact
    token = token.group('tag')
    if token != token.upper():
        token = ' '.join(re.findall('[a-zA-Z][^A-Z]*', token))

    return '<hashtag> ' + token + ' <endhashtag>'

def punc_repeat(token):
    # collapse a run of repeated punctuation into "<mark> <repeat>"
    return token.group(0)[0] + " <repeat>"

def punc_separate(token):
    # re-emit punctuation as-is; the sub() wrapper pads it with spaces
    return token.group()

def number(token):
    # keep the number itself and append a <number> token after it
    return token.group() + ' <number>'

def word_end_repeat(token):
    # trim repeated trailing letters (e.g. "heyyy") and append <elong>
    return token.group(1) + token.group(2) + ' <elong>'

eyes        = r"[8:=;]"
nose        = r"['`\-\^]?"
sad_front   = r"[(\[/\\]+"
sad_back    = r"[)\]/\\]+"
smile_front = r"[)\]]+"
smile_back  = r"[(\[]+"
lol_front   = r"[DbpP]+"
lol_back    = r"[d]+"
neutral     = r"[|]+"
sadface     = eyes + nose + sad_front   + '|' + sad_back   + nose + eyes
smile       = eyes + nose + smile_front + '|' + smile_back + nose + eyes
lolface     = eyes + nose + lol_front   + '|' + lol_back   + nose + eyes
neutralface = eyes + nose + neutral     + '|' + neutral    + nose + eyes
punctuation = r"""[ '!"#$%&'()+,/:;=?@_`{|}~\*\-\.\^\\\[\]]+""" ## < and > omitted to avoid messing up tokens

for tweet in X:
    tweet = sub(r'[\s]+',                             '  ',            tweet) # ensure 2 spaces between everything
    tweet = sub(r'(?:(?:https?|ftp)://|www\.)[^\s]+', '<url>',         tweet, True)
    tweet = sub(r'@\w+',                              '<user>',        tweet, True)
    tweet = sub(r'#(?P<tag>\w+)',                     hashtag,         tweet, True)
    tweet = sub(sadface,                              '<sadface>',     tweet, True)
    tweet = sub(smile,                                '<smile>',       tweet, True)
    tweet = sub(lolface,                              '<lolface>',     tweet, True)
    tweet = sub(neutralface,                          '<neutralface>', tweet, True)
    tweet = sub(r'(?:<3+)+',                          '<heart>',       tweet, True)
    tweet = tweet.lower()
    tweet = sub(r'[-+]?[.\d]*[\d]+[:,.\d]*',          number,          tweet, True)
    tweet = sub(punctuation,                          punc_separate,   tweet)
    tweet = sub(r'([!?.])\1+',                        punc_repeat,     tweet)
    tweet = sub(r'(\S*?)(\w)\2+\b',                   word_end_repeat, tweet)

    tweet = tweet.split()
    print(' '.join(tweet))

This is my version of the tokenization in Python; hope it's useful to someone.