stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.86k stars 1.51k forks

Twitter preprocessing script #121

Open gombru opened 6 years ago

gombru commented 6 years ago

I wanted to use the Twitter preprocessing script at https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb and found two bugs in it:

  1. URLs without http are not matched.
  2. The last gsub splits words containing capital letters where it should not, and adds the word where it should not.

I think the script has not been tested and is probably not the one that was used to train the model, as discussed here: https://groups.google.com/forum/#!searchin/globalvectors/preprocessing|sort:date/globalvectors/_X7hQBBuoLY/2ysMo1sWCQAJ

This is my first time using Ruby, but I've fixed those two bugs:

```ruby
# Expects to be run with `ruby -n`, which reads the input line by line into $_.
def tokenize input

  # Different regex parts for smiley faces
  eyes = "[8:=;]"
  nose = "['`\-]?"

  input = input
    .gsub(/https?:\/\/\S+\b|www\.(\w+\.)+\S*/,"<URL>")
    .gsub(/www\.(\w+\.)+\S*/,"<URL>") # gombru: handle URLs without http
    .gsub("/"," / ") # Force splitting words appended with slashes (once we tokenized the URLs, of course)
    .gsub(/@\w+/, "<USER>")
    .gsub(/#{eyes}#{nose}[)d]+|[)d]+#{nose}#{eyes}/i, "<SMILE>")
    .gsub(/#{eyes}#{nose}p+/i, "<LOLFACE>")
    .gsub(/#{eyes}#{nose}\(+|\)+#{nose}#{eyes}/, "<SADFACE>")
    .gsub(/#{eyes}#{nose}[\/|l*]/, "<NEUTRALFACE>")
    .gsub(/<3/,"<HEART>")
    .gsub(/[-+]?[.\d]*[\d]+[:,.\d]*/, "<NUMBER>")
    .gsub(/#\S+/){ |hashtag| # Split hashtags on uppercase letters
      # TODO: also split hashtags with lowercase letters (requires more work to detect splits...)

      hashtag_body = hashtag[1..-1]
      if hashtag_body.upcase == hashtag_body
        result = "<HASHTAG> #{hashtag_body} <ALLCAPS>"
      else
        result = (["<HASHTAG>"] + hashtag_body.split(/(?=[A-Z])/)).join(" ")
      end
      result
    }
    .gsub(/([!?.]){2,}/){ # Mark punctuation repetitions (eg. "!!!" => "! <REPEAT>")
      "#{$~[1]} <REPEAT>"
    }
    .gsub(/\b(\S*?)(.)\2{2,}\b/){ # Mark elongated words (eg. "wayyyy" => "way <ELONG>")
      # TODO: determine if the end letter should be repeated once or twice (use lexicon/dict)
      $~[1] + $~[2] + " <ELONG>"
    }
    .gsub(/([^a-z0-9()<>'`\-]){1,}/){ |word|
      "#{word.downcase}" # gombru: Fixed bug, Downcasing all
    }

  return input

end

puts tokenize($_)
```
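
To check individual substitutions without running the whole pipeline, a few of the steps can be pulled out and tried on sample strings. This is only a rough sketch: the helper names (`mark_urls`, `mark_hashtags`, `mark_repeats`) and the sample inputs are made up for illustration and are not part of the original script; the regexes are copied from the code above.

```ruby
# Hypothetical helpers, each wrapping one substitution from the script above.

# Replace URLs, with or without an http(s) scheme, by a <URL> token.
def mark_urls(text)
  text.gsub(/https?:\/\/\S+\b|www\.(\w+\.)+\S*/, "<URL>")
end

# Split a hashtag on uppercase letters, or flag it when it is all caps.
def mark_hashtags(text)
  text.gsub(/#\S+/) do |hashtag|
    body = hashtag[1..-1]
    if body.upcase == body
      "<HASHTAG> #{body} <ALLCAPS>"
    else
      (["<HASHTAG>"] + body.split(/(?=[A-Z])/)).join(" ")
    end
  end
end

# Collapse a run of repeated sentence punctuation into one mark plus <REPEAT>.
def mark_repeats(text)
  text.gsub(/([!?.]){2,}/) { "#{$~[1]} <REPEAT>" }
end

puts mark_urls("see www.example.com/page now")   # => see <URL> now
puts mark_hashtags("#MachineLearning and #NLP")  # => <HASHTAG> Machine Learning and <HASHTAG> NLP <ALLCAPS>
puts mark_repeats("wow!!!")                      # => wow! <REPEAT>
```

Testing the steps separately like this makes it easier to see which gsub is responsible when a token comes out wrong, since every later substitution in the chain runs on the output of the earlier ones.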

cyberosa commented 4 years ago

Hi, I know I'm a bit late to this thread, but in case anyone is interested in collecting new tweets and cleaning them, I can share this Python code. The cleaning code is in a Jupyter notebook because I wanted to show and test each step visually. I hope you find it useful :)

https://github.com/cyberosa/read_tweets_python