Closed ghost closed 9 years ago
Hey David, do you have this up and running somewhere on a URL?
I don't understand: do I have to run a regex over my http://greptweet.com/u/greptweet/greptweet.txt before using it?
Also, could you explain what exactly you are cleaning? Why can't you work with the timestamp, for example?
Sorry for the silly questions!
Sorry - I built this for myself without many docs.
The regex basically strips away the timestamp info prepended to each line/tweet. Like this stuff:
366855325497819137|Mon Aug 12 09:35:00 +0000 2013|
Once that has been stripped, we pass the text into the cleaner.
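To make the stripping step concrete, here is a minimal sketch in Python. It is not the code from the repo; it assumes every line starts with a numeric ID, a pipe, the timestamp, and another pipe, as in the sample above. The name `strip_prefix` and the tweet text are placeholders for illustration.

```python
import re

# Assumed line layout: <numeric id>|<timestamp>|<tweet text>
PREFIX = re.compile(r"^\d+\|[^|]*\|")

def strip_prefix(line: str) -> str:
    """Remove the leading 'id|timestamp|' fields if they are present."""
    return PREFIX.sub("", line, count=1)

# Placeholder tweet text; only the prefix is taken from the sample above.
raw = "366855325497819137|Mon Aug 12 09:35:00 +0000 2013|Example tweet text http://example.com"
print(strip_prefix(raw))
```

Lines that do not match the assumed layout are returned unchanged, so running it twice does no harm.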
The cleaner removes references to anything you want really - but I have customized it to remove certain sites like the Wall Street Journal, the New York Times, etc. - for my own reasons.
I originally built it to clean up a greptweet scrape of the Hacker News Twitter accounts, which contains a mountain of links. Oftentimes the interesting stuff gets drowned out by the 'popular'/common sites like NYT/Ars, etc...
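Roughly, the filtering idea looks like the sketch below. This is an illustration only, not the repo's actual code: it assumes the cleaner drops a whole line when it contains a blocked keyword, and the `BLOCKED` list, the `clean` function, and the sample tweets are all made up for the example.

```python
# Hypothetical blocklist; edit to taste.
BLOCKED = ("wsj.com", "nytimes.com", "arstechnica.com")

def clean(lines, blocked=BLOCKED):
    """Keep only the lines that mention none of the blocked keywords."""
    kept = []
    for line in lines:
        lowered = line.lower()  # case-insensitive substring match
        if not any(word in lowered for word in blocked):
            kept.append(line)
    return kept

# Made-up sample lines with the timestamp prefix already stripped.
tweets = [
    "Interesting small project http://example.org/post",
    "Big story http://www.nytimes.com/some-article",
]
print(clean(tweets))
```

A plain substring match is the simplest option; it will not catch links hidden behind a URL shortener.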
Example of a raw dump here:
https://raw.githubusercontent.com/SoHiggo/Twitter-Cleaner-/master/sample/big.sample.txt
As you can see - a lot of noise (and the timestamps have been removed)
Note: there is some Unicode mangling in the sample, which is not intentional.
Ok, since this isn't an issue, I'm closing it.
... a little tool to cleanup raw greptweet scrapes, and filter out various keywords...
perhaps link it somewhere in the readme or mention it elsewhere to get the word out?
https://github.com/SoHiggo/Twitter-Cleaner-