kaihendry / greptweet

Sane twitter backup and search
https://greptweet.com/
76 stars, 10 forks

Made... #49

Closed ghost closed 9 years ago

ghost commented 9 years ago

... a little tool to clean up raw greptweet scrapes and filter out various keywords...

Perhaps link it somewhere in the README or mention it elsewhere to get the word out?

https://github.com/SoHiggo/Twitter-Cleaner-

kaihendry commented 9 years ago

Hey David, do you have this up and running somewhere on a URL?

ghost commented 9 years ago

http://code.higg.im/twitter.cleaner/

kaihendry commented 9 years ago

I don't understand: do I have to regex my http://greptweet.com/u/greptweet/greptweet.txt before using it?

Also, could you explain what exactly you are cleaning? Why can't you work with the timestamp, for example?

Sorry for the silly questions!

ghost commented 9 years ago

Sorry - I built this for myself without much documentation.

The regex basically strips away the tweet ID and timestamp prepended to each line (one tweet per line). Like this stuff:

366855325497819137|Mon Aug 12 09:35:00 +0000 2013|

Once that has been stripped, the remaining text is passed into the cleaner.
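
A minimal sketch of that stripping step, assuming every line has the layout `<tweet id>|<timestamp>|<tweet text>` as in the example above. The helper name `strip_prefix` and the sample tweet text are made up for illustration; this is not the regex the tool itself uses.

```python
import re

# Assumed layout, taken from the example line above:
#   <numeric tweet id>|<timestamp>|<tweet text>
PREFIX = re.compile(r"^\d+\|[^|]*\|")


def strip_prefix(line: str) -> str:
    """Drop the leading 'id|timestamp|' from one line (hypothetical helper)."""
    return PREFIX.sub("", line, count=1)


# The tweet text here is a placeholder, not real data.
raw = "366855325497819137|Mon Aug 12 09:35:00 +0000 2013|Some tweet text"
print(strip_prefix(raw))
```

Lines that do not start with the assumed prefix are returned unchanged, so running it twice over the same file should be harmless.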

The cleaner removes references to anything you want really, but I have customized it to remove certain sites like the Wall Street Journal, the New York Times, etc., for my own reasons.

I originally built it to clean up a greptweet scrape of the Hacker News Twitter accounts, which contain a mountain of links. Oftentimes the interesting stuff gets drowned out by the 'popular'/common sites like NYT, Ars, etc.
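
The filtering idea could look roughly like the sketch below. It is not the actual Twitter-Cleaner code: the blocklist entries and the `clean` function are hypothetical, and it assumes a tweet is dropped whole whenever it mentions a blocked keyword.

```python
# Hypothetical blocklist; the tool's real list is not shown in this thread.
BLOCKED = ("nytimes.com", "wsj.com", "arstechnica.com")


def clean(lines, blocked=BLOCKED):
    """Keep only the lines that mention none of the blocked keywords.

    Matching is a plain case-insensitive substring test.
    """
    lowered = [keyword.lower() for keyword in blocked]
    return [
        line for line in lines
        if not any(keyword in line.lower() for keyword in lowered)
    ]


# Placeholder tweets, for illustration only.
sample = [
    "Big story http://www.nytimes.com/some-article",
    "Small side project http://example.org/tool",
]
print(clean(sample))
```

Substring matching only catches a site when its name or domain appears in the tweet text itself, so links hidden behind a URL shortener would slip through.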

Example of a raw dump here:

https://raw.githubusercontent.com/SoHiggo/Twitter-Cleaner-/master/sample/big.sample.txt

As you can see, there is a lot of noise (and the timestamps have been removed).

Note: there is some Unicode mangling in the sample, which is not intentional.

kaihendry commented 9 years ago

Ok, since this isn't an issue, I'm closing it.