To really refine the output, we need more advanced parsing of the text. Immediate issues that call for this include:
1. Stripping out all links (requires parsing out relatively complex string patterns).
2. Rejecting sequences that contain tweet-speak ("RT", "u r", etc.).
3. Rejecting phrases with terms in the blacklist.
Thus, we'll need to do more parsing on either the Python side or the Haskell side. Given (1) and the fact that I don't enjoy working in Python, the parsing will happen on the Haskell side.
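
As a rough illustration of what the Haskell side might look like, here is a minimal sketch of the three checks using only `base`. Everything named here is an assumption for illustration: the `tweetSpeak` and `blacklist` word lists are placeholders, link detection is a crude prefix test rather than a real URL parser, and input is assumed to arrive on stdin as one tweet per line.

```haskell
module Main where

import Data.Char  (isAlphaNum, toLower)
import Data.List  (isPrefixOf)
import Data.Maybe (mapMaybe)

-- Placeholder word lists (assumptions, not the project's real lists).
tweetSpeak :: [String]
tweetSpeak = ["rt", "u", "r"]

blacklist :: [String]
blacklist = ["spam"]

-- Crude link test: a whitespace-delimited word starting with a URL-like prefix.
isLink :: String -> Bool
isLink w = any (`isPrefixOf` map toLower w) ["http://", "https://", "www."]

-- Drop every word that looks like a link.
stripLinks :: String -> String
stripLinks = unwords . filter (not . isLink) . words

-- Lower-cased words with punctuation removed, for term matching.
tokens :: String -> [String]
tokens = filter (not . null) . map (filter isAlphaNum . map toLower) . words

-- True if any token of the text appears in the given term list.
containsAny :: [String] -> String -> Bool
containsAny terms = any (`elem` terms) . tokens

-- Strip links, then reject text containing tweet-speak or blacklisted terms.
clean :: String -> Maybe String
clean raw
  | containsAny tweetSpeak text = Nothing
  | containsAny blacklist text  = Nothing
  | otherwise                   = Just text
  where
    text = stripLinks raw

-- Read one tweet per line from stdin and print only the survivors.
main :: IO ()
main = interact (unlines . mapMaybe clean . lines)
```

Matching on whole tokens keeps "r" from rejecting every word that merely contains that letter, but it also means a multi-word blacklist term would need a different check.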
Todo
[ ] Modify the Python code to return the unfiltered Unicode text of tweets, one per line.
Reasons for this change:
Todo