Entity parsing: replace native Twitter entities with a more robust solution

Currently we save the entities that Twitter recognizes and delivers in the Tweet Object to the database. Twitter's entity recognition seems to be rudimentary and sloppy: a lot of entities don't get recognized, and there are redundancies because entities are not disambiguated (e.g. NYC, New York and New York City are all individual entities instead of one).

I propose two steps for improving this:

1) Simple improvement: use spaCy for NER

In the tests I did, spaCy's out-of-the-box NER already proved more reliable than what Twitter gives us.
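To make the first step concrete, here is a minimal sketch of what the extraction could look like. The helper name `extract_entities`, the example tweet, and the choice of the `en_core_web_sm` model are assumptions for illustration, not what the tests above used.

```python
from typing import Callable, List, Tuple


def extract_entities(nlp: Callable, text: str) -> List[Tuple[str, str]]:
    """Return (surface text, label) pairs for every entity the pipeline finds.

    `nlp` is any spaCy pipeline with an NER component (hypothetical helper).
    """
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def main() -> None:
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Made-up example tweet, not taken from our data.
    tweet = "Just landed in NYC, meeting Leo at the Google office tomorrow."
    for text, label in extract_entities(nlp, tweet):
        print(text, label)
```

Calling `main()` prints whatever the small English model recognizes in the tweet. The resulting pairs could be stored in place of the entities from the Tweet Object, though they would still need the disambiguation described in step 2.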

2) Future improvements: develop a more robust approach that uses entity disambiguation and linking and also factors in some of Twitter's peculiarities

In the future, we could use spaCy's entity linking pipeline: https://spacy.io/api/entitylinker
It relies on a knowledge base that we can maintain and update.
We should also add routines that account for Twitter-specific context, for instance:
- When a named entity of the category person is parsed, check whether an account is also mentioned in the tweet, then check that account's bio for a mention of the name. If the name appears there, record that the person is connected to the mentioned account (this is the "I love what Leo is doing right now" followed by a cat named Leo example).
- Ideally, we will have similar routines for all entity types.
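
The person-to-account routine could be sketched roughly as follows. The `Account` class, the function name, and the whole-word matching rule are assumptions for illustration; real account data would come from the Twitter API, and the matching would likely need to be fuzzier.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Account:
    """Hypothetical minimal view of an account mentioned in a tweet."""
    username: str
    bio: str


def link_person_to_account(person_name: str, mentioned: List[Account]) -> Optional[Account]:
    """Return the first mentioned account whose bio contains the person's name.

    The name must appear as a whole word (case-insensitive), so "Leo"
    does not match "Leonardo".
    """
    pattern = re.compile(rf"\b{re.escape(person_name)}\b", re.IGNORECASE)
    for account in mentioned:
        if pattern.search(account.bio):
            return account
    return None


# Made-up data for the "Leo" example.
accounts = [Account("leo_the_cat", "Leo, a cat who likes boxes")]
match = link_person_to_account("Leo", accounts)
print(match.username if match else "no linked account")
```

Returning `None` when no bio mentions the name keeps the routine conservative: the entity is then stored as a plain person without an account link.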
