Currently we save the entities that Twitter recognizes and delivers in the Tweet object to the database. Twitter's entity recognition seems very rudimentary and sloppy: many entities are not recognized, and there are redundancies because entities are not disambiguated (e.g. NYC, New York and New York City are stored as three separate entities instead of one).
I propose two steps for improving this:
1) Simple improvement: use spaCy for NER
In the tests I did, spaCy's out-of-the-box NER already proved more reliable than what Twitter gives us.
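To make step 1 concrete, here is a minimal sketch of what the spaCy-based extraction could look like. The function names `load_pipeline` and `extract_entities` are made up for illustration, and `en_core_web_sm` is only an assumed model choice (it has to be downloaded separately); nothing here is tied to how the app currently stores entities.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a pretrained spaCy pipeline (the model name is an assumption)."""
    import spacy  # imported lazily so the helper below has no hard dependency
    return spacy.load(model_name)


def extract_entities(nlp, tweet_text):
    """Run the pipeline on a tweet and return (entity text, entity label) pairs."""
    doc = nlp(tweet_text)
    return [(ent.text, ent.label_) for ent in doc.ents]
```

With the model installed, `extract_entities(load_pipeline(), tweet_text)` would return the entities spaCy finds in a tweet, which we could save in place of the ones from the Tweet object.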
2) Future improvements: develop a more robust approach that uses entity disambiguation and linking and also factors in some of Twitter's peculiarities
It should use a knowledge base that we can maintain and update.
We should add routines that take Twitter-specific context into account, for instance:
When a named entity of the category person is parsed, check whether an account is also mentioned in the tweet, then check that account's bio for a mention of the name. If the name appears there, establish that the person is connected to the mentioned account (this is the "I love what Leo is doing right now" followed by a cat named Leo example).
Ideally, we will have similar routines for all the different entity types.
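The two ideas in step 2 could be sketched roughly as follows. This is only an illustration under assumptions: `KNOWLEDGE_BASE`, `canonicalize`, `Account` and `link_person_to_account` are hypothetical names, the knowledge base is a plain alias dictionary rather than a real linking model, and the bio check is a simple whole-word match.

```python
import re
from dataclasses import dataclass

# Hypothetical, hand-maintained knowledge base: lowercased alias -> canonical entity.
KNOWLEDGE_BASE = {
    "nyc": "New York City",
    "new york": "New York City",
    "new york city": "New York City",
}


def canonicalize(surface_form):
    """Map a surface form to its canonical entity; unknown forms pass through unchanged."""
    return KNOWLEDGE_BASE.get(surface_form.strip().lower(), surface_form)


@dataclass
class Account:
    """Assumed minimal shape of an account mentioned in a tweet."""
    handle: str
    bio: str


def link_person_to_account(person_name, mentioned_accounts):
    """Return the first mentioned account whose bio contains the name as a whole word."""
    name = person_name.strip()
    if not name:
        return None
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    for account in mentioned_accounts:
        if pattern.search(account.bio):
            return account
    return None
```

For the Leo example, calling `link_person_to_account("Leo", accounts)` with the accounts mentioned in the tweet would return the cat's account if its bio mentions "Leo", and `canonicalize` would collapse NYC, New York and New York City into one entity. A real version would need fuzzier matching and a richer knowledge base than a dictionary.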
I can also help with the second step, at least once we have encapsulated the parsing in the API. I've written things like this in Python; I just wouldn't want to mess around with the app.