rooteco / tweetscape-streams


Entity parsing: replace native twitter entities with more robust solution #6

Open julianfleck opened 2 years ago

julianfleck commented 2 years ago

Currently we save the entities that Twitter recognizes and delivers in the Tweet Object to the database. Twitter's entity recognition seems very rudimentary and sloppy: many entities don't get recognized, and there are redundancies because entities are not disambiguated (e.g. NYC, New York and New York City are all individual entities instead of one).

I propose two steps for improving this:

1) Simple improvement: use spaCy for NER (named entity recognition)

[Screenshot, 2022-09-13 21:26:19]

2) Future improvements: develop a more robust approach that uses entity disambiguation and linking and also factors in some of Twitter's peculiarities
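
To make the two steps concrete, here is a rough Python sketch, not a finished design. It assumes spaCy and its `en_core_web_sm` model are installed. The names `load_pipeline`, `extract_entities`, `canonicalize`, `count_canonical` and the `ALIASES` table are made up for illustration. The alias lookup only stands in for real entity linking, which would resolve mentions against a knowledge base instead of a hand-written dictionary.

```python
from collections import Counter

# Hypothetical alias table (illustration only); real entity linking would
# resolve mentions against a knowledge base rather than a fixed dictionary.
ALIASES = {
    "nyc": "New York City",
    "new york": "New York City",
    "new york city": "New York City",
}


def load_pipeline(model_name="en_core_web_sm"):
    """Load a pretrained spaCy pipeline (assumes spaCy and the model are installed)."""
    import spacy  # imported lazily so the helpers below can be used without spaCy

    return spacy.load(model_name)


def extract_entities(nlp, text):
    """Step 1: return (surface text, label) pairs for entities found in a tweet."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def canonicalize(surface):
    """Step 2 (simplified): map a surface form to a single canonical name."""
    # Strip a leading '#' or '@' and collapse whitespace, since tweets often
    # carry entities as hashtags or mentions.
    cleaned = " ".join(surface.strip().lstrip("#@").split())
    return ALIASES.get(cleaned.lower(), cleaned)


def count_canonical(entities):
    """Count entities by canonical name so aliases collapse into one entry."""
    return Counter(canonicalize(text) for text, _label in entities)
```

Something like `count_canonical(extract_entities(load_pipeline(), tweet_text))` would then give one count per canonical entity. Which entities spaCy actually finds depends on the model, so treat the extraction results as something to evaluate on real tweets.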

julianfleck commented 2 years ago

I can also help with the second step, at least once the parsing is encapsulated in the API. I've written stuff like this in Python; I just wouldn't want to mess around with the app.