osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

Optimize tweet insertion #17

Closed glciampaglia closed 4 years ago

glciampaglia commented 6 years ago

Switch to batch insertion

shaochengcheng commented 6 years ago

Description

The purpose of tweet parsing is to split a tweet's JSON data into several associated data structures, e.g., tweet, url, twitter_user, hashtag, etc. When saving these data structures into a database, the relationships between tables must also be taken into account. For one-to-one relationships (e.g., table tweet vs. table ass_tweet) and one-to-many relationships (e.g., table twitter_user vs. table tweet), we must first finish inserting the tables with no foreign key dependencies (e.g., table twitter_user), because the foreign keys are not yet known. After that insertion the foreign keys become known, and we can use them to insert the dependent tables (e.g., table tweet). For many-to-many relationships (e.g., table tweet vs. table url), things are a little more complicated: an intermediate table is needed (e.g., table ass_tweet_url), which is many-to-one with each of the two tables it associates. To insert the URLs of a tweet, we first finish inserting into the tweet table and the url table, then fetch the inserted primary ids and insert them into the ass_tweet_url table accordingly. A schema sketch illustrating these relationships is given below.
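To make the relationships concrete, here is a minimal sketch in SQLAlchemy's declarative style. It is not the actual Hoaxy schema; the table names follow the description above, while the column names and types are illustrative assumptions.

```python
# Hypothetical, simplified schema showing the relationships discussed above.
from sqlalchemy import Column, Integer, BigInteger, String, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class TwitterUser(Base):
    __tablename__ = 'twitter_user'
    id = Column(Integer, primary_key=True)
    raw_id = Column(BigInteger, unique=True)        # Twitter's user id
    tweets = relationship('Tweet', back_populates='user')  # one-to-many

class Tweet(Base):
    __tablename__ = 'tweet'
    id = Column(Integer, primary_key=True)
    raw_id = Column(BigInteger, unique=True)        # Twitter's tweet id
    user_id = Column(Integer, ForeignKey('twitter_user.id'))
    user = relationship('TwitterUser', back_populates='tweets')

class AssTweet(Base):
    __tablename__ = 'ass_tweet'                     # one-to-one with tweet
    id = Column(Integer, ForeignKey('tweet.id'), primary_key=True)
    retweeted_status_id = Column(BigInteger)

class Url(Base):
    __tablename__ = 'url'
    id = Column(Integer, primary_key=True)
    raw = Column(String, unique=True)

class AssTweetUrl(Base):
    __tablename__ = 'ass_tweet_url'                 # bridge for many-to-many
    id = Column(Integer, primary_key=True)
    tweet_id = Column(Integer, ForeignKey('tweet.id'))
    url_id = Column(Integer, ForeignKey('url.id'))
```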

One-per-Time Implementation

In the Twitter stream, tweets are received one by one, so the straightforward approach is to parse and save each tweet as it arrives. In this implementation, the parsing and saving operations are interleaved. For example, when we parse the data needed for twitter_user, we immediately insert it and obtain its primary id from the database; the subsequent parsing and saving steps then use this primary id wherever necessary. A sketch of this flow is given below.
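For illustration, a hedged sketch of this per-tweet flow using the SQLAlchemy ORM and the hypothetical schema above; every `flush()` is a database round trip needed only to learn a primary id.

```python
# One-per-time flow (sketch): parse and save a single tweet, interleaving
# inserts so that each dependent row can reference a known primary id.
def save_one(session, tweet_json):
    # get-or-create the user so its primary id is known
    user = session.query(TwitterUser).filter_by(
        raw_id=tweet_json['user']['id']).one_or_none()
    if user is None:
        user = TwitterUser(raw_id=tweet_json['user']['id'])
        session.add(user)
        session.flush()              # round trip: obtain user.id

    tweet = Tweet(raw_id=tweet_json['id'], user_id=user.id)
    session.add(tweet)
    session.flush()                  # round trip: obtain tweet.id

    for u in tweet_json['entities']['urls']:
        url = session.query(Url).filter_by(
            raw=u['expanded_url']).one_or_none()
        if url is None:
            url = Url(raw=u['expanded_url'])
            session.add(url)
            session.flush()          # round trip: obtain url.id
        session.add(AssTweetUrl(tweet_id=tweet.id, url_id=url.id))

    session.commit()
```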

Bulk Implementation

As described above, saving the objects parsed from a tweet's JSON data into the database requires many database queries. The throughput of the one-per-time implementation is therefore very limited: it cannot consume a large number of tweets in a short time. When consuming the Twitter stream, tweets must be processed fast enough to keep the streaming connection alive. The current implementation works around this by buffering incoming tweets in a queue before parsing them, but the one-per-time implementation still generates so many queries that it may overload the shared database server. Moreover, when we need to reparse tweets (e.g., after adding new tables or fixing a bug), the parser's performance becomes the bottleneck.

Therefore, we propose this bulk implementation, in which the parsing and saving operations are separated. The parsing step splits a tweet into its different objects without touching the database, so we can parse a large block of tweets and merge the parsed objects of the same kind together. For each kind of parsed object, the saving step then takes the whole block and saves it into the database with a single query. Note that the saving step must still handle the tables with foreign keys: the foreign-key-free tables go in first, and their primary ids are fetched to fill in the dependent tables. A sketch of this flow follows.
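A minimal sketch of the bulk flow, again using the hypothetical schema above and assuming PostgreSQL with SQLAlchemy Core; the parsed objects arrive as plain lists of dicts keyed by Twitter's raw ids, the foreign-key-free tables are inserted with one statement each, and the primary ids are then looked up to fill in the dependent tables.

```python
# Bulk flow (sketch): parsing is database-free; saving uses one statement
# per table, inserting foreign-key-free tables first.
from sqlalchemy import select
from sqlalchemy.dialects.postgresql import insert

def save_bulk(conn, users, tweets, urls, tweet_urls):
    # 1. tables with no foreign keys, one executemany each;
    #    rows that already exist are skipped
    conn.execute(insert(TwitterUser.__table__)
                 .on_conflict_do_nothing(index_elements=['raw_id']), users)
    conn.execute(insert(Url.__table__)
                 .on_conflict_do_nothing(index_elements=['raw']), urls)

    # 2. the primary ids are now known; map raw ids -> primary ids
    #    (in practice the SELECTs would be filtered to this batch's raw ids)
    uid = dict(conn.execute(select(TwitterUser.raw_id, TwitterUser.id)))

    # 3. fill in the foreign keys and bulk-insert the dependent table
    for t in tweets:
        t['user_id'] = uid[t.pop('user_raw_id')]
    conn.execute(insert(Tweet.__table__)
                 .on_conflict_do_nothing(index_elements=['raw_id']), tweets)

    # 4. resolve tweet and url ids, then bulk-insert the bridge table
    tid = dict(conn.execute(select(Tweet.raw_id, Tweet.id)))
    lid = dict(conn.execute(select(Url.raw, Url.id)))
    rows = [{'tweet_id': tid[r['tweet_raw_id']], 'url_id': lid[r['url_raw']]}
            for r in tweet_urls]
    if rows:
        conn.execute(insert(AssTweetUrl.__table__), rows)
```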

Work in progress

filmenczer commented 4 years ago

@shaochengcheng will update the requirements: pandas, networkx, and newspaper3k will stay pinned with '==' while the others will change to '>='.

filmenczer commented 4 years ago

@chathuriw will update the server after this update is pushed to master.

filmenczer commented 4 years ago

It seems everything is working fine; no errors. Closing. @shaochengcheng let us know if there are any remaining tasks related to this issue.