Switch to batch insertion
The purpose of tweet parsing is to split a tweet's JSON data into several associated data structures, e.g., tweet, url, twitter_user, hashtag, etc. If we want to save these data structures into a database, the relationships between the tables must also be taken into account. For one-to-one relationships (e.g., table tweet vs. table ass_tweet) and one-to-many relationships (e.g., table twitter_user vs. table tweet), we should first insert into the tables that have no foreign key dependencies (e.g., table twitter_user), because the foreign key values are not yet known. After that insertion, the foreign keys become known, and we can use them to insert into the dependent tables (e.g., table tweet). For a many-to-many relationship (e.g., table tweet vs. table url), things are a little more complicated: an intermediate association table is needed (e.g., table ass_tweet_url), which is many-to-one to each of the two tables it associates. To insert the URLs of a tweet, we first need to finish inserting into the tweet table and the url table, then fetch the inserted primary ids and insert the corresponding pairs into the ass_tweet_url table.
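For illustration, here is a minimal sketch of this insertion order using Python's built-in sqlite3 module. The schema and values are simplified stand-ins for the real tables, not the project's actual DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE twitter_user (id INTEGER PRIMARY KEY, raw_id TEXT UNIQUE);
CREATE TABLE tweet (id INTEGER PRIMARY KEY, raw_id TEXT UNIQUE,
                    user_id INTEGER REFERENCES twitter_user(id));
CREATE TABLE url (id INTEGER PRIMARY KEY, expanded TEXT);
CREATE TABLE ass_tweet_url (tweet_id INTEGER REFERENCES tweet(id),
                            url_id INTEGER REFERENCES url(id));
""")
cur = conn.cursor()

# 1. Insert into the table with no foreign key dependencies first.
cur.execute("INSERT INTO twitter_user (raw_id) VALUES (?)", ("12345",))
user_id = cur.lastrowid  # the primary id is now known

# 2. Use the now-known primary id to insert into the dependent table.
cur.execute("INSERT INTO tweet (raw_id, user_id) VALUES (?, ?)",
            ("67890", user_id))
tweet_id = cur.lastrowid

# 3. Many-to-many: insert the other side of the relationship ...
cur.execute("INSERT INTO url (expanded) VALUES (?)", ("http://example.com",))
url_id = cur.lastrowid

# 4. ... then record the pair in the association table.
cur.execute("INSERT INTO ass_tweet_url (tweet_id, url_id) VALUES (?, ?)",
            (tweet_id, url_id))
conn.commit()
```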
In Twitter streaming, tweets arrive one at a time, so the straightforward approach is to parse and save each tweet individually. In that implementation, the parsing and saving operations are interleaved: for example, once we have parsed the necessary twitter_user data, we immediately insert it and get its primary id from the database, and the subsequent parsing and saving steps use this primary id when necessary.
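Expressed as a function, the interleaved flow might look like the following (a hypothetical handler over the simplified schema above; the project's actual code differs):

```python
def parse_and_save_one(conn, tweet_json):
    """Parse one tweet and save it, interleaving parsing and inserts."""
    cur = conn.cursor()
    # Parse the user and insert it (skipped if already present) ...
    cur.execute("INSERT OR IGNORE INTO twitter_user (raw_id) VALUES (?)",
                (tweet_json["user"]["id_str"],))
    # ... then immediately query its primary id, because the very next
    # (dependent) insert already needs it.
    user_id = cur.execute("SELECT id FROM twitter_user WHERE raw_id = ?",
                          (tweet_json["user"]["id_str"],)).fetchone()[0]
    cur.execute("INSERT OR IGNORE INTO tweet (raw_id, user_id) VALUES (?, ?)",
                (tweet_json["id_str"], user_id))
    conn.commit()
```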
From the above, we can see that saving the parsed objects of a tweet's JSON data into the database requires many database query operations. The performance of this one-at-a-time implementation is therefore limited: it cannot consume a large number of tweets in a short time. In Twitter streaming, tweets must be consumed fast enough to keep the stream connection alive. To mitigate this, the current implementation uses a queue to buffer incoming tweets before parsing them. Even so, the one-at-a-time implementation generates so many queries that it may overload the shared database server. Moreover, when we want to reparse the tweets (e.g., after adding new tables or fixing bugs), the parser's performance becomes the bottleneck.
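The queue-based buffering has roughly the following shape (a sketch only; function names are illustrative and the real stream consumer differs in detail):

```python
import queue
import threading

tweet_queue = queue.Queue()  # buffers tweets arriving from the stream

def on_tweet(raw_tweet):
    """Called by the streaming client for each tweet; it must return
    quickly to keep the stream connection alive, so it only enqueues."""
    tweet_queue.put(raw_tweet)

def consume(conn):
    """Drains the queue in a worker thread, one tweet per iteration."""
    while True:
        tweet_json = tweet_queue.get()
        parse_and_save_one(conn, tweet_json)  # handler sketched above
        tweet_queue.task_done()

# Note: for sqlite3, a connection shared with a worker thread must be
# created with check_same_thread=False.
threading.Thread(target=consume, args=(conn,), daemon=True).start()
```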
Therefore, we propose this bulk implementation, in which the parsing and saving operations are separated. The parsing operation splits a tweet into different objects and has no interaction with the database. This way, we can parse a large block of tweets and merge the parsed objects of the same kind together. For each kind of parsed object, the saving operation then takes the whole block and saves it into the database with a single query. Note that the saving operation must still respect the foreign key dependencies between tables.
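A minimal sketch of this separation over the same simplified schema (parse_block and save_block are illustrative names; the real implementation handles many more tables):

```python
def parse_block(tweet_jsons):
    """Parse a block of tweets into per-kind collections,
    without touching the database."""
    users, tweets = {}, []
    for tj in tweet_jsons:
        users[tj["user"]["id_str"]] = (tj["user"]["id_str"],)
        tweets.append((tj["id_str"], tj["user"]["id_str"]))
    return users, tweets

def save_block(conn, users, tweets):
    """Save each kind of parsed object with one bulk query,
    still respecting foreign key order."""
    cur = conn.cursor()
    # 1. Bulk-insert the table without foreign key dependencies.
    cur.executemany("INSERT OR IGNORE INTO twitter_user (raw_id) VALUES (?)",
                    users.values())
    # 2. Fetch the now-known primary ids in a single query
    #    (simplified: a real query would restrict to this block's users).
    id_map = dict(cur.execute("SELECT raw_id, id FROM twitter_user"))
    # 3. Bulk-insert the dependent table using the id map.
    cur.executemany("INSERT OR IGNORE INTO tweet (raw_id, user_id) VALUES (?, ?)",
                    [(tid, id_map[uid]) for tid, uid in tweets])
    conn.commit()
```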
@shaochengcheng will update the requirements: pandas, networkx, and newspaper3k will stay pinned with '==', and the others will change to '>='.
@chathuriw will update the server after this change is pushed to master.
It seems everything is working fine; no errors. Closing. @shaochengcheng let us know if there are any remaining tasks related to this issue.