filmenczer closed this issue 4 years ago
We discussed the issue. Chengcheng mentioned that the association table between tweets and URLs (ass_tweet_url) is not complete: he discovered that parsing failed on some tweets that do include URLs. So we cannot rely on it for purging the extra tweets. Instead, we should write an SQL query that checks whether the raw JSON includes a URL or not. This can be done directly in PostgreSQL. Giovanni is going to work with Zac on how to do so.
Assigned @shaochengcheng again, as he explained that this is a more complex job than we first thought. We also need to delete tweet rows that contain a URL, but one that does not point to one of our sources. To do this we must join with the URL table, because the tweet table only has shortened URLs. However, since the old parser failed in some cases, there may be URLs that we want that are not in the URL table. Therefore, we first need to check with the new parser whether missing URLs need to be added. Then we can do the join and clean up the tweet table.
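The join described above might look roughly like the sketch below. All table and column names here are assumptions about the schema; in particular, `site` with a `domain` column is a hypothetical stand-in for wherever our tracked source domains live.

```sql
-- Remove tweets that have parsed URLs, but none pointing at a tracked source.
DELETE FROM tweet t
WHERE EXISTS (                      -- the tweet has at least one parsed URL...
    SELECT 1
    FROM ass_tweet_url atu
    WHERE atu.tweet_id = t.id
)
AND NOT EXISTS (                    -- ...but none matches a source domain
    SELECT 1
    FROM ass_tweet_url atu
    JOIN url u  ON u.id = atu.url_id
    JOIN site s ON u.expanded_url LIKE '%' || s.domain || '%'
    WHERE atu.tweet_id = t.id
);
```

The `EXISTS` guard keeps this from also deleting tweets with no URL at all, which are handled separately by the raw-JSON check.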
We should also check that if a tweet DOES contain a URL but it is not in a target domain, then that URL is neither parsed nor stored.
All that remains to be done is to clean up the old tweets in the DB that have no URL.
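The cleanup could then be a delete keyed on the raw JSON (same assumption as before: a hypothetical jsonb column `json_data` holding the raw tweet), followed by removing orphaned association rows if the foreign key does not cascade:

```sql
-- Drop tweets whose raw JSON contains no URLs.
-- Column and table names are assumptions about the schema.
DELETE FROM tweet
WHERE COALESCE(jsonb_array_length(json_data -> 'entities' -> 'urls'), 0) = 0;

-- Remove association rows left pointing at deleted tweets
-- (unnecessary if the FK is declared ON DELETE CASCADE).
DELETE FROM ass_tweet a
WHERE NOT EXISTS (SELECT 1 FROM tweet t WHERE t.id = a.tweet_id);
```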
@shaochengcheng it seems we could proceed with this, now that the new tweet parser is in place. Can @chathuriw be of assistance?
In total, 10,747,535 tweets were deleted; two tables were affected: tweet and ass_tweet.
No vacuum operation was performed; if we need one, please run it. I am closing this issue now.
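For reference, reclaiming the dead-row space after a bulk delete of this size would be:

```sql
-- VACUUM reclaims space from dead rows; ANALYZE refreshes planner statistics.
VACUUM ANALYZE tweet;
VACUUM ANALYZE ass_tweet;
```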
Per @shaochengcheng, we can eliminate tweets without URLs in the existing database, and prevent storing them in the future.