osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0
139 stars 44 forks source link

Reduce size of tweet table #16

Closed filmenczer closed 4 years ago

filmenczer commented 6 years ago

Per @shaochengcheng we can eliminate tweets without URLs in the existing database, and prevent storing them in the future.

glciampaglia commented 6 years ago

We discussed the issue. Chengcheng mentioned that the association table between tweets and urls (ass_tweet_url) is not complete, since he discovered that the parsing failed on some tweets that do include URLs. So we cannot rely on it for purging the extra tweets. Instead, we should write an SQL query that checks whether the raw JSON includes a URLs or not. This can be done using PostgreSQL query language directly. Giovanni is going to work with Zac on how to do so.

filmenczer commented 5 years ago

Assigned @shaochengcheng again as he explained that this is a more complex job than we thought at first. We also need to delete tweet rows in which there is a URL but it is not to one of our sources. To do this we must join with the URL table, because the tweet table only has shortened URLs. However, as the old parser failed in some cases, there may be URLs that we want and that are not in the URL table. Therefore, first we need to check with the new parser if missing URLs need to be added. Then we can do the join and clear up the tweet table.

ZacMilano commented 5 years ago

We should also check that if the tweet DOES contain a URL but it is not in a target domain, then that URL should not be parsed nor stored.

filmenczer commented 5 years ago

All that remains to be done is to clean up old tweets in DB without a URL

filmenczer commented 4 years ago

@shaochengcheng it seems we could proceed with this, now that the new tweet parser is in place. Can @chathuriw be of assistance?

shaochengcheng commented 4 years ago

In total, 10,747,535 tweets are deleted, two tables are affected that are tweet and ass_tweet.

No vacuum operation was taken, if we do need, please run it. Now I close this issue.