osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

Avoid duplicated article text in database #3

Closed (glciampaglia closed this issue 6 years ago)

glciampaglia commented 6 years ago

Currently, Hoaxy extracts every hyperlink from each tweet collected from the Twitter stream and puts it in the url table. Hoaxy then parses each raw URL collected this way and stores the full HTML of the page in the url table, along with its canonical URL. Since several raw URLs can share one canonical URL, this creates a lot of duplicate content and is not an efficient use of space.

To overcome this, we will alter two tables: we will remove the html column from the url table and add it to the article table, which is the one that holds the canonical URL of each article.

PRIORITY: 2
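A minimal sketch of the table change described above, using Python's built-in sqlite3 module so it runs standalone. Everything beyond the `url` and `article` table names and the `html` column is an assumption made for illustration: the `id`, `raw`, and `canonical` columns, the helper names `build_demo_db` and `move_html_to_article`, and the sample rows are hypothetical, and the real Hoaxy schema and database engine may differ.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed pre-migration layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE article (
            id INTEGER PRIMARY KEY,
            canonical TEXT UNIQUE NOT NULL
        );
        CREATE TABLE url (
            id INTEGER PRIMARY KEY,
            raw TEXT NOT NULL,
            canonical TEXT,
            html TEXT
        );
        """
    )
    conn.execute(
        "INSERT INTO article (canonical) VALUES ('http://example.com/story')"
    )
    page = "<html>story</html>"
    # Two raw URLs pointing at the same article, each carrying its own
    # copy of the HTML: this is the duplication the issue describes.
    conn.executemany(
        "INSERT INTO url (raw, canonical, html) VALUES (?, ?, ?)",
        [
            ("http://example.com/story?utm=a", "http://example.com/story", page),
            ("http://short.example/x1", "http://example.com/story", page),
        ],
    )
    conn.commit()
    return conn


def move_html_to_article(conn: sqlite3.Connection) -> None:
    """Store one HTML copy per article, then drop html from url."""
    conn.execute("ALTER TABLE article ADD COLUMN html TEXT")
    # Keep a single copy of the HTML per canonical URL (lowest url.id wins).
    conn.execute(
        """
        UPDATE article
        SET html = (
            SELECT u.html FROM url AS u
            WHERE u.canonical = article.canonical AND u.html IS NOT NULL
            ORDER BY u.id
            LIMIT 1
        )
        """
    )
    # Rebuild url without the html column. Copying into a new table works
    # on SQLite versions that lack ALTER TABLE ... DROP COLUMN.
    conn.execute(
        "CREATE TABLE url_new ("
        "id INTEGER PRIMARY KEY, raw TEXT NOT NULL, canonical TEXT)"
    )
    conn.execute(
        "INSERT INTO url_new (id, raw, canonical) "
        "SELECT id, raw, canonical FROM url"
    )
    conn.execute("DROP TABLE url")
    conn.execute("ALTER TABLE url_new RENAME TO url")
    conn.commit()
```

Calling `move_html_to_article(build_demo_db())` leaves both raw URLs in `url` but only one copy of the page HTML, stored on the `article` row. On a production database, a change like this would more likely be run as a reviewed migration script after a backup than as ad hoc statements.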

Steps:

I (@shaochengcheng) am working on the second step now. I prefer to handle it on my own, because there are so many small things to take care of.

glciampaglia commented 6 years ago

It looks like these tables are touched in many different parts of the code. We should break this task down into multiple smaller tasks. @shaochengcheng, please list here all the parts that need to be changed, so that we can split the work between the two of us.

shaochengcheng commented 6 years ago

The work is done. The update is running on the server; let's wait and see if there are any problems.

glciampaglia commented 6 years ago

The space has been freed on the server, so closing.