paul-tqh-nguyen / arxiv_as_a_newspaper

arxiv.org portrayed as if it were a newspaper.

Implement Iterative DB Update #7

Closed paul-tqh-nguyen closed 5 years ago

paul-tqh-nguyen commented 5 years ago

Instead of clearing the whole DB at once after gathering all the data, we should handle one field at a time: once we've gathered the new docs for a field, clear only that field's existing entries and immediately write the new docs for that field to the DB.
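A rough sketch of what the per-field update could look like. This is only an illustration, not the project's code: it uses an in-memory SQLite table, and the `papers` table, its columns, and the `gather` callable are all hypothetical names chosen for the example. The point is the ordering: gather a field's docs first, then delete and re-insert that field's rows in one transaction, so a failed scrape leaves the old rows untouched.

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple

Doc = Tuple[str, str]  # (title, url); a hypothetical document shape


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table; the real project's storage layout may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "field TEXT NOT NULL, title TEXT NOT NULL, url TEXT NOT NULL)"
    )


def update_field(conn: sqlite3.Connection, field: str, docs: Iterable[Doc]) -> None:
    """Replace only the rows belonging to `field` with `docs`."""
    rows = [(field, title, url) for title, url in docs]
    # `with conn` commits on success and rolls back if an exception is raised,
    # so the delete and the inserts succeed or fail together.
    with conn:
        conn.execute("DELETE FROM papers WHERE field = ?", (field,))
        conn.executemany(
            "INSERT INTO papers (field, title, url) VALUES (?, ?, ?)", rows
        )


def run_iterative_update(
    conn: sqlite3.Connection,
    fields: Iterable[str],
    gather: Callable[[str], List[Doc]],
) -> None:
    """Update the DB one field at a time instead of clearing everything up front."""
    for field in fields:
        # Gather first: if this raises, the field's existing rows are not touched.
        docs = gather(field)
        update_field(conn, field, docs)
```

With this shape, a failure while gathering one field stops the run but leaves every other field's rows in the DB, which is the main benefit over clearing everything before scraping.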

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/3e811b42f4f3c75f0ad5e450b8e9e801f661eb0d

In an effort to implement iterative ETL updates, we've been manually testing our ETL process.

We've found and addressed these issues:

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/0e1b466eda9428bf5dab949e3556a7bb5a460e93

While working on #7, we realized that the main arXiv site was very sensitive to scraping and would occasionally block us.

We solved that in c60a2b2 by hitting the mirror sites instead, which are less likely to block us.

This patch updates the README to make that clearer.
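For illustration, one way to prefer mirrors and move on when a host blocks us is sketched below. This is not what c60a2b2 does, just a minimal example under stated assumptions: `fetch_with_fallback` and `http_get` are hypothetical helpers, and the list of base URLs is supplied by the caller (this thread does not say which mirror hosts are used).

```python
import urllib.request
from typing import Callable, Optional, Sequence


def http_get(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_fallback(
    path: str,
    base_urls: Sequence[str],
    fetch: Callable[[str], str] = http_get,
) -> str:
    """Try each base URL in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for base in base_urls:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            return fetch(url)
        except OSError as error:
            # urllib.error.URLError (and HTTPError) subclass OSError, as do
            # socket timeouts, so a blocked or unreachable host lands here.
            last_error = error
    raise RuntimeError(f"all hosts failed for {path!r}") from last_error
```

Listing the mirrors first and the main site last (or not at all) keeps most traffic off the host that is most likely to block us.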

paul-tqh-nguyen commented 5 years ago

In the project's current state, the initial iterative-update functionality is implemented.

We'll close this issue for now and address bugs as they come up.

Manual testing will continue.