Implement Iterative DB Update

paul-tqh-nguyen commented 5 years ago

Instead of clearing the whole DB at once after we've gathered all the data, we should clear only the docs that correspond to a given field, and only after we've gathered the new docs for that field; we would then immediately update the DB with the new docs for that field.
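A minimal sketch of the per-field update described above, assuming an in-memory list of dicts as a stand-in for the real DB. The names `update_db_iteratively` and `gather_docs_for_field` and the `"field"` key are hypothetical and not taken from the repo.

```python
from typing import Callable, Dict, Iterable, List

Doc = Dict[str, str]


def update_db_iteratively(
    db: List[Doc],
    fields: Iterable[str],
    gather_docs_for_field: Callable[[str], List[Doc]],
) -> None:
    """Replace docs one field at a time instead of wiping the whole DB."""
    for field in fields:
        # Gather first: if this raises, the old docs for the field stay in place.
        new_docs = gather_docs_for_field(field)
        # Clear only the docs that belong to this field...
        db[:] = [doc for doc in db if doc["field"] != field]
        # ...then immediately load the replacements for that field.
        db.extend(new_docs)


# Hypothetical usage with made-up docs.
db = [
    {"field": "cs", "title": "old cs paper"},
    {"field": "math", "title": "old math paper"},
]
fresh = {"cs": [{"field": "cs", "title": "new cs paper"}]}
update_db_iteratively(db, ["cs"], lambda field: fresh[field])
```

Because each field is gathered before its old docs are cleared, a failure partway through leaves every field not yet processed with its previous docs intact.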

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/3e811b42f4f3c75f0ad5e450b8e9e801f661eb0d

In an effort to implement iterative ETL updates, we've been manually testing our ETL process.

We've found and addressed these issues:

The main entry point for https://arxiv.org/ doesn't like bots (and we don't want to upset their admins), but their mirrored sites are fine with bots; we now hit the mirror sites instead. See the changes to _arxiv_base_url in extract_transform_utilities.py.
To solve the same problem, we now wait some number of seconds prior to hitting a URL. See SECONDS_TO_SLEEP_PRIOR_TO_HITTING_URL in extract_transform_utilities.py.
There were also issues with authentication, so we abstracted out functionality for handling that to a higher level. See changes to _arxiv_recent_papers_collection, _ensure_that_collection_is_valid_wrt_authentication, arxiv_recent_paper_docs_as_dicts, etc. in load_utilities.py.
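The delay-before-request idea from the list above could look roughly like the sketch below. Only the constant name SECONDS_TO_SLEEP_PRIOR_TO_HITTING_URL comes from the patch; its value here, the helper `fetch_politely`, the injectable `fetch` and `sleep` parameters, and the example URL are all assumptions for illustration.

```python
import time
from typing import Callable

# Name taken from the patch; the value is an assumption.
SECONDS_TO_SLEEP_PRIOR_TO_HITTING_URL = 3


def fetch_politely(
    url: str,
    fetch: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
    seconds: float = SECONDS_TO_SLEEP_PRIOR_TO_HITTING_URL,
) -> str:
    """Wait a fixed number of seconds, then fetch the URL."""
    sleep(seconds)
    return fetch(url)


# Hypothetical usage with fake fetch/sleep functions, so nothing hits the network.
events = []


def fake_sleep(seconds: float) -> None:
    events.append(("sleep", seconds))


def fake_fetch(url: str) -> str:
    events.append(("fetch", url))
    return "<html>"


page = fetch_politely("https://example.org/list", fetch=fake_fetch, sleep=fake_sleep)
```

Passing `fetch` and `sleep` in as parameters is only a convenience for testing; the actual code in extract_transform_utilities.py may be structured differently.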

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/0e1b466eda9428bf5dab949e3556a7bb5a460e93

While working on #7, we realized that the main arXiv site was very sensitive to scraping and would occasionally block us.

We solved that in c60a2b2 by hitting the mirror sites instead, which are less likely to block us.

This patch updates the README to make that clearer.

paul-tqh-nguyen commented 5 years ago

As of the current state of the project, the initial functionality for iterative updates is implemented.

We'll close this issue for now and address bugs as they come up.

Manual testing will continue.
